[Bacula-users] Various problems during recovery

Jacek Konieczny Tue, 19 Apr 2005 07:05:24 -0700

Hello,


I have set up backups of several of my machines via Bacula months ago.
And that works quite well. At the very beginning I have also done a test
recovery. That was full recovery of a small server and worked perfect.

However, on Friday we got a real-life problem -- disk failure on
a production server. Some of data had to be restored from a bacula
backup. We have the last full backup on volume W0006 and the last
incremental backup on volume K0003. It seemed the recovery will be is.
But it was not.

We didn't want to recover everything at once, because we didn't have the
new disk yet. We started a restore job only for files from the main
filesystem (/, /usr, other were safe on the other disk). 

The server with failed disk was AMD64, to restore the root filesystem we
have installed the target disk temporarily in a x86 workstation and
configured bacula-fd on the workstation as it was configured on the
server.

----------- problem 1. -----------

When the restore job (most recent backup for a client) was started
Bacula requested tape K0003 (the incremental one -- IMHO strange). The
most recent files on the backup were restored, then we got error:

15-Apr 11:29 master-sd: End of Volume at file 169 on device /dev/nst0, Volume 
"K0003"

and a bit later:

15-Apr 11:31 master-sd: RestoreFiles.2005-04-15_11.25.50 Error: block.c:782 
Read error at file:blk 0:0 on device /dev/nst0. ERR=Input/output error.

Bacula didn't ask us for the second tape, but the SD seemed to be locked
in "Drive is initializing"/"device open, but not mounted" state. I was
not able to change the tape manually or do any mount/umount command.

In dmesg we got:
Apr 15 11:31:40 master kernel: scsi0:0:6:0: Attempting to abort cmd f7a5fb00: 
0x1e 0x0 0x0 0x0 0x1 0x0
Apr 15 11:31:40 master kernel: scsi0: At time of recovery, card was not paused
Apr 15 11:31:40 master kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins 
<<<<<<<<<<<<<<<<<
Apr 15 11:31:40 master kernel: scsi0: Dumping Card State at program address 0x2 
Mode 0x22
[...]
Apr 15 11:31:40 master kernel: DevQ(0:6:0): 0 waiting
Apr 15 11:31:40 master kernel: (scsi0:A:6:0): Device is disconnected, 
re-queuing SCB
Apr 15 11:31:40 master kernel: Recovery code sleeping
Apr 15 11:31:40 master kernel: (scsi0:A:6:0): Abort Message Sent
Apr 15 11:31:40 master kernel: (scsi0:A:6:0): SCB 3 - Abort Tag Completed.
Apr 15 11:31:40 master kernel: Recovery SCB completes
Apr 15 11:31:40 master kernel: found == 0x1
Apr 15 11:31:40 master kernel: Recovery code awake
Apr 15 11:31:42 master kernel: st0: Error with sense data: Current st0: sense = 
70  2
Apr 15 11:31:42 master kernel: ASC=3a ASCQ= 0
Apr 15 11:31:42 master kernel: Raw sense data:0x70 0x00 0x02 0x00 0x00 0x00 
0x00 0x0a 0x00 0x00 0x00 0x00
0x3a 0x00 0x00 0x00 0x00 0x00

The last three lines were repeated many times.


----------- problem 2. ------------

The most recent files were recovered, so we started another restore job for the
rest of the data, which was on tape W0006. It was going well until the restore 
was terminated
with following errors:

serwus-fd: brw-rw----   1 root     disk           0 2004-02-04 11:22:51  
/tmp/serwus/dev/rd/c1d2
15-Apr 13:08 master-sd: RestoreFiles.2005-04-15_13.05.48 Fatal error: 
read.c:117 Error sending to File daemon. ERR=Connection reset by peer
serwus-fd: brw-rw----   1 root     disk           0 2004-02-04 11:22:51  
/tmp/serwus/dev/rd/c1d20
serwus-fd: brw-rw----   1 root     disk           0 2004-02-04 11:22:51  
/tmp/serwus/dev/rd/c1d20p1
serwus-fd: brw-rw----   1 root     disk           0 2004-02-04 11:22:51  
/tmp/serwus/dev/rd/c1d20p2
serwus-fd: brw-rw----   1 root     disk           0 2004-02-04 11:22:51  
/tmp/serwus/dev/rd/c1d20p3
15-Apr 13:08 serwus-fd: RestoreFiles.2005-04-15_13.05.48 Fatal error: Record 
header file index 45111 not equal record index 44420
15-Apr 13:08 master-sd: RestoreFiles.2005-04-15_13.05.48 Fatal error: 
read.c:117 Error sending to File daemon. ERR=Connection reset by peer
(above line repeated many times)
15-Apr 13:08 master-sd: RestoreFiles.2005-04-15_13.05.48 Error: bnet.c:406 
Write error sending 32 bytes to client:213.227.67.33:36643: ERR=Connection 
reset by peer
15-Apr 13:08 master-dir: RestoreFiles.2005-04-15_13.05.48 Error: Bacula 1.36.2 
(28Feb05): 15-Apr-2005 13:08:41

Then we started the restore job again, skipping the files already restored, and
the problematic "/dev/rd/" directory. That job finished without problems.

----------- problem 3., related ---

When all of the files were restored we found out, that permissions on the
directories restored during the terminated job are insane. E.g. 744 on /usr,
or 777 on /usr/local/bin (that is a _big_security_hole_ after permissions on
upper level directories are fixed). 

I managed to fix those permissions (thanks to RPM) and the base services of the 
server could be restored.


Yesterday we got the new disk and wanted to restore the remaining files. This 
time Bacula asked for 
the full-backup (W0006) tape first, but after several hours of the restore
process we got teh problem 2. again.

serwus-fd: crw-rw----   1 root     disk           0 2004-02-04 15:51:58  
/vol/3/dev/nzqft3
serwus-fd: crw-rw----   1 root     sys            0 2004-02-04 15:51:58  
/vol/3/dev/oldjs0
18-Apr 16:57 master-sd: RestoreFiles.2005-04-18_13.49.51 Fatal error: 
read.c:117 Error sending to File daemon. ERR=Connection reset by peer
(above line repeated many times)
erwus-fd: crw-rw----   1 root     sys            0 2004-02-04 15:51:58  
/vol/3/dev/oldjs1
serwus-fd: crw-rw----   1 root     sys            0 2004-02-04 15:51:58  
/vol/3/dev/oldjs2
serwus-fd: crw-rw----   1 root     sys            0 2004-02-04 15:51:58  
/vol/3/dev/oldjs3
serwus-fd: brw-rw----   1 root     disk           0 2004-02-04 15:51:58  
/vol/3/dev/optcd
serwus-fd: crw-rw----   1 root     disk           0 2004-02-04 15:51:58  
/vol/3/dev/osst0
serwus-fd: crw-rw----   1 root     disk           0 2004-02-04 15:51:58  
/vol/3/dev/osst1
18-Apr 16:57 serwus-fd: RestoreFiles.2005-04-18_13.49.51 Fatal error: Record 
header file index 461344 not equal record index 0
18-Apr 16:57 master-sd: RestoreFiles.2005-04-18_13.49.51 Fatal error: 
read.c:117 Error sending to File daemon. ERR=Connection reset by peer
(above line repeated many times)
18-Apr 16:57 master-sd: RestoreFiles.2005-04-18_13.49.51 Error: bnet.c:374 
Write error sending len to client:213.227.67.1:36643: ERR=Connection reset by 
peer
18-Apr 16:57 master-dir: RestoreFiles.2005-04-18_13.49.51 Error: Bacula 1.36.2 
(28Feb05): 18-Apr-2005 16:57:54

----------- problem 4. -----

Today I started new restore, containing only a part of the files to restore,
and not including any /dev directory. That went fine until the end of the tape
and, guess what, Bacula asked for the second tape:

19-Apr 11:32 master-sd: End of Volume at file 34 on device /dev/nst0, Volume 
"W0006"
19-Apr 11:34 master-sd: RestoreFiles.2005-04-19_09.41.11 Warning: Wrong Volume 
mounted on device /dev/nst0: Wanted K0003 have W0006
19-Apr 11:34 master-sd: Please mount Volume "K0003" on Storage Device "DDS-4" 
for Job RestoreFiles.2005-04-19_09.41.11

It seemed this time the restore job would fully succeed, but after mounting the 
tape we got:

19-Apr 11:40 master-sd: Forward spacing to file:block 124:0.
serwus-fd: -rw-rw----   1 root     utmp      447928 2005-04-13 00:14:01  
/vol/3/var/log/lastlog
19-Apr 12:12 master-sd: RestoreFiles.2005-04-19_09.41.11 Fatal error: 
read.c:132 Error sending to File daemon. ERR=Connection timed out
19-Apr 12:12 master-sd: RestoreFiles.2005-04-19_09.41.11 Error: bnet.c:406 
Write error sending 32768 bytes to client:213.227.67.1:36643: ERR=Connection 
timed out
19-Apr 12:31 serwus-fd: RestoreFiles.2005-04-19_09.41.11 Fatal error: 
restore.c:125 Data record error. ERR=Interrupted system call
19-Apr 12:31 master-dir: Bacula 1.36.2 (28Feb05): 19-Apr-2005 12:31:14

I had to start separate job to restore files from the incremental backup only.

--------- problem 6. ------

When Bacula restores all the requested files from a job it still reads
gigabytes of data from the tape until it reaches end of the saved job. Is that
really neccessary. Doesn't it know that all selected files were restored and it
can proceed with another volume?


Now I have most of the files restored (one job is still running), but that took
much more of my time, that it should. If that is how Bacula work, that it is 
very far from being
a good backup/restore utility.


Greets,
        Jacek


-------------------------------------------------------
This SF.Net email is sponsored by: New Crystal Reports XI.
Version 11 adds new functionality designed to reduce time involved in
creating, integrating, and deploying reporting solutions. Free runtime info,
new features, or free trial, at: http://www.businessobjects.com/devxi/728
_______________________________________________
Bacula-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-users

[Bacula-users] Various problems during recovery

Reply via email to