Hello,
I have set up backups of several of my machines via Bacula months ago. And that works quite well. At the very beginning I have also done a test recovery. That was full recovery of a small server and worked perfect. However, on Friday we got a real-life problem -- disk failure on a production server. Some of data had to be restored from a bacula backup. We have the last full backup on volume W0006 and the last incremental backup on volume K0003. It seemed the recovery will be is. But it was not. We didn't want to recover everything at once, because we didn't have the new disk yet. We started a restore job only for files from the main filesystem (/, /usr, other were safe on the other disk). The server with failed disk was AMD64, to restore the root filesystem we have installed the target disk temporarily in a x86 workstation and configured bacula-fd on the workstation as it was configured on the server. ----------- problem 1. ----------- When the restore job (most recent backup for a client) was started Bacula requested tape K0003 (the incremental one -- IMHO strange). The most recent files on the backup were restored, then we got error: 15-Apr 11:29 master-sd: End of Volume at file 169 on device /dev/nst0, Volume "K0003" and a bit later: 15-Apr 11:31 master-sd: RestoreFiles.2005-04-15_11.25.50 Error: block.c:782 Read error at file:blk 0:0 on device /dev/nst0. ERR=Input/output error. Bacula didn't ask us for the second tape, but the SD seemed to be locked in "Drive is initializing"/"device open, but not mounted" state. I was not able to change the tape manually or do any mount/umount command. In dmesg we got: Apr 15 11:31:40 master kernel: scsi0:0:6:0: Attempting to abort cmd f7a5fb00: 0x1e 0x0 0x0 0x0 0x1 0x0 Apr 15 11:31:40 master kernel: scsi0: At time of recovery, card was not paused Apr 15 11:31:40 master kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<< Apr 15 11:31:40 master kernel: scsi0: Dumping Card State at program address 0x2 Mode 0x22 [...] Apr 15 11:31:40 master kernel: DevQ(0:6:0): 0 waiting Apr 15 11:31:40 master kernel: (scsi0:A:6:0): Device is disconnected, re-queuing SCB Apr 15 11:31:40 master kernel: Recovery code sleeping Apr 15 11:31:40 master kernel: (scsi0:A:6:0): Abort Message Sent Apr 15 11:31:40 master kernel: (scsi0:A:6:0): SCB 3 - Abort Tag Completed. Apr 15 11:31:40 master kernel: Recovery SCB completes Apr 15 11:31:40 master kernel: found == 0x1 Apr 15 11:31:40 master kernel: Recovery code awake Apr 15 11:31:42 master kernel: st0: Error with sense data: Current st0: sense = 70 2 Apr 15 11:31:42 master kernel: ASC=3a ASCQ= 0 Apr 15 11:31:42 master kernel: Raw sense data:0x70 0x00 0x02 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x3a 0x00 0x00 0x00 0x00 0x00 The last three lines were repeated many times. ----------- problem 2. ------------ The most recent files were recovered, so we started another restore job for the rest of the data, which was on tape W0006. It was going well until the restore was terminated with following errors: serwus-fd: brw-rw---- 1 root disk 0 2004-02-04 11:22:51 /tmp/serwus/dev/rd/c1d2 15-Apr 13:08 master-sd: RestoreFiles.2005-04-15_13.05.48 Fatal error: read.c:117 Error sending to File daemon. ERR=Connection reset by peer serwus-fd: brw-rw---- 1 root disk 0 2004-02-04 11:22:51 /tmp/serwus/dev/rd/c1d20 serwus-fd: brw-rw---- 1 root disk 0 2004-02-04 11:22:51 /tmp/serwus/dev/rd/c1d20p1 serwus-fd: brw-rw---- 1 root disk 0 2004-02-04 11:22:51 /tmp/serwus/dev/rd/c1d20p2 serwus-fd: brw-rw---- 1 root disk 0 2004-02-04 11:22:51 /tmp/serwus/dev/rd/c1d20p3 15-Apr 13:08 serwus-fd: RestoreFiles.2005-04-15_13.05.48 Fatal error: Record header file index 45111 not equal record index 44420 15-Apr 13:08 master-sd: RestoreFiles.2005-04-15_13.05.48 Fatal error: read.c:117 Error sending to File daemon. ERR=Connection reset by peer (above line repeated many times) 15-Apr 13:08 master-sd: RestoreFiles.2005-04-15_13.05.48 Error: bnet.c:406 Write error sending 32 bytes to client:213.227.67.33:36643: ERR=Connection reset by peer 15-Apr 13:08 master-dir: RestoreFiles.2005-04-15_13.05.48 Error: Bacula 1.36.2 (28Feb05): 15-Apr-2005 13:08:41 Then we started the restore job again, skipping the files already restored, and the problematic "/dev/rd/" directory. That job finished without problems. ----------- problem 3., related --- When all of the files were restored we found out, that permissions on the directories restored during the terminated job are insane. E.g. 744 on /usr, or 777 on /usr/local/bin (that is a _big_security_hole_ after permissions on upper level directories are fixed). I managed to fix those permissions (thanks to RPM) and the base services of the server could be restored. Yesterday we got the new disk and wanted to restore the remaining files. This time Bacula asked for the full-backup (W0006) tape first, but after several hours of the restore process we got teh problem 2. again. serwus-fd: crw-rw---- 1 root disk 0 2004-02-04 15:51:58 /vol/3/dev/nzqft3 serwus-fd: crw-rw---- 1 root sys 0 2004-02-04 15:51:58 /vol/3/dev/oldjs0 18-Apr 16:57 master-sd: RestoreFiles.2005-04-18_13.49.51 Fatal error: read.c:117 Error sending to File daemon. ERR=Connection reset by peer (above line repeated many times) erwus-fd: crw-rw---- 1 root sys 0 2004-02-04 15:51:58 /vol/3/dev/oldjs1 serwus-fd: crw-rw---- 1 root sys 0 2004-02-04 15:51:58 /vol/3/dev/oldjs2 serwus-fd: crw-rw---- 1 root sys 0 2004-02-04 15:51:58 /vol/3/dev/oldjs3 serwus-fd: brw-rw---- 1 root disk 0 2004-02-04 15:51:58 /vol/3/dev/optcd serwus-fd: crw-rw---- 1 root disk 0 2004-02-04 15:51:58 /vol/3/dev/osst0 serwus-fd: crw-rw---- 1 root disk 0 2004-02-04 15:51:58 /vol/3/dev/osst1 18-Apr 16:57 serwus-fd: RestoreFiles.2005-04-18_13.49.51 Fatal error: Record header file index 461344 not equal record index 0 18-Apr 16:57 master-sd: RestoreFiles.2005-04-18_13.49.51 Fatal error: read.c:117 Error sending to File daemon. ERR=Connection reset by peer (above line repeated many times) 18-Apr 16:57 master-sd: RestoreFiles.2005-04-18_13.49.51 Error: bnet.c:374 Write error sending len to client:213.227.67.1:36643: ERR=Connection reset by peer 18-Apr 16:57 master-dir: RestoreFiles.2005-04-18_13.49.51 Error: Bacula 1.36.2 (28Feb05): 18-Apr-2005 16:57:54 ----------- problem 4. ----- Today I started new restore, containing only a part of the files to restore, and not including any /dev directory. That went fine until the end of the tape and, guess what, Bacula asked for the second tape: 19-Apr 11:32 master-sd: End of Volume at file 34 on device /dev/nst0, Volume "W0006" 19-Apr 11:34 master-sd: RestoreFiles.2005-04-19_09.41.11 Warning: Wrong Volume mounted on device /dev/nst0: Wanted K0003 have W0006 19-Apr 11:34 master-sd: Please mount Volume "K0003" on Storage Device "DDS-4" for Job RestoreFiles.2005-04-19_09.41.11 It seemed this time the restore job would fully succeed, but after mounting the tape we got: 19-Apr 11:40 master-sd: Forward spacing to file:block 124:0. serwus-fd: -rw-rw---- 1 root utmp 447928 2005-04-13 00:14:01 /vol/3/var/log/lastlog 19-Apr 12:12 master-sd: RestoreFiles.2005-04-19_09.41.11 Fatal error: read.c:132 Error sending to File daemon. ERR=Connection timed out 19-Apr 12:12 master-sd: RestoreFiles.2005-04-19_09.41.11 Error: bnet.c:406 Write error sending 32768 bytes to client:213.227.67.1:36643: ERR=Connection timed out 19-Apr 12:31 serwus-fd: RestoreFiles.2005-04-19_09.41.11 Fatal error: restore.c:125 Data record error. ERR=Interrupted system call 19-Apr 12:31 master-dir: Bacula 1.36.2 (28Feb05): 19-Apr-2005 12:31:14 I had to start separate job to restore files from the incremental backup only. --------- problem 6. ------ When Bacula restores all the requested files from a job it still reads gigabytes of data from the tape until it reaches end of the saved job. Is that really neccessary. Doesn't it know that all selected files were restored and it can proceed with another volume? Now I have most of the files restored (one job is still running), but that took much more of my time, that it should. If that is how Bacula work, that it is very far from being a good backup/restore utility. Greets, Jacek ------------------------------------------------------- This SF.Net email is sponsored by: New Crystal Reports XI. Version 11 adds new functionality designed to reduce time involved in creating, integrating, and deploying reporting solutions. Free runtime info, new features, or free trial, at: http://www.businessobjects.com/devxi/728 _______________________________________________ Bacula-users mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/bacula-users
