Hello,

On Wednesday 03 May 2006 16:42, Andreas Koch wrote:
> Hi all,
>
> we have hit upon a conundrum in our (so far very satisfying) use of
> Bacula.
>
> Our central file server crashed while Bacula 1.38.4 was writing Job 150 to
> tape:
>
> | 150   | DefaultBackup       | 2006-04-27 04:05:02 | B    | D     | 0      | 0           | R         |
>
> After restarting (and updating the kernel [different story]), we manually
> reran the backup jobs to the same tape (Thursday-0001), which completed
> successfully:
>
> | 152   | DefaultBackup       | 2006-04-27 14:19:14 | B    | D     | 221770 | 5623183006  | T         |
> | 153   | BackupCatalog       | 2006-04-27 14:53:12 | B    | F     | 1      | 939894373   | T         |
>
> Note that there is no entry for job 151, which would have been the Catalog
> backup for Job 150.
>
> However, attempting to restore any job past the failed job 150 is no longer
> possible. Here's a log for a restore attempt of job 152
>
> 03-May 14:09 erebor-dir: Start Restore Job RestoreFiles.2006-05-03_14.09.14
> 03-May 14:09 erebor-sd: Ready to read from volume "Thursday-0001" on device
> "LTO-2" (/dev/nst0).
> 03-May 14:09 erebor-sd: Forward spacing to file:block 276:0.
> 03-May 14:10 erebor-sd: RestoreFiles.2006-05-03_14.09.14 Error: block.c:263
> Volume data error at 276:8671! Wanted ID: "BB02", got "*". Buffer discarded.
> 03-May 14:10 erebor-dir: RestoreFiles.2006-05-03_14.09.14 Error:
> Bacula 1.38.8 (14Apr06): 03-May-2006 14:10:57
>
> Similarly, a bls -j on the volume Thursday-0001 ends with:
>
> Begin Job Session Record:
> JobId             : 150
> VerNum            : 11
> PoolName          : ThursdayTapePool
> PoolType          : Backup
> JobName           : DefaultBackup
> ClientName        : erebor-fd
> Job (unique name) : DefaultBackup.2006-04-27_04.05.00
> FileSet           : FullSet
> JobType           : B
> JobLevel          : D
> Date written      : 27-Apr-2006 09:58
> 02-May 19:06 bls: Got EOF at file 272  on device "LTO-2" (/dev/nst0),
> Volume "Thursday-0001"
> 02-May 19:07 bls: Got EOF at file 273  on device "LTO-2" (/dev/nst0),
> Volume "Thursday-0001"
> 02-May 19:07 bls: Got EOF at file 274  on device "LTO-2" (/dev/nst0),
> Volume "Thursday-0001"
> 02-May 19:07 bls: Got EOF at file 275  on device "LTO-2" (/dev/nst0),
> Volume "Thursday-0001"
> 02-May 19:07 bls: Got EOF at file 276  on device "LTO-2" (/dev/nst0),
> Volume "Thursday-0001"
> 02-May 19:07 bls: bls Error: block.c:263 Volume data error at 276:8671!
> Wanted ID: "BB02", got "*". Buffer discarded.
>  Bacula status: file=276 block=8671
>  Device status: WR_PROT ONLINE IM_REP_EN file=276 block=8672
>
> A rather desperate attempt to fix the media record with bscan -m (now
> updated to 1.38.8) also fails with:
>
> bscan: bscan.c:487 SOS_LABEL: Found Job record for JobId: 0
> 03-May 16:02 bscan: End of file 271  on device "LTO-2" (/dev/nst0), Volume
> "Thursday-0001"
> bscan: bscan.c:487 SOS_LABEL: Found Job record for JobId: 0
> bscan: bscan.c:669 4,784,128 file records. At file:blk=271:3,760
> bytes=234,529,489,158
> bscan: bscan.c:669 4,816,896 file records. At file:blk=271:6,236
> bytes=234,689,085,090
> 03-May 16:02 bscan: End of file 272  on device "LTO-2" (/dev/nst0), Volume
> "Thursday-0001"
> 03-May 16:03 bscan: End of file 273  on device "LTO-2" (/dev/nst0), Volume
> "Thursday-0001"
> bscan: bscan.c:669 4,849,664 file records. At file:blk=273:36,561
> bytes=236,644,324,571
> bscan: bscan.c:669 4,882,432 file records. At file:blk=273:43,492
> bytes=237,091,261,422
> 03-May 16:03 bscan: End of file 274  on device "LTO-2" (/dev/nst0), Volume
> "Thursday-0001"
> 03-May 16:03 bscan: End of file 275  on device "LTO-2" (/dev/nst0), Volume
> "Thursday-0001"
> 03-May 16:03 bscan: End of file 276  on device "LTO-2" (/dev/nst0), Volume
> "Thursday-0001"
> bscan: bscan.c:669 4,915,200 file records. At file:blk=276:82,665
> bytes=239,616,955,619
> bscan: bscan.c:669 4,947,968 file records. At file:blk=276:83,527
> bytes=239,672,522,633
> bscan: bscan.c:669 4,980,736 file records. At file:blk=276:84,596
> bytes=239,741,454,494
> 03-May 16:04 bscan: bscan Error: block.c:263 Volume data error at 276:8671!
> Wanted ID: "BB02", got "*". Buffer discarded.
>  Bacula status: file=276 block=8671
>  Device status: WR_PROT ONLINE IM_REP_EN file=276 block=8672
> Records would have been added or updated in the catalog:
>       1 Media
>       1 Pool
>      69 Job
> 4982796 File
>
> There are at least two aspects to all this. First, how can we continue to
> use this tape (backups to it still run) and actually _restore_ data from
> it? 

You definitely should not continue to append to this Volume.  Once you are 
sure you have all the data from it, it would be better to physically erase 
the volume, delete it from the catalog, test it with btape "fill" and if it 
is good, relabel it.

> Second, if the data base has gotten so wedged that restores are no 
> longer possible, Bacula should refrain from indicating successful backups
> to the volume, and instead mark it as `Error'.

It probably wouldn't be such a bad idea for Bacula to mark the Volume Status 
"Error" if it gets an error reading the tape and the status is Append. That 
would prevent it from being appended to again.  However, during normal append 
of a Volume, Bacula does not read the tape (unless you have a defective OS 
where Bacula must slowly advance to the end of the tape), but rather uses the 
OS EOD command (position to end of data), which does not normally read the 
tape, only the tape marks.

>
> Any ideas how to resolve this?

If you don't *absolutely* need the data from that particular tape, you should 
change the Volume status to Error and do a full backup to a fresh tape.  If 
you need the data, you can first try running the Storage daemon with the -p 
option on the command line and see if it can get past the bad spot.  It is 
unlikely but possible.  If that works, great.  If using -p doesn't work, 
you will need to see if the error lies in the data you want or before it.  If 
it is before the data you want, you can most likely use the techniques 
reported by a user recently to construct a new tape by copying the first few 
files of the bad tape, forward spacing to your data, then copying it.  Then 
you should be able to use bextract to extract what you want.  This is a bit 
technical and complicated, and not something I can help you with in detail. 
Perhaps someone else can.

However, an outline of what you need to do would be:

In bconsole, do the following:

sql
select * from JobMedia where JobId=150;

where I assume you want to recover the data from JobId 150.  There may be only 
one line that prints, or multiple lines.  The StartFile from the first line 
that prints will be the file number on the tape where the Job data begins. 
The EndFile is the last file on the tape where the Job data exists.  Those 
would be the tape files that you want to recover.  By looking at the output 
listing above, we know that the bad data is in File 276, so by comparing the 
two you will know what is good and what is bad.  Then it is just a matter of 
spending a lot of time and trying to recover the data.  Be aware that there 
can always be a problem with counting file numbers from base 0 or base 1, so 
if in doubt always try to take one more file than you think you need.
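
The comparison described above can be sketched in a few lines of Python.  This 
is only an illustration: the StartFile/EndFile values in the example are made 
up (use whatever your own JobMedia query prints), and the helper names 
JobMediaRow and plan_recovery are hypothetical, not part of Bacula.  The 
padding of one file on each side is the "take one more file" advice from 
above.

```python
from typing import Iterable, NamedTuple, Tuple


class JobMediaRow(NamedTuple):
    """The two JobMedia columns of interest, as printed by the SQL query."""
    start_file: int
    end_file: int


def plan_recovery(rows: Iterable[JobMediaRow], bad_file: int) -> Tuple[int, int, str]:
    """Return (copy_from, copy_to, verdict) for one job.

    verdict says where the job's tape files lie relative to the file that
    contains the bad block.  copy_from/copy_to pad the range by one file on
    each side to guard against base-0 versus base-1 counting mistakes.
    """
    rows = list(rows)
    if not rows:
        raise ValueError("no JobMedia rows for this job")

    # A job may span several JobMedia rows; take the overall extent.
    first = min(r.start_file for r in rows)
    last = max(r.end_file for r in rows)

    if last < bad_file:
        verdict = "before bad file"
    elif first > bad_file:
        verdict = "after bad file"
    else:
        verdict = "overlaps bad file"

    copy_from = max(first - 1, 0)  # never go below tape file 0
    copy_to = last + 1
    return copy_from, copy_to, verdict


if __name__ == "__main__":
    # Made-up rows; the bad block reported above is in file 276.
    example = [JobMediaRow(start_file=271, end_file=276)]
    print(plan_recovery(example, bad_file=276))
```

A job that overlaps the bad file will lose at least the data in that file; a 
job that lies entirely after it is the case where copying the tape files to a 
new tape is most likely to help.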

If you print the .bsr file produced during the restore command just before 
responding to the yes/mod/no question, you will see that it contains the same 
information, but in much more detail.
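
If you would rather not read the .bsr file by eye, the tape file numbers can 
be pulled out of it with a short script.  The sketch below assumes the file 
contains lines of the form VolFile=N or VolFile=N-M and ignores everything 
else; it does not handle comma-separated lists.  The sample text and all the 
values in it are invented for illustration, and volfile_ranges is a 
hypothetical helper name.

```python
import re
from typing import List, Tuple

# Matches "VolFile=276" or "VolFile=276-300", with optional spaces.
_VOLFILE = re.compile(r"^\s*VolFile\s*=\s*(\d+)(?:\s*-\s*(\d+))?\s*$")


def volfile_ranges(bsr_text: str) -> List[Tuple[int, int]]:
    """Return the (first, last) tape file ranges named in a bootstrap file."""
    ranges = []
    for line in bsr_text.splitlines():
        match = _VOLFILE.match(line)
        if match:
            first = int(match.group(1))
            # A single number means a one-file range.
            last = int(match.group(2)) if match.group(2) else first
            ranges.append((first, last))
    return ranges


# Invented sample; a real .bsr file will have different values.
SAMPLE_BSR = """\
Volume="Thursday-0001"
VolSessionId=3
VolSessionTime=1146140000
VolFile=276-300
VolBlock=0-8670
FileIndex=1-100
"""

if __name__ == "__main__":
    bad_file = 276  # file containing the bad block in the logs above
    for first, last in volfile_ranges(SAMPLE_BSR):
        touches = first <= bad_file <= last
        print(first, last, "touches bad file" if touches else "clear of bad file")
```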


The following is probably not what you want to hear now, but it is worth 
saying for the others on this list.  

If your OS crashes while Bacula is writing a Volume, especially if it is due 
to a power failure, you should *always* take that Volume out of service and 
assume that the last file and possibly more is unreadable (i.e. examine the 
tape, and redo the appropriate backups as soon as possible to another tape).  

If Bacula crashes but the OS continues running, the tape should be OK.


-- 
Best regards,

Kern

  (">
  /\
  V_V


_______________________________________________
Bacula-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-users