This document contains the technical details of Bug #935.
Bacula bug #935 reports that during a restore, a large number of files are
missing and thus not restored. This is really quite surprising because we
have a fairly extensive regression test suite that explicitly tests for this
kind of problem many times.
Despite our testing, there is indeed a bug in Bacula that has the following
characteristics:
1. It happens only when multiple simultaneous Jobs are run (regardless of
whether or not data spooling is enabled), and happens only when the
Storage daemon is changing from one Volume to another -- i.e. the
backups span multiple volumes.
2. It has only been observed on disk-based backups, but not on tape.
3. Under the right circumstances (timing), it could and probably does happen
on tape backups.
4. It seems to be timing dependent, and requires multiple clients to
reproduce, although under the right circumstances, it should be reproducible
with a single client doing multiple simultaneous backups.
5. Analysis indicates that it happens most often when the clients are slow
(e.g. doing Incremental backups).
6. It has been verified to exist in versions 2.0.x and 2.2.x.
7. It should also be in version 1.38, but could not be reproduced in testing,
perhaps due to timing considerations or the fact that the test FD daemons
were version 2.2.2.
8. The data is correctly stored on the Volume, but incorrect index (JobMedia)
records are stored in the database. (the JobMedia record generated during
the Volume change contains the index of the new Volume rather than the
previous Volume). This will be described in more detail below.
9. You can prevent the problem from occurring either by turning off multiple
simultaneous Jobs or by ensuring that, while running multiple simultaneous
Jobs, those Jobs do not span Volumes. E.g. you could manually mark
Volumes as full when they are sufficiently large.
10. If you are not running multiple simultaneous Jobs, you will not be
affected by this bug.
11. If you are running multiple simultaneous Jobs to tapes, I believe there is
a reasonable probability that this problem could show up when Jobs are split
across tapes.
12. If you are running multiple simultaneous Jobs to disks, I believe there is
a high probability that this problem will show up when Jobs are split across
disk Volumes.
===============================
The problem comes from the fact that when the end of a Volume is reached,
the SD must generate a JobMedia (index) record for each of the Jobs that is
currently running. Since each Job runs in a separate thread, the thread that
does the Volume switch marks all the other threads (Jobs) with a flag
that tells them to update the catalog index (JobMedia). Sometime later,
when such a thread attempts its next write to the Volume, it will
create a JobMedia record.
This all works perfectly well, with one exception that was overlooked:
the Volume Id (MediaId) may have changed by the time the Job
gets around to generating the JobMedia record, and the code in all
versions prior to 2.2.3 uses a "global" Volume index, which can point
to the new volume rather than the old one. This doesn't seem to happen
very often, otherwise it would have shown up in our regression tests. I
suspect that it happens when the SD is very busy (has lots of clients
feeding it) because at that point the Volume can be changed before some
of the Jobs get around to trying to write again.
The result of the above is that if Bacula writes to Tape001, then switches
to writing Tape002, any Job that doesn't update its index (JobMedia)
record before the new Volume starts being written will have its last group
of files marked as being on Volume Tape002 rather than Tape001, and
when the restore runs, it obviously will not find those files.
The solution to the problem was quite simple: when the SD is doing its
record keeping (saving the current file locations) into a local copy,
it must also save the current Volume Id; then when the JobMedia record
is written it will point to the right Volume.
Perhaps with time we will have a better idea of how to detect and correct
this problem, but for the moment, I can only show you some examples of
failed jobs and successful jobs and give some indications of how to "manually"
recover the files.
First the output showing a failure:
===================== Failure ==========================
10-Sep 08:21 rufus-dir: Start Backup JobId 1, Job=roxie.2007-09-10_08.21.21
10-Sep 08:21 rufus-dir: Using Device "FileStorage1"
10-Sep 08:21 rufus-sd: Wrote label to prelabeled Volume "TestVolume001" on
device "FileStorage1" (/home/kern/bacula/xregress/tmp)
10-Sep 08:21 rufus-dir: Start Backup JobId 2, Job=lmatou.2007-09-10_08.21.22
10-Sep 08:21 rufus-dir: Using Device "FileStorage1"
10-Sep 08:21 rufus-dir: Start Backup JobId 3,
Job=Multi-client.2007-09-10_08.21.23
10-Sep 08:21 rufus-dir: Using Device "FileStorage1"
10-Sep 08:21 rufus-sd: Job write elapsed time = 00:00:23, Transfer rate =
5.572 M bytes/second
10-Sep 08:21 rufus-dir: Bacula rufus-dir 2.2.2 (06Aug07): 10-Sep-2007 08:21:47
Build OS: i686-pc-linux-gnu redhat
JobId: 1
Job: roxie.2007-09-10_08.21.21
Backup Level: Full
Client: "roxie-fd" 2.2.1 (30Aug07)
i686-pc-linux-gnu,suse,10.2
FileSet: "roxie-FileSet" 2007-09-10 08:21:21
Pool: "Default" (From Job resource)
Storage: "File1" (From command line)
Scheduled time: 10-Sep-2007 08:21:21
Start time: 10-Sep-2007 08:21:22
End time: 10-Sep-2007 08:21:47
Elapsed time: 25 secs
Priority: 10
FD Files Written: 7,400
SD Files Written: 7,400
FD Bytes Written: 127,110,275 (127.1 MB)
SD Bytes Written: 128,178,908 (128.1 MB)
Rate: 5084.4 KB/s
Software Compression: None
VSS: no
Encryption: no
Volume name(s): TestVolume001
Volume Session Id: 1
Volume Session Time: 1189405277
Last Volume Bytes: 302,945,847 (302.9 MB)
Non-fatal FD errors: 0
SD Errors: 0
FD termination status: OK
SD termination status: OK
Termination: Backup OK
10-Sep 08:21 rufus-dir: Begin pruning Jobs.
10-Sep 08:21 rufus-dir: No Jobs found to prune.
10-Sep 08:21 rufus-dir: Begin pruning Files.
10-Sep 08:21 rufus-dir: No Files found to prune.
10-Sep 08:21 rufus-dir: End auto prune.
10-Sep 08:22 rufus-sd: User defined maximum volume capacity 900,000,000
exceeded on device "FileStorage1" (/home/kern/bacula/xregress/tmp).
10-Sep 08:22 rufus-sd: End of medium on Volume "TestVolume001"
Bytes=899,939,860 Blocks=13,950 at 10-Sep-2007 08:22.
10-Sep 08:22 rufus-sd: Wrote label to prelabeled Volume "TestVolume002" on
device "FileStorage1" (/home/kern/bacula/xregress/tmp)
10-Sep 08:22 rufus-sd: New volume "TestVolume002" mounted on
device "FileStorage1" (/home/kern/bacula/xregress/tmp) at 10-Sep-2007 08:22.
10-Sep 08:22 rufus-sd: Job write elapsed time = 00:01:20, Transfer rate =
7.982 M bytes/second
10-Sep 08:22 rufus-dir: Bacula rufus-dir 2.2.2 (06Aug07): 10-Sep-2007 08:22:51
Build OS: i686-pc-linux-gnu redhat
JobId: 2
Job: lmatou.2007-09-10_08.21.22
Backup Level: Full
Client: "lmatou-fd" 2.2.0 (08Aug07)
i686-pc-linux-gnu,redhat,
FileSet: "lmatou-FileSet" 2007-09-10 08:21:22
Pool: "Default" (From Job resource)
Storage: "File1" (From command line)
Scheduled time: 10-Sep-2007 08:21:21
Start time: 10-Sep-2007 08:21:23
End time: 10-Sep-2007 08:22:51
Elapsed time: 1 min 28 secs
Priority: 10
FD Files Written: 15,236
SD Files Written: 15,236
FD Bytes Written: 636,426,379 (636.4 MB)
SD Bytes Written: 638,628,613 (638.6 MB)
Rate: 7232.1 KB/s
Software Compression: None
VSS: no
Encryption: no
Volume name(s): TestVolume001|TestVolume002
Volume Session Id: 2
Volume Session Time: 1189405277
Last Volume Bytes: 222,311,478 (222.3 MB)
Non-fatal FD errors: 0
SD Errors: 0
FD termination status: OK
SD termination status: OK
Termination: Backup OK
10-Sep 08:22 rufus-dir: Begin pruning Jobs.
10-Sep 08:22 rufus-dir: No Jobs found to prune.
10-Sep 08:22 rufus-dir: Begin pruning Files.
10-Sep 08:22 rufus-dir: No Files found to prune.
10-Sep 08:22 rufus-dir: End auto prune.
10-Sep 08:23 rufus-sd: Job write elapsed time = 00:01:33, Transfer rate =
4.786 M bytes/second
10-Sep 08:23 rufus-dir: Bacula rufus-dir 2.2.2 (06Aug07): 10-Sep-2007 08:23:02
Build OS: i686-pc-linux-gnu redhat
JobId: 3
Job: Multi-client.2007-09-10_08.21.23
Backup Level: Full
Client: "rufus-fd" 2.2.2 (06Aug07) i686-pc-linux-gnu,redhat,
FileSet: "Full Set" 2007-09-10 08:21:23
Pool: "Default" (From Job resource)
Storage: "File1" (From command line)
Scheduled time: 10-Sep-2007 08:21:22
Start time: 10-Sep-2007 08:21:27
End time: 10-Sep-2007 08:23:02
Elapsed time: 1 min 35 secs
Priority: 10
FD Files Written: 6,696
SD Files Written: 6,696
FD Bytes Written: 444,252,164 (444.2 MB)
SD Bytes Written: 445,181,110 (445.1 MB)
Rate: 4676.3 KB/s
Software Compression: None
VSS: no
Encryption: no
Volume name(s): TestVolume002
Volume Session Id: 3
Volume Session Time: 1189405277
Last Volume Bytes: 313,858,029 (313.8 MB)
Non-fatal FD errors: 0
SD Errors: 0
FD termination status: OK
SD termination status: OK
Termination: Backup OK
10-Sep 08:23 rufus-dir: Begin pruning Jobs.
10-Sep 08:23 rufus-dir: No Jobs found to prune.
10-Sep 08:23 rufus-dir: Begin pruning Files.
10-Sep 08:23 rufus-dir: No Files found to prune.
10-Sep 08:23 rufus-dir: End auto prune.
@#
@# now do a restore
@#
@tee tmp/log2.out
restore where=/home/kern/bacula/xregress/tmp/bacula-restores client=rufus-fd
select storage=File1
Automatically selected FileSet: Full Set
+-------+-------+----------+-------------+---------------------+---------------+
| JobId | Level | JobFiles | JobBytes    | StartTime           | VolumeName    |
+-------+-------+----------+-------------+---------------------+---------------+
|     3 | F     |    6,696 | 444,252,164 | 2007-09-10 08:21:27 | TestVolume002 |
+-------+-------+----------+-------------+---------------------+---------------+
You have selected the following JobId: 3
Building directory tree for JobId 3 ...
+++++++++++++++++++++++++++++++++++++++++++++
1 Job, 6,078 files inserted into the tree.
You are now entering file selection mode where you add (mark) and
remove (unmark) files to be restored. No files are initially added, unless
you used the "all" keyword on the command line.
Enter "done" to leave this mode.
cwd is: /
$ unmark *
6,696 files unmarked.
$ mark *
6,696 files marked.
$ done
Bootstrap records written
to /home/kern/bacula/xregress/working/rufus-dir.restore.1.bsr
The job will require the following
Volume(s) Storage(s) SD Device(s)
===========================================================================
TestVolume002 File1 FileStorage1
6,696 files selected to be restored.
Run Restore job
JobName: RestoreFiles
Bootstrap: /home/kern/bacula/xregress/working/rufus-dir.restore.1.bsr
Where: /home/kern/bacula/xregress/tmp/bacula-restores
Replace: always
FileSet: Full Set
Backup Client: rufus-fd
Restore Client: rufus-fd
Storage: File1
When: 2007-09-10 08:23:03
Catalog: MyCatalog
Priority: 10
OK to run? (yes/mod/no): yes
Job queued. JobId=4
wait
You have messages.
messages
10-Sep 08:23 rufus-dir: Start Restore Job RestoreFiles.2007-09-10_08.23.03
10-Sep 08:23 rufus-dir: Using Device "FileStorage1"
10-Sep 08:23 rufus-sd: Ready to read from volume "TestVolume002" on
device "FileStorage1" (/home/kern/bacula/xregress/tmp).
10-Sep 08:23 rufus-sd: Forward spacing Volume "TestVolume002" to file:block
0:64724.
10-Sep 08:23 rufus-fd: RestoreFiles.2007-09-10_08.23.03 Warning: Cannot change
owner and/or group of
/home/kern/bacula/xregress/tmp/bacula-restores/home/kern/bacula/k/src/win32/stored/:
ERR=Operation not permitted
10-Sep 08:23 rufus-fd: RestoreFiles.2007-09-10_08.23.03 Error: attribs.c:442
Unable to set file
owner /home/kern/bacula/xregress/tmp/bacula-restores/home/kern/bacula/k/:
ERR=Operation not permitted
[ ... snip ... ]
10-Sep 08:23 rufus-dir: RestoreFiles.2007-09-10_08.23.03 Error: Bacula
rufus-dir 2.2.2 (06Aug07): 10-Sep-2007 08:23:23
Build OS: i686-pc-linux-gnu redhat
JobId: 4
Job: RestoreFiles.2007-09-10_08.23.03
Restore Client: rufus-fd
Start time: 10-Sep-2007 08:23:05
End time: 10-Sep-2007 08:23:23
Files Expected: 6,696
Files Restored: 2,525
Bytes Restored: 188,395,392
Rate: 10466.4 KB/s
FD Errors: 240
FD termination status: Error
SD termination status: OK
Termination: *** Restore Error ***
10-Sep 08:23 rufus-dir: Begin pruning Jobs.
10-Sep 08:23 rufus-dir: No Jobs found to prune.
10-Sep 08:23 rufus-dir: Begin pruning Files.
10-Sep 08:23 rufus-dir: No Files found to prune.
10-Sep 08:23 rufus-dir: End auto prune.
!!!!! multi-client-test failed!!! !!!!!
!!!!! Bad Job termination status !!!!!
bstat=0 rstat=1 dstat=0
================= end failure example ==================
Now what happened above is that there were three backup jobs run. JobId 3
Multi-client.2007-09-10_08.21.23 shows a successful backup to TestVolume002.
However, it *actually* wrote to TestVolume001 and TestVolume002. When the
restore is done, only TestVolume002 appears in the output, and only the files
that were on TestVolume002 were restored. This is because *all* the JobMedia
records indicate they were written on TestVolume002.
Now the output from the same test that succeeds (after the patch is applied)
looks like:
==================== begin successful backup/restore =============
10-Sep 12:37 rufus-dir: Start Backup JobId 1, Job=roxie.2007-09-10_12.37.07
10-Sep 12:37 rufus-dir: Using Device "FileStorage1"
10-Sep 12:37 rufus-sd: Wrote label to prelabeled Volume "TestVolume001" on
device "FileStorage1" (/home/kern/bacula/regress/tmp)
10-Sep 12:37 rufus-dir: Start Backup JobId 2, Job=lmatou.2007-09-10_12.37.08
10-Sep 12:37 rufus-dir: Start Backup JobId 3,
Job=Multi-client.2007-09-10_12.37.09
10-Sep 12:37 rufus-dir: Using Device "FileStorage1"
10-Sep 12:37 rufus-dir: Using Device "FileStorage1"
10-Sep 12:37 rufus-sd: Job write elapsed time = 00:00:25, Transfer rate =
5.127 M bytes/second
10-Sep 12:37 rufus-dir: Bacula rufus-dir 2.3.3 (06Sep07): 10-Sep-2007 12:37:35
Build OS: i686-pc-linux-gnu redhat
JobId: 1
Job: roxie.2007-09-10_12.37.07
Backup Level: Full
Client: "roxie-fd" 2.2.1 (30Aug07)
i686-pc-linux-gnu,suse,10.2
FileSet: "roxie-FileSet" 2007-09-10 12:37:07
Pool: "Default" (From Job resource)
Storage: "File1" (From command line)
Scheduled time: 10-Sep-2007 12:37:06
Start time: 10-Sep-2007 12:37:08
End time: 10-Sep-2007 12:37:35
Elapsed time: 27 secs
Priority: 10
FD Files Written: 7,400
SD Files Written: 7,400
FD Bytes Written: 127,110,275 (127.1 MB)
SD Bytes Written: 128,178,908 (128.1 MB)
Rate: 4707.8 KB/s
Software Compression: None
VSS: no
Encryption: no
Volume name(s): TestVolume001
Volume Session Id: 1
Volume Session Time: 1189420623
Last Volume Bytes: 363,458,103 (363.4 MB)
Non-fatal FD errors: 0
SD Errors: 0
FD termination status: OK
SD termination status: OK
Termination: Backup OK
10-Sep 12:37 rufus-dir: Begin pruning Jobs.
10-Sep 12:37 rufus-dir: No Jobs found to prune.
10-Sep 12:37 rufus-dir: Begin pruning Files.
10-Sep 12:37 rufus-dir: No Files found to prune.
10-Sep 12:37 rufus-dir: End auto prune.
10-Sep 12:38 rufus-sd: User defined maximum volume capacity 900,000,000
exceeded on device "FileStorage1" (/home/kern/bacula/regress/tmp).
10-Sep 12:38 rufus-sd: End of medium on Volume "TestVolume001"
Bytes=899,939,867 Blocks=13,950 at 10-Sep-2007 12:38.
10-Sep 12:38 rufus-sd: Wrote label to prelabeled Volume "TestVolume002" on
device "FileStorage1" (/home/kern/bacula/regress/tmp)
10-Sep 12:38 rufus-sd: New volume "TestVolume002" mounted on
device "FileStorage1" (/home/kern/bacula/regress/tmp) at 10-Sep-2007 12:38.
10-Sep 12:38 rufus-sd: Job write elapsed time = 00:01:24, Transfer rate =
7.602 M bytes/second
10-Sep 12:38 rufus-dir: Bacula rufus-dir 2.3.3 (06Sep07): 10-Sep-2007 12:38:40
Build OS: i686-pc-linux-gnu redhat
JobId: 2
Job: lmatou.2007-09-10_12.37.08
Backup Level: Full
Client: "lmatou-fd" 2.2.0 (08Aug07)
i686-pc-linux-gnu,redhat,
FileSet: "lmatou-FileSet" 2007-09-10 12:37:08
Pool: "Default" (From Job resource)
Storage: "File1" (From command line)
Scheduled time: 10-Sep-2007 12:37:07
Start time: 10-Sep-2007 12:37:09
End time: 10-Sep-2007 12:38:40
Elapsed time: 1 min 31 secs
Priority: 10
FD Files Written: 15,236
SD Files Written: 15,236
FD Bytes Written: 636,426,379 (636.4 MB)
SD Bytes Written: 638,628,613 (638.6 MB)
Rate: 6993.7 KB/s
Software Compression: None
VSS: no
Encryption: no
Volume name(s): TestVolume001|TestVolume002
Volume Session Id: 2
Volume Session Time: 1189420623
Last Volume Bytes: 274,050,095 (274.0 MB)
Non-fatal FD errors: 0
SD Errors: 0
FD termination status: OK
SD termination status: OK
Termination: Backup OK
10-Sep 12:38 rufus-dir: Begin pruning Jobs.
10-Sep 12:38 rufus-dir: No Jobs found to prune.
10-Sep 12:38 rufus-dir: Begin pruning Files.
10-Sep 12:38 rufus-dir: No Files found to prune.
10-Sep 12:38 rufus-dir: End auto prune.
10-Sep 12:38 rufus-sd: Job write elapsed time = 00:01:38, Transfer rate =
4.542 M bytes/second
10-Sep 12:38 rufus-dir: Bacula rufus-dir 2.3.3 (06Sep07): 10-Sep-2007 12:38:48
Build OS: i686-pc-linux-gnu redhat
JobId: 3
Job: Multi-client.2007-09-10_12.37.09
Backup Level: Full
Client: "rufus-fd" 2.3.3 (06Sep07) i686-pc-linux-gnu,redhat,
FileSet: "Full Set" 2007-09-10 12:37:09
Pool: "Default" (From Job resource)
Storage: "File1" (From command line)
Scheduled time: 10-Sep-2007 12:37:08
Start time: 10-Sep-2007 12:37:09
End time: 10-Sep-2007 12:38:48
Elapsed time: 1 min 39 secs
Priority: 10
FD Files Written: 6,696
SD Files Written: 6,696
FD Bytes Written: 444,252,164 (444.2 MB)
SD Bytes Written: 445,181,110 (445.1 MB)
Rate: 4487.4 KB/s
Software Compression: None
VSS: no
Encryption: no
Volume name(s): TestVolume001|TestVolume002
Volume Session Id: 3
Volume Session Time: 1189420623
Last Volume Bytes: 313,858,022 (313.8 MB)
Non-fatal FD errors: 0
SD Errors: 0
FD termination status: OK
SD termination status: OK
Termination: Backup OK
10-Sep 12:38 rufus-dir: Begin pruning Jobs.
10-Sep 12:38 rufus-dir: No Jobs found to prune.
10-Sep 12:38 rufus-dir: Begin pruning Files.
10-Sep 12:38 rufus-dir: No Files found to prune.
10-Sep 12:38 rufus-dir: End auto prune.
@#
@# now do a restore
@#
@tee tmp/log2.out
restore where=/home/kern/bacula/regress/tmp/bacula-restores client=rufus-fd
select storage=File1
Automatically selected FileSet: Full Set
+-------+-------+----------+-------------+---------------------+---------------+
| JobId | Level | JobFiles | JobBytes    | StartTime           | VolumeName    |
+-------+-------+----------+-------------+---------------------+---------------+
|     3 | F     |    6,696 | 444,252,164 | 2007-09-10 12:37:09 | TestVolume001 |
|     3 | F     |    6,696 | 444,252,164 | 2007-09-10 12:37:09 | TestVolume002 |
+-------+-------+----------+-------------+---------------------+---------------+
You have selected the following JobId: 3
Building directory tree for JobId 3 ...
+++++++++++++++++++++++++++++++++++++++++++++
1 Job, 6,078 files inserted into the tree.
You are now entering file selection mode where you add (mark) and
remove (unmark) files to be restored. No files are initially added, unless
you used the "all" keyword on the command line.
Enter "done" to leave this mode.
cwd is: /
$ unmark *
6,696 files unmarked.
$ mark *
6,696 files marked.
$ done
Bootstrap records written
to /home/kern/bacula/regress/working/rufus-dir.restore.1.bsr
The job will require the following
Volume(s) Storage(s) SD Device(s)
===========================================================================
TestVolume001 File1 FileStorage1
TestVolume002 File1 FileStorage1
6,696 files selected to be restored.
Run Restore job
JobName: RestoreFiles
Bootstrap: /home/kern/bacula/regress/working/rufus-dir.restore.1.bsr
Where: /home/kern/bacula/regress/tmp/bacula-restores
Replace: always
FileSet: Full Set
Backup Client: rufus-fd
Restore Client: rufus-fd
Storage: File1
When: 2007-09-10 12:38:49
Catalog: MyCatalog
Priority: 10
OK to run? (yes/mod/no): yes
Job queued. JobId=4
wait
You have messages.
messages
10-Sep 12:38 rufus-dir: Start Restore Job RestoreFiles.2007-09-10_12.38.49
10-Sep 12:38 rufus-dir: Using Device "FileStorage1"
10-Sep 12:38 rufus-sd: Ready to read from volume "TestVolume001" on
device "FileStorage1" (/home/kern/bacula/regress/tmp).
10-Sep 12:38 rufus-sd: Forward spacing Volume "TestVolume001" to file:block
0:10644692.
10-Sep 12:39 rufus-sd: End of file 0 on device "FileStorage1"
(/home/kern/bacula/regress/tmp), Volume "TestVolume001"
10-Sep 12:39 rufus-sd: End of Volume at file 0 on device "FileStorage1"
(/home/kern/bacula/regress/tmp), Volume "TestVolume001"
10-Sep 12:39 rufus-sd: Ready to read from volume "TestVolume002" on
device "FileStorage1" (/home/kern/bacula/regress/tmp).
10-Sep 12:39 rufus-sd: Forward spacing Volume "TestVolume002" to file:block
0:212.
10-Sep 12:39 rufus-sd: End of file 0 on device "FileStorage1"
(/home/kern/bacula/regress/tmp), Volume "TestVolume002"
10-Sep 12:39 rufus-sd: End of Volume at file 0 on device "FileStorage1"
(/home/kern/bacula/regress/tmp), Volume "TestVolume002"
10-Sep 12:39 rufus-sd: End of all volumes.
10-Sep 12:39 rufus-dir: Bacula rufus-dir 2.3.3 (06Sep07): 10-Sep-2007 12:39:52
Build OS: i686-pc-linux-gnu redhat
JobId: 4
Job: RestoreFiles.2007-09-10_12.38.49
Restore Client: rufus-fd
Start time: 10-Sep-2007 12:38:51
End time: 10-Sep-2007 12:39:52
Files Expected: 6,696
Files Restored: 6,696
Bytes Restored: 444,252,164
Rate: 7282.8 KB/s
FD Errors: 0
FD termination status: OK
SD termination status: OK
Termination: Restore OK
10-Sep 12:39 rufus-dir: Begin pruning Jobs.
10-Sep 12:39 rufus-dir: No Jobs found to prune.
10-Sep 12:39 rufus-dir: Begin pruning Files.
10-Sep 12:39 rufus-dir: No Files found to prune.
10-Sep 12:39 rufus-dir: End auto prune.
==================== end successful backup/restore ==========
Now if we dump the JobMedia records for the Job that failed to restore
properly, we get:
Enter SQL query: select * from JobMedia where JobId=3;
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+
| JobMediaId | JobId | MediaId | FirstIndex | LastIndex | StartFile | EndFile | StartBlock | EndBlock    | VolIndex | Copy | Stripe |
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+
|          2 |     3 |       2 |          1 |     4,324 |         0 |       0 | 57,867,465 | 899,939,859 |        1 |    0 |      0 |
|          5 |     3 |       2 |      4,324 |     6,696 |         0 |       0 |        212 | 313,858,028 |        2 |    0 |      0 |
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+
The key is to note that the MediaId is 2 for both of the records. That is
perfectly legal, but in this case, the first record was actually written on
MediaId=1. It was just incorrectly recorded.
Below is an example from the same test that succeeded. Note, the first record
has MediaId=1 and the second has MediaId=2. Some of the details of the start
and end block numbers are different, but that just shows that the timing was
different.
Enter SQL query: select * from JobMedia where JobId=3;
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+
| JobMediaId | JobId | MediaId | FirstIndex | LastIndex | StartFile | EndFile | StartBlock | EndBlock    | VolIndex | Copy | Stripe |
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+
|          3 |     3 |       1 |          1 |     4,198 |         0 |       0 | 46,771,401 | 899,230,227 |        1 |    0 |      0 |
|          5 |     3 |       2 |      4,198 |     6,696 |         0 |       0 |     64,724 | 313,858,028 |        2 |    0 |      0 |
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+
One aspect of this is to notice that if you sort the JobMedia records by
JobMediaId, as they are in the listing, then for any given MediaId the records
*must* have increasing StartBlock and EndBlock values, and they must not
overlap. In the case of the successful Job, the MediaIds are different, so we
cannot compare Start and End Block numbers, but in the case of the failed Job,
we see that there are two JobMedia records for the same MediaId (perfectly
OK), *but* the first one has a larger StartBlock than the second record, which
is not possible.
Now obviously, the simplest thing to do is to detect this condition and
correct the MediaId. However, this is not always so easy, because the Job
output does not *clearly* show what Volumes were written for what job, though
it can be deduced from looking at the full output, knowing when jobs start and
stop and looking at the times when Bacula changes Volumes. In addition, there
may be a very large number of JobMedia records if it is a big job.
Now for a look at the bsr files:
======= Unsuccessful bsr file ==================
# 10-Sep-2007 12:47:51 - Multi-client.2007-09-10_12.46.29 - Full
Volume="TestVolume002"
MediaType="File1"
VolSessionId=3
VolSessionTime=1189421182
VolFile=0-0
VolBlock=57867465-899939866
FileIndex=1-4337
Volume="TestVolume002"
MediaType="File1"
VolSessionId=3
VolSessionTime=1189421182
VolFile=0-0
VolBlock=64724-313858021
FileIndex=4337-6696
================================================
======== Successful bsr file ================
# 10-Sep-2007 12:38:48 - Multi-client.2007-09-10_12.37.09 - Full
Volume="TestVolume001"
MediaType="File1"
VolSessionId=3
VolSessionTime=1189420623
VolFile=0-0
VolBlock=10644692-899939866
FileIndex=1-4344
Volume="TestVolume002"
MediaType="File1"
VolSessionId=3
VolSessionTime=1189420623
VolFile=0-0
VolBlock=212-313858021
FileIndex=4344-6696
============================================
What is key in looking at these two files is that the first (incorrect) restore
lists TestVolume002 twice, whereas the second (successful) restore lists
TestVolume001 and TestVolume002. So, if you understand the problem,
you can also patch things up for the restore by manually editing the
.bsr file when you get to the yes/mod/no prompt before you answer yes.
For example, in the unsuccessful restore bsr file shown above, if I modify
the second line of the file to read:
Volume="TestVolume001"
then run the restore, it will work. Now obviously this is much more
complicated if there are a lot of volumes involved ...
===
Another way of correcting the problem may be to run bscan. However, there
are a number of aspects of bscanning that I don't like:
1. If you use the old bscan, it may fail the same way the original backup
failed to write correct JobMedia records (I haven't tested it).
2. Bscanning is slow.
3. Bscan does not restore *all* the info in the catalog, so you will
be left with a catalog that is so-so.
4. Bscanning with a 2.2.3 version may work, but as of this moment it
is untested.
If you want to go the bscan route, I recommend saving a copy of your
database (just in case), but then instead of purging your database and
recreating it with bscan, I would simply create a different database and
if it works, use that one for restores.
Finally, if bscan does work, it may be possible to purge just the job that
fails to restore correctly, then bscan only that job (probably using a
.bsr file). This minimizes the loss of information stored in your database.
This has not been tried yet.
===
Before investigating bscan in more detail, I will release 2.2.3 and work
on getting a patch for 2.0.x systems.
This document can be found at www.bacula.org/downloads/bug-395.txt and will be
updated as the situation progresses.
Regards,
Kern