Hello Wolfgang,
As I have mentioned before, I believe that this is an OS bug. I have been
able to reproduce it here, and I believe it happens when you send too much
data too fast to the OS. Please see below for more comments:
On Friday 18 August 2006 18:22, Wolfgang Denk wrote:
> Dear Kern,
>
> in message <[EMAIL PROTECTED]> you wrote:
> >
> > Thanks for the feedback. I'm happy to see that your problem is resolved.
>
> I start seeing this problem now, too, especially when running several
> jobs (with data spooling) in parallel, and when the tape gets full.
>
> > I guess that the next time I respond, I should not only mention the fact
that
> > Bacula tends to expose a lot of driver bugs AND hardware problems, both of
> > which are difficult and time consuming to debug ...
>
> I don't think it's either of these in my case. When I run into this
> situation, I can reliably reproduce the problem (but I don't know
> which exact conditions trigger it in the first place). Here is what
> happens:
>
> 17-Aug 00:40 nyx-sd: Spooling data ...
> 17-Aug 00:40 nyx-sd: Committing spooled data to Volume "K-V-DAT-3".
Despooling 1,985,286 bytes ...
> 17-Aug 00:40 nyx-sd: Source-Other.2006-08-17_00.10.04 Error: block.c:538
Write error at 22:2740 on device "DDS-3" (/dev/nst0). ERR=Device or resource
busy.
> 17-Aug 00:40 nyx-sd: Re-read of last block succeeded.
> 17-Aug 00:40 nyx-sd: End of medium on Volume "K-V-DAT-3"
Bytes=15,118,094,426 Blocks=234,370 at 17-Aug-2006 00:40.
> 17-Aug 00:41 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 17-Aug 00:41 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 3.
> 17-Aug 00:41 nyx-sd: 3307 Issuing autochanger "unload slot 3, drive 0"
command.
> 17-Aug 00:42 nyx-sd: 3304 Issuing autochanger "load slot 4, drive 0"
command.
> 17-Aug 00:43 nyx-sd: 3305 Autochanger "load slot 4, drive 0", status is OK.
> 17-Aug 00:43 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 17-Aug 00:43 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 4.
> 17-Aug 00:43 nyx-sd: Wrote label to prelabeled Volume "K-V-DAT-4" on device
"DDS-3" (/dev/nst0)
> 17-Aug 00:43 nyx-sd: New volume "K-V-DAT-4" mounted on device
"DDS-3" (/dev/nst0) at 17-Aug-2006 00:43.
> 17-Aug 00:43 nyx-sd: Source-Other.2006-08-17_00.10.04 Error: block.c:538
Write error at 0:1 on device "DDS-3" (/dev/nst0). ERR=Device or resource
busy.
The above error "Device or resource" should never happen, from what I know
about Linux. The write() system call should be a blocking call and it should
never return busy, thus I consider this is most likely an operating system
bug. The proper behavior is for the OS to simply wait until the I/O is
complete and then return.
At one point, I was able to reproduce this problem here, and in 1.39.x I have
what I consider a really crude workaround. Basically Bacula will retry the
I/O after waiting a bit. Using this technique, I was able to eliminate the
problem here, but I am not really satisfied with the fix as the amount of
time that Bacula needs to wait is indeterminant.
As far as I can tell, this problem occurs because Bacula is feeding data to
the tape drive too fast. I can only guess that the OS memory is getting
completely full and at some point rather than making Bacula wait as it
should, the OS returns an error code. After that point, everything for that
particular job is more or less screwed up as you can see in the output that
follows, because Bacula gets an error during the rewind since the OS is
probably still busy.
> 17-Aug 00:43 nyx-sd: Source-Other.2006-08-17_00.10.04 Error: Re-read of last
block OK, but block numbers differ. Last block=0 Current block=0.
> 17-Aug 00:43 nyx-sd: Source-Other.2006-08-17_00.10.04 Fatal error:
spool.c:249 Fatal append error on device "DDS-3" (/dev/nst0): ERR=dev.c:678
Rewind error on "DDS-3" (/dev/nst0). ERR=Input/output error.
>
> ...
>
> 18-Aug 00:40 nyx-sd: Committing spooled data to Volume "K-V-DAT-4".
Despooling 948,750,016 bytes ...
> 18-Aug 00:40 nyx-sd: Castor-Home-1.2006-08-18_00.05.00 Error: block.c:538
Write error at 2:6438 on device "DDS-3" (/dev/nst0). ERR=Device or resource
busy.
> 18-Aug 00:41 nyx-sd: Re-read of last block succeeded.
> 18-Aug 00:41 nyx-sd: End of medium on Volume "K-V-DAT-4" Bytes=691,688,327
Blocks=10,727 at 18-Aug-2006 00:41.
> 18-Aug 00:41 castor-dir: Recycled volume "K-V-DAT-1"
> 18-Aug 00:41 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 00:41 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 4.
> 18-Aug 00:41 nyx-sd: 3307 Issuing autochanger "unload slot 4, drive 0"
command.
> 18-Aug 00:41 nyx-sd: 3304 Issuing autochanger "load slot 1, drive 0"
command.
> 18-Aug 00:42 nyx-sd: 3305 Autochanger "load slot 1, drive 0", status is OK.
> 18-Aug 00:42 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 00:42 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 00:42 nyx-sd: Recycled volume "K-V-DAT-1" on device
"DDS-3" (/dev/nst0), all previous data lost.
> 18-Aug 00:42 nyx-sd: New volume "K-V-DAT-1" mounted on device
"DDS-3" (/dev/nst0) at 18-Aug-2006 00:42.
> 18-Aug 00:42 nyx-sd: Castor-Home-1.2006-08-18_00.05.00 Error: block.c:538
Write error at 0:1 on device "DDS-3" (/dev/nst0). ERR=Device or resource
busy.
> 18-Aug 00:42 nyx-sd: Castor-Home-1.2006-08-18_00.05.00 Error: Re-read of
last block OK, but block numbers differ. Last block=0 Current block=0.
> 18-Aug 00:42 nyx-sd: Castor-Home-1.2006-08-18_00.05.00 Fatal error:
spool.c:249 Fatal append error on device "DDS-3" (/dev/nst0): ERR=dev.c:678
Rewind error on "DDS-3" (/dev/nst0). ERR=Input/output error.
> ...
> 18-Aug 00:42 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 00:42 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 00:42 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538
Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource
busy.
> 18-Aug 00:43 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace
record at EOT failed. ERR=Input/output error
> 18-Aug 00:43 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441
Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error
on "DDS-3" (/dev/nst0). ERR=Input/output error.
>
> 18-Aug 00:43 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
>
>
> In this state, the newly mounted volumes ("K-V-DAT-4" and
> "K-V-DAT-1") are empty, i. e. they have NO data on it. Even the
> previously existing bacula labels are gone.
I suspect that this is because the OS has been so totally overloaded by data
that it gets totally lost, which in turn confuses Bacula as it attempts to
salvage what it can from the backup.
>
>
>
> I can now reproduce this failure mode; for example, I try to recycle
> the last corrupted tape (which has no data on it anyway): I unmount
> it, delete it from the catalog, label it again (using the same or
> another volume name), mount it again. The hanging backup job will
> attempt to write on it again, and corrupt it again:
At this point, everything is more or less screwed up for that particular job.
The file descriptor that the job holds for the tape is probably no longer
valid, and as you see below, it continues to immediately get the "Device or
resource busy" error. Thus is is better to cancel the job, and if that does
not resolve the problem, to restart the Storage daemon, because things have
gone terribly wrong.
>
> 18-Aug 01:14 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 01:14 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 01:14 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538
Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource
busy.
> 18-Aug 01:14 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace
record at EOT failed. ERR=Input/output error
> 18-Aug 01:14 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441
Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error
on "DDS-3" (/dev/nst0). ERR=Input/output error.
>
> 18-Aug 01:14 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
> 18-Aug 01:14 nyx-sd: Invalid slot=0 defined, cannot autoload Volume.
> 18-Aug 01:14 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device
"DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 01:38 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 01:38 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 01:38 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538
Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource
busy.
> 18-Aug 01:38 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace
record at EOT failed. ERR=Input/output error
> 18-Aug 01:38 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441
Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error
on "DDS-3" (/dev/nst0). ERR=Input/output error.
>
> 18-Aug 01:38 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
> 18-Aug 01:38 nyx-sd: Invalid slot=0 defined, cannot autoload Volume.
> 18-Aug 01:38 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device
"DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 01:43 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device
"DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 03:43 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device
"DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 07:43 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device
"DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 10:19 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 10:19 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 10:19 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538
Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource
busy.
> 18-Aug 10:19 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace
record at EOT failed. ERR=Input/output error
> 18-Aug 10:19 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441
Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error
on "DDS-3" (/dev/nst0). ERR=Input/output error.
>
> 18-Aug 10:19 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
> 18-Aug 10:19 nyx-sd: Invalid slot=0 defined, cannot autoload Volume.
> 18-Aug 10:19 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device
"DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 10:29 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 10:29 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 10:29 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538
Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource
busy.
> 18-Aug 10:29 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace
record at EOT failed. ERR=Input/output error
> 18-Aug 10:29 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441
Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error
on "DDS-3" (/dev/nst0). ERR=Input/output error.
>
> 18-Aug 10:29 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
> 18-Aug 10:29 nyx-sd: Please mount Volume "K-V-DAT-1" on Storage Device
"DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
>
>
> To me it seems as if bacula somehwat gets confused when several jobs
> are despooling data in parallel, and a medium gets full and is
> changed.
>
>
> All this is with Bacula 1.38.11 (28Jun06).
>
>
> Any ideas?
I think that it is the OS that is getting confused or overloaded. It returns
an error status that should never happen, and from there on Bacula is unable
to recover.
>
>
> Best regards,
>
> Wolfgang Denk
>
> --
> Software Engineering: Embedded and Realtime Systems, Embedded Linux
> Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: [EMAIL PROTECTED]
> When the ax entered the forest, the trees said, "The handle is one of
> us!" -- Turkish proverb
>
> -------------------------------------------------------------------------
> Using Tomcat but need to do more? Need to support web services, security?
> Get stuff done quickly with pre-integrated technology to make your job
easier
> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
> _______________________________________________
> Bacula-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/bacula-users
>
-------------------------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
_______________________________________________
Bacula-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-users