Gustavo Noronha escreveu:
> Hello,
>
> We have a somewhat big setup of Bacula here with lots of clients, some
> of them with many differente jobs and filesets (such as the database
> servers, which have their datafiles backed up separately and their
> archive logs backed up from 30 to 30 minutes).
>
> Our network is quite unstable, and there's not much I can do about some
> of the problems we have (because they are handled by third party
> companies). Bacula runs mostly OK, but its SD also seems to be the cause
> of some of the problems we have: it segfaults from time to time, and
> sometimes it seems like it is stalled, and only starts working again
> with a restart.
>
> Our main backup hardware is a Dell Poweredge T132, connected to a Dell
> server via SCSI. We run Debian GNU/Linux sarge, with the backported
> Bacula package (1.38.11-5~bpo.1). We upgraded to 1.38 because we were
> having a lot of problems with the 1.36 version which seemed to be fixed
> in 1.38, and, in fact, they have gone away. The tape drive is this one:
>
> (scsi1:A:1): 160.000MB/s transfers (80.000MHz DT, offset 31, 16bit)
> Vendor: DELL Model: PV-132T Rev: 227D
> Type: Medium Changer ANSI SCSI revision: 02
> Vendor: IBM Model: ULTRIUM-TD2 Rev: 37RH
> Type: Sequential-Access ANSI SCSI revision: 03
>
> The main problem we have is that some jobs sometimes fail because of the
> network being unstable, and Bacula decides to mark the tape as failed.
> This one, for instance:
>
> 23-Sep 23:52 lab02-sd: Writing spooled data to Volume. Despooling 197,378,916
> bytes ...
> 23-Sep 23:58 lab02-sd: Writing spooled data to Volume. Despooling 658,402,880
> bytes ...
> 24-Sep 00:06 rivelino-fd: RIVELINO.2007-09-22_00.01.28 Fatal error:
> c:\cygwin\home\kern\bacula\k\src\win32\filed\../../filed/backup.c:500 Network
> send error to SD. ERR=Input/output error
> [...]
> 24-Sep 00:07 lab02-dir: Rescheduled Job RIVELINO.2007-09-22_00.01.28 at
> 24-Sep-2007 00:07 to re-run in 60 seconds (24-Sep-2007 00:08).
> 24-Sep 00:06 lab02-dir: Job MDS513_TS_TBS_MALADIRETA_IDX.2007-09-24_00.06.24
> waiting 60 seconds for scheduled start time.
> 24-Sep 00:07 lab02-dir: Start Backup JobId 6634,
> Job=MDS513_TS_TBS_MALADIRETA_IDX.2007-09-24_00.06.24
> 24-Sep 00:07 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 24-Sep 00:07 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 4.
> 24-Sep 00:07 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 24-Sep 00:07 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 4.
> 24-Sep 00:07 lab02-sd: Volume "LTO_0004" previously written, moving to end of
> data.
> 24-Sep 00:08 lab02-sd: MDS513_TS_TBS_MALADIRETA_IDX.2007-09-24_00.06.24
> Error: I cannot write on Volume "LTO_0004" because: The number of files
> mismatch! Volume=57 Catalog=56
> 24-Sep 00:08 lab02-sd: Marking Volume "LTO_0004" in Error in Catalog.
> 24-Sep 00:09 lab02-dir: Recycled volume "LTO_0006"
>
> The SD also seems to be having trouble. On weekends we run full jobs for
> almost all our clients. This sunday morning I went to check how the
> backups were going to find all the jobs mostly stalled. I then checked
> the SD, entered bconsole, tried a status storage, straced it, and it
> seemed to be doing status information checks over and over again
> (.status was appearing from time to time), not much was being done, it
> seemed. I then stopped it, started it, and then the jobs went ahead.
> This is the part of the log that matters:
>
> 21-Sep 22:03 mds520-fd: DIR and FD clocks differ by -3 seconds, FD
> automatically adjusting.
> 21-Sep 22:03 lab02-sd: Spooling data ...
> 21-Sep 22:11 lab02-sd: Committing spooled data to Volume "LTO_0002".
> Despooling 2,780,433,476 bytes ...
> 21-Sep 22:14 lab02-sd: Sending spooled attrs to the Director. Despooling
> 3,511 bytes ...
> 23-Sep 10:28 mds520-fd: MDS520.2007-09-21_22.01.03 Fatal error: job.c:1614
> Comm error with SD. bad response to Append Data. ERR=N<C3><A3>o h<C3><A1>
> dados dispon<C3><AD>veis (NOTE: data unavailable)
> 23-Sep 10:28 lab02-dir: MDS520.2007-09-21_22.01.03 Error: Bacula 1.38.11
> (28Jun06): 23-Sep-2007 10:28:48
> JobId: 6299
> Job: MDS520.2007-09-21_22.01.03
> [...]
> 23-Sep 10:28 lab02-dir: Start Backup JobId 6303,
> Job=MDS513_TS_SYSAUX.2007-09-21_22.30.02
> 23-Sep 10:33 lab02-sd: Spooling data ...
> 23-Sep 10:34 lab02-sd: Committing spooled data to Volume "LTO_0008".
> Despooling 79,420,983 bytes ...
> 23-Sep 10:34 lab02-sd: Sending spooled attrs to the Director. Despooling 293
> bytes ...
>
> On the SD log:
>
> 21-Sep 22:08 lab02-sd: Committing spooled data to Volume "LTO_0002".
> Despooling 1,172,134,636 bytes ...
> 21-Sep 22:11 lab02-sd: Committing spooled data to Volume "LTO_0002".
> Despooling 2,780,433,476 bytes ...
> 21-Sep 22:14 lab02-sd: Sending spooled attrs to the Director. Despooling 309
> bytes ...
> 21-Sep 21:30 21-Sep 06:33 21-Sep 05:57 lab02-sd: Writing spooled data to
> Volume. Despooling 5,014,740,654 bytes ...
> 23-Sep 10:28 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 23-Sep 10:28 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
> 23-Sep 10:28 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 23-Sep 10:28 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
> 23-Sep 10:30 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 23-Sep 10:30 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
> 23-Sep 10:30 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 23-Sep 10:30 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
> 23-Sep 10:30 lab02-sd: Volume "LTO_0002" previously written, moving to end of
> data.
> 23-Sep 10:31 lab02-sd: MDS513_TS_SYSTEM.2007-09-21_23.01.01 Error: I cannot
> write on Volume "LTO_0002" because:
> The number of files mismatch! Volume=11 Catalog=10
> 23-Sep 10:31 lab02-sd: Marking Volume "LTO_0002" in Error in Catalog.
> 23-Sep 10:32 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 23-Sep 10:32 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
> 23-Sep 10:32 lab02-sd: 3307 Issuing autochanger "unload slot 2, drive 0"
> command.
> 23-Sep 10:32 lab02-sd: 3304 Issuing autochanger "load slot 8, drive 0"
> command.
> 23-Sep 10:33 lab02-sd: 3305 Autochanger "load slot 8, drive 0", status is OK.
> 23-Sep 10:33 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 23-Sep 10:33 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 8.
> 23-Sep 10:33 lab02-sd: Recycled volume "LTO_0008" on device "LTO"
> (/dev/nst0), all previous data lost.
> 23-Sep 10:33 lab02-sd: Spooling data ...
> 23-Sep 10:33 lab02-sd: Spooling data ...
> 23-Sep 10:33 lab02-sd: Spooling data ...
>
> Here are full configuration and log files, I just changed the passwords
> and tried to hide the file names on the error messages:
>
> http://mds510.mds.gov.br/~kov/bacula-info.tar.gz
>
> Anyone has any insight on how we could make Bacula be tougher when these
> kinds of network errors happen? Perhaps by adopting the catalog or tape
> as canonical and updating the other, instead of marking the tape as in
> error? Also, any insight on what is happening to the SD? I am thinking
> about limiting the number of concurrent jobs on the SD to 1 to see if it
> helps, does that make sense to you?
>
>
Hi Gustavo. I'm from Brazil too :-)
Did you think about create another network just for backups ? Today
servers comming with two network cards, and you can use this network
just for backup and avoid this problems you have.
I also recommend you to upgrade to 2.2.4 version, because of serious
bug #935:
http://www.bacula.org/downloads/bug-935.txt
--
Jeronimo Zucco
LPIC-1 Linux Professional Institute Certified
NĂșcleo de Processamento de Dados
Universidade de Caxias do Sul
http://jczucco.blogspot.com
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Bacula-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-users