Re: [Bacula-users] Unstable network causing too many tapes to end up with error status

Jeronimo Zucco Tue, 25 Sep 2007 12:35:33 -0700

Gustavo Noronha wrote:
> Hello,
>
> We have a fairly large Bacula setup here with lots of clients, some
> of them with many different jobs and filesets (such as the database
> servers, which have their datafiles backed up separately and their
> archive logs backed up every 30 minutes).
>
> Our network is quite unstable, and there's not much I can do about some
> of the problems we have (because they are handled by third party
> companies). Bacula runs mostly OK, but its SD also seems to be the cause
> of some of the problems we have: it segfaults from time to time, and
> sometimes it seems like it is stalled, and only starts working again
> with a restart.
>
> Our main backup hardware is a Dell PowerVault 132T, connected to a Dell
> server via SCSI. We run Debian GNU/Linux sarge, with the backported
> Bacula package (1.38.11-5~bpo.1). We upgraded to 1.38 because we were
> having a lot of problems with the 1.36 version which seemed to be fixed
> in 1.38, and, in fact, they have gone away. The tape drive is this one:
>
> (scsi1:A:1): 160.000MB/s transfers (80.000MHz DT, offset 31, 16bit)
>   Vendor: DELL      Model: PV-132T           Rev: 227D
>   Type:   Medium Changer                     ANSI SCSI revision: 02
>   Vendor: IBM       Model: ULTRIUM-TD2       Rev: 37RH
>   Type:   Sequential-Access                  ANSI SCSI revision: 03
>
> The main problem we have is that some jobs sometimes fail because of the
> network being unstable, and Bacula decides to mark the tape as failed.
> This one, for instance:
>
> 23-Sep 23:52 lab02-sd: Writing spooled data to Volume. Despooling 197,378,916 
> bytes ...
> 23-Sep 23:58 lab02-sd: Writing spooled data to Volume. Despooling 658,402,880 
> bytes ...
> 24-Sep 00:06 rivelino-fd: RIVELINO.2007-09-22_00.01.28 Fatal error: 
> c:\cygwin\home\kern\bacula\k\src\win32\filed\../../filed/backup.c:500 Network 
> send error to SD. ERR=Input/output error
> [...]
> 24-Sep 00:07 lab02-dir: Rescheduled Job RIVELINO.2007-09-22_00.01.28 at 
> 24-Sep-2007 00:07 to re-run in 60 seconds (24-Sep-2007 00:08).
> 24-Sep 00:06 lab02-dir: Job MDS513_TS_TBS_MALADIRETA_IDX.2007-09-24_00.06.24 
> waiting 60 seconds for scheduled start time.
> 24-Sep 00:07 lab02-dir: Start Backup JobId 6634, 
> Job=MDS513_TS_TBS_MALADIRETA_IDX.2007-09-24_00.06.24
> 24-Sep 00:07 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 24-Sep 00:07 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 4.
> 24-Sep 00:07 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 24-Sep 00:07 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 4.
> 24-Sep 00:07 lab02-sd: Volume "LTO_0004" previously written, moving to end of 
> data.
> 24-Sep 00:08 lab02-sd: MDS513_TS_TBS_MALADIRETA_IDX.2007-09-24_00.06.24 
> Error: I cannot write on Volume "LTO_0004" because: The number of files 
> mismatch! Volume=57 Catalog=56
> 24-Sep 00:08 lab02-sd: Marking Volume "LTO_0004" in Error in Catalog.
> 24-Sep 00:09 lab02-dir: Recycled volume "LTO_0006"
>
> The SD also seems to be having trouble. On weekends we run full jobs for
> almost all our clients. This Sunday morning I went to check how the
> backups were going and found nearly all the jobs stalled. I then checked
> the SD: I entered bconsole, tried a status storage, and straced the
> daemon. It seemed to be doing status information checks over and over
> again (.status was appearing from time to time), and not much else was
> being done. I then stopped and restarted it, and the jobs went ahead.
> This is the part of the log that matters:
>
> 21-Sep 22:03 mds520-fd: DIR and FD clocks differ by -3 seconds, FD 
> automatically adjusting.
> 21-Sep 22:03 lab02-sd: Spooling data ...
> 21-Sep 22:11 lab02-sd: Committing spooled data to Volume "LTO_0002". 
> Despooling 2,780,433,476 bytes ...
> 21-Sep 22:14 lab02-sd: Sending spooled attrs to the Director. Despooling 
> 3,511 bytes ...
> 23-Sep 10:28 mds520-fd: MDS520.2007-09-21_22.01.03 Fatal error: job.c:1614 
> Comm error with SD. bad response to Append Data. ERR=Não há 
> dados disponíveis (NOTE: data unavailable)
> 23-Sep 10:28 lab02-dir: MDS520.2007-09-21_22.01.03 Error: Bacula 1.38.11 
> (28Jun06): 23-Sep-2007 10:28:48
>   JobId:                  6299
>   Job:                    MDS520.2007-09-21_22.01.03
> [...]
> 23-Sep 10:28 lab02-dir: Start Backup JobId 6303, 
> Job=MDS513_TS_SYSAUX.2007-09-21_22.30.02
> 23-Sep 10:33 lab02-sd: Spooling data ...
> 23-Sep 10:34 lab02-sd: Committing spooled data to Volume "LTO_0008". 
> Despooling 79,420,983 bytes ...
> 23-Sep 10:34 lab02-sd: Sending spooled attrs to the Director. Despooling 293 
> bytes ...
>
> On the SD log:
>
> 21-Sep 22:08 lab02-sd: Committing spooled data to Volume "LTO_0002". 
> Despooling 1,172,134,636 bytes ...
> 21-Sep 22:11 lab02-sd: Committing spooled data to Volume "LTO_0002". 
> Despooling 2,780,433,476 bytes ...
> 21-Sep 22:14 lab02-sd: Sending spooled attrs to the Director. Despooling 309 
> bytes ...
> 21-Sep 21:30 21-Sep 06:33 21-Sep 05:57 lab02-sd: Writing spooled data to 
> Volume. Despooling 5,014,740,654 bytes ...
> 23-Sep 10:28 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 23-Sep 10:28 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
> 23-Sep 10:28 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 23-Sep 10:28 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
> 23-Sep 10:30 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 23-Sep 10:30 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
> 23-Sep 10:30 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 23-Sep 10:30 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
> 23-Sep 10:30 lab02-sd: Volume "LTO_0002" previously written, moving to end of 
> data.
> 23-Sep 10:31 lab02-sd: MDS513_TS_SYSTEM.2007-09-21_23.01.01 Error: I cannot 
> write on Volume "LTO_0002" because:
> The number of files mismatch! Volume=11 Catalog=10
> 23-Sep 10:31 lab02-sd: Marking Volume "LTO_0002" in Error in Catalog.
> 23-Sep 10:32 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 23-Sep 10:32 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
> 23-Sep 10:32 lab02-sd: 3307 Issuing autochanger "unload slot 2, drive 0" 
> command.
> 23-Sep 10:32 lab02-sd: 3304 Issuing autochanger "load slot 8, drive 0" 
> command.
> 23-Sep 10:33 lab02-sd: 3305 Autochanger "load slot 8, drive 0", status is OK.
> 23-Sep 10:33 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 23-Sep 10:33 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 8.
> 23-Sep 10:33 lab02-sd: Recycled volume "LTO_0008" on device "LTO" 
> (/dev/nst0), all previous data lost.
> 23-Sep 10:33 lab02-sd: Spooling data ...
> 23-Sep 10:33 lab02-sd: Spooling data ...
> 23-Sep 10:33 lab02-sd: Spooling data ...
>
> Here are full configuration and log files, I just changed the passwords
> and tried to hide the file names on the error messages:
>
>       http://mds510.mds.gov.br/~kov/bacula-info.tar.gz
>
> Does anyone have any insight on how we could make Bacula more robust when
> these kinds of network errors happen? Perhaps by adopting the catalog or
> tape as canonical and updating the other, instead of marking the tape as
> in error? Also, any insight on what is happening to the SD? I am thinking
> about limiting the number of concurrent jobs on the SD to 1 to see if it
> helps. Does that make sense to you?
>
>


Hi Gustavo. I'm from Brazil too :-)

    Have you thought about creating a separate network just for backups?
Servers today come with two network cards, so you can use the second one
just for backups and avoid the problems you are having.
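
    A minimal sketch of how that could look on the Director side, assuming
the SD host has a second interface on the dedicated backup network. The
address 10.10.0.2, the resource name, the Media Type, and the password
placeholder are hypothetical; only the device name "LTO" comes from the logs
above. The Address given in the Director's Storage resource is passed on to
the File daemon, which uses it to contact the Storage daemon, so pointing it
at the backup-network interface should move the data traffic there:

```conf
# bacula-dir.conf (sketch only; adapt names and addresses to your setup)
Storage {
  Name = LTO-backup-net          # hypothetical resource name
  Address = 10.10.0.2            # hypothetical SD address on the backup network
  SDPort = 9103                  # default Storage daemon port
  Password = "changeme"          # placeholder; must match bacula-sd.conf
  Device = LTO                   # device name as it appears in the logs above
  Media Type = LTO-2             # hypothetical; must match the SD Device resource
  Autochanger = yes
}
```

    Each client would also need an interface and a route on that network, and
the Client resources could likewise point at the clients' backup-network
addresses.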

    I also recommend upgrading to version 2.2.4, because of the serious
bug #935:

http://www.bacula.org/downloads/bug-935.txt


   

-- 
Jeronimo Zucco
LPIC-1 Linux Professional Institute Certified
Núcleo de Processamento de Dados
Universidade de Caxias do Sul

http://jczucco.blogspot.com

