[Bacula-users] Unstable network causing too many tapes to end up with error status

Gustavo Noronha Mon, 24 Sep 2007 08:23:45 -0700

Hello,

We have a somewhat big setup of Bacula here with lots of clients, some
of them with many differente jobs and filesets (such as the database
servers, which have their datafiles backed up separately and their
archive logs backed up from 30 to 30 minutes).


Our network is quite unstable, and there's not much I can do about some
of the problems we have (because they are handled by third party
companies). Bacula runs mostly OK, but its SD also seems to be the cause
of some of the problems we have: it segfaults from time to time, and
sometimes it seems like it is stalled, and only starts working again
with a restart.

Our main backup hardware is a Dell Poweredge T132, connected to a Dell
server via SCSI. We run Debian GNU/Linux sarge, with the backported
Bacula package (1.38.11-5~bpo.1). We upgraded to 1.38 because we were
having a lot of problems with the 1.36 version which seemed to be fixed
in 1.38, and, in fact, they have gone away. The tape drive is this one:

(scsi1:A:1): 160.000MB/s transfers (80.000MHz DT, offset 31, 16bit)
  Vendor: DELL      Model: PV-132T           Rev: 227D
  Type:   Medium Changer                     ANSI SCSI revision: 02
  Vendor: IBM       Model: ULTRIUM-TD2       Rev: 37RH
  Type:   Sequential-Access                  ANSI SCSI revision: 03

The main problem we have is that some jobs sometimes fail because of the
network being unstable, and Bacula decides to mark the tape as failed.
This one, for instance:

23-Sep 23:52 lab02-sd: Writing spooled data to Volume. Despooling 197,378,916 
bytes ...
23-Sep 23:58 lab02-sd: Writing spooled data to Volume. Despooling 658,402,880 
bytes ...
24-Sep 00:06 rivelino-fd: RIVELINO.2007-09-22_00.01.28 Fatal error: 
c:\cygwin\home\kern\bacula\k\src\win32\filed\../../filed/backup.c:500 Network 
send error to SD. ERR=Input/output error
[...]
24-Sep 00:07 lab02-dir: Rescheduled Job RIVELINO.2007-09-22_00.01.28 at 
24-Sep-2007 00:07 to re-run in 60 seconds (24-Sep-2007 00:08).
24-Sep 00:06 lab02-dir: Job MDS513_TS_TBS_MALADIRETA_IDX.2007-09-24_00.06.24 
waiting 60 seconds for scheduled start time.
24-Sep 00:07 lab02-dir: Start Backup JobId 6634, 
Job=MDS513_TS_TBS_MALADIRETA_IDX.2007-09-24_00.06.24
24-Sep 00:07 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
24-Sep 00:07 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 4.
24-Sep 00:07 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
24-Sep 00:07 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 4.
24-Sep 00:07 lab02-sd: Volume "LTO_0004" previously written, moving to end of 
data.
24-Sep 00:08 lab02-sd: MDS513_TS_TBS_MALADIRETA_IDX.2007-09-24_00.06.24 Error: 
I cannot write on Volume "LTO_0004" because: The number of files mismatch! 
Volume=57 Catalog=56
24-Sep 00:08 lab02-sd: Marking Volume "LTO_0004" in Error in Catalog.
24-Sep 00:09 lab02-dir: Recycled volume "LTO_0006"

The SD also seems to be having trouble. On weekends we run full jobs for
almost all our clients. This sunday morning I went to check how the
backups were going to find all the jobs mostly stalled. I then checked
the SD, entered bconsole, tried a status storage, straced it, and it
seemed to be doing status information checks over and over again
(.status was appearing from time to time), not much was being done, it
seemed. I then stopped it, started it, and then the jobs went ahead.
This is the part of the log that matters:

21-Sep 22:03 mds520-fd: DIR and FD clocks differ by -3 seconds, FD 
automatically adjusting.
21-Sep 22:03 lab02-sd: Spooling data ...
21-Sep 22:11 lab02-sd: Committing spooled data to Volume "LTO_0002". Despooling 
2,780,433,476 bytes ...
21-Sep 22:14 lab02-sd: Sending spooled attrs to the Director. Despooling 3,511 
bytes ...
23-Sep 10:28 mds520-fd: MDS520.2007-09-21_22.01.03 Fatal error: job.c:1614 Comm 
error with SD. bad response to Append Data. ERR=N<C3><A3>o h<C3><A1> dados 
dispon<C3><AD>veis (NOTE: data unavailable)
23-Sep 10:28 lab02-dir: MDS520.2007-09-21_22.01.03 Error: Bacula 1.38.11 
(28Jun06): 23-Sep-2007 10:28:48
  JobId:                  6299
  Job:                    MDS520.2007-09-21_22.01.03
[...]
23-Sep 10:28 lab02-dir: Start Backup JobId 6303, 
Job=MDS513_TS_SYSAUX.2007-09-21_22.30.02
23-Sep 10:33 lab02-sd: Spooling data ...
23-Sep 10:34 lab02-sd: Committing spooled data to Volume "LTO_0008". Despooling 
79,420,983 bytes ...
23-Sep 10:34 lab02-sd: Sending spooled attrs to the Director. Despooling 293 
bytes ...

On the SD log:

21-Sep 22:08 lab02-sd: Committing spooled data to Volume "LTO_0002". Despooling 
1,172,134,636 bytes ...
21-Sep 22:11 lab02-sd: Committing spooled data to Volume "LTO_0002". Despooling 
2,780,433,476 bytes ...
21-Sep 22:14 lab02-sd: Sending spooled attrs to the Director. Despooling 309 
bytes ...
21-Sep 21:30 21-Sep 06:33 21-Sep 05:57 lab02-sd: Writing spooled data to 
Volume. Despooling 5,014,740,654 bytes ...
23-Sep 10:28 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
23-Sep 10:28 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
23-Sep 10:28 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
23-Sep 10:28 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
23-Sep 10:30 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
23-Sep 10:30 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
23-Sep 10:30 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
23-Sep 10:30 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
23-Sep 10:30 lab02-sd: Volume "LTO_0002" previously written, moving to end of 
data.
23-Sep 10:31 lab02-sd: MDS513_TS_SYSTEM.2007-09-21_23.01.01 Error: I cannot 
write on Volume "LTO_0002" because:
The number of files mismatch! Volume=11 Catalog=10
23-Sep 10:31 lab02-sd: Marking Volume "LTO_0002" in Error in Catalog.
23-Sep 10:32 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
23-Sep 10:32 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2.
23-Sep 10:32 lab02-sd: 3307 Issuing autochanger "unload slot 2, drive 0" 
command.
23-Sep 10:32 lab02-sd: 3304 Issuing autochanger "load slot 8, drive 0" command.
23-Sep 10:33 lab02-sd: 3305 Autochanger "load slot 8, drive 0", status is OK.
23-Sep 10:33 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command.
23-Sep 10:33 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 8.
23-Sep 10:33 lab02-sd: Recycled volume "LTO_0008" on device "LTO" (/dev/nst0), 
all previous data lost.
23-Sep 10:33 lab02-sd: Spooling data ...
23-Sep 10:33 lab02-sd: Spooling data ...
23-Sep 10:33 lab02-sd: Spooling data ...

Here are full configuration and log files, I just changed the passwords
and tried to hide the file names on the error messages:

        http://mds510.mds.gov.br/~kov/bacula-info.tar.gz

Anyone has any insight on how we could make Bacula be tougher when these
kinds of network errors happen? Perhaps by adopting the catalog or tape
as canonical and updating the other, instead of marking the tape as in
error? Also, any insight on what is happening to the SD? I am thinking
about limiting the number of concurrent jobs on the SD to 1 to see if it
helps, does that make sense to you?

Thanks,

-- 
Gustavo Noronha <[EMAIL PROTECTED]>
 Coordenação de Sustentação e Segurança/CGI

signature.asc
Description: Esta é uma parte de mensagem assinada digitalmente

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

_______________________________________________
Bacula-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-users

[Bacula-users] Unstable network causing too many tapes to end up with error status

Reply via email to