A couple of weeks ago, a problem started cropping up. Jobs started failing
with what look like network errors:
02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error: append.c:259 Network error on data channel. ERR=Input/output error
02-Jun 01:10 lorien-sd: Job write elapsed time = 00:03:16, Transfer rate = 4.157 M bytes/second
02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Error: bnet.c:280 Read expected 65536 got 16384 from client:130.215.39.18:36643
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: Network error with FD during Backup: ERR=No data available
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: No Job status returned from FD.
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Error: Bacula 2.0.3 (06Mar07): 02-Jun-2007 01:10:40
However, I can find no evidence of any actual network problem between the
machine running the fd and the one running both the sd and dir:
- The network monitoring system shows no outages, and none of the switches
and routers in between show anything out of the ordinary in the logs.
- There is no external firewall between the two systems. Both ends are Linux
2.6 with iptables, with non-stateful rules for all Bacula traffic.
- IP flow logs show that both ends of the FD -> SD TCP connection
ungracefully closed down the stream with a RST after a very short idle period
of about 10 seconds.
- I've already tried swapping to a different NIC on the server to rule out a
dying network card.
- The failure occurs on different machines, ruling out something specific to
one client, though it usually appears to affect the same one. More
specifically, it always seems to die around the same time - about ten minutes
after the batch of nightly jobs starts. I have things configured to run four
concurrent jobs, and the failures will cancel anywhere from one to four jobs.
When multiple jobs die, they all do so at the same time. I can influence
which clients get picked on by shuffling around priorities.
- Rerunning the failed job - either by itself or queued up with a bunch of
other ones - always appears to work as expected.
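
To check the "they all die at the same time" pattern against a longer stretch
of logs, something like the rough Python sketch below could group fatal errors
by timestamp. The regular expression and the function name group_fatal_errors
are my own assumptions, based only on the log excerpt in this message (one
entry per line, in the form "DD-Mon HH:MM daemon: jobname Fatal error: ...").
Nothing here is provided by Bacula itself.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the log excerpt in this message:
#   "02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error: ..."
FATAL_RE = re.compile(
    r"^(?P<stamp>\d{2}-[A-Z][a-z]{2} \d{2}:\d{2}) "
    r"(?P<daemon>\S+): "
    r"(?P<job>\S+) Fatal error:"
)


def group_fatal_errors(lines):
    """Map each log timestamp to the set of job names that logged a fatal error."""
    failures = defaultdict(set)
    for line in lines:
        match = FATAL_RE.match(line)
        if match:
            failures[match.group("stamp")].add(match.group("job"))
    return dict(failures)


if __name__ == "__main__":
    sample = [
        "02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error: "
        "append.c:259 Network error on data channel. ERR=Input/output error",
        "02-Jun 01:10 lorien-sd: Job write elapsed time = 00:03:16, "
        "Transfer rate = 4.157 M bytes/second",
        "02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: "
        "No Job status returned from FD.",
    ]
    # Timestamps listing more than one job would confirm simultaneous failures.
    for stamp, jobs in sorted(group_fatal_errors(sample).items()):
        print(stamp, sorted(jobs))
```

Log timestamps only have minute resolution, so this can only show that jobs
failed within the same minute, not at exactly the same instant.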
The part *really* driving me bonkers is that I can find no evidence of any
changes that coincide with the problem starting. Bacula version, kernel
version, hardware, network - nothing was changed.
If anyone has any suggestions where I could start looking, I'd love to hear
them.
--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC
_______________________________________________
Bacula-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-users