Ciprian Marius Vizitiu wrote:
> Hi listers,
>
> I have a strange firewall problem with Bacula 2.2.6 running on RHEL4
> (2.6.9-67 but it happens on other RHEL4 kernels too) clients and CentOS5
> server. The description of the problem is... long and ugly so I've
> managed to narrow it down to the following easy (for me) to reproduce
> scenario:
>
> 1. One RHEL4 Bacula 2.2.6 client, 192.168.1.25. Relevant iptables in
> this client:
>
> -A RH-Firewall-1-INPUT -p tcp --dport 9101:9103 -j ACCEPT
> -A RH-Firewall-1-INPUT -p udp --dport 9101:9103 -j ACCEPT
>
> 2. One Bacula 2.2.6 server, 192.168.1.48. Relevant iptables in this server:
>
> -A RH-Firewall-1-INPUT -p tcp --dport 9101:9103 -j ACCEPT
> -A RH-Firewall-1-INPUT -p udp --dport 9101:9103 -j ACCEPT
>
> Although there is no 3Com router involved, "Heartbeat Interval" is set to
> 60s.
>
> Now, simply start a 23GB restore (full plus a differential) consisting
> of ~70.000 files on the client... everything works as expected for like
> 30 minutes during which the client writes 23GB. Then things start to go
> strange:
>
> 1. On the client there is no activity
> 2. On the server bacula-sd is busy on CPU and I/O most likely searching
> through the 10 x 200GB disk volumes for the differential files to restore.
>
> This "state" will last for another ~30 minutes during which a tcpdump
> will only hear the pings from the heartbeat. Depending on whether the
> firewalls are started or not the end can be one of the following:
>
> No firewall: restore job always ends successfully.
> Firewall: Depending on the positions of the planets either the job
> will succeed THREE HOURS later =:-o or (more likely...) it'll fail with
> a "no route to host" error. Tcpdump started when bacula-sd's job is
> nearing the end will clearly show the culprit:
>
> [... Heartbeat...]
>
> 18:32:01.504760 IP server.gbif.org.9103 > client.gbif.org.32776: P
> 1560794395:1560794427(32) ack 1414218623 win 181 <nop,nop,timestamp
> 4070418385 22509939>
> 18:32:01.504801 IP client.gbif.org > server.gbif.org: icmp 92: host
> client.gbif.org unreachable - admin prohibited
> 18:32:01.505214 IP server.gbif.org.9103 > client.gbif.org.32776: .
> 32:1480(1448) ack 1 win 181 <nop,nop,timestamp 4070418386 22509939>
> 18:32:01.505231 IP client.gbif.org > server.gbif.org: icmp 556: host
> client.gbif.org unreachable - admin prohibited
> 18:32:01.505236 IP server.gbif.org.9103 > client.gbif.org.32776: .
> 1480:2928(1448) ack 1 win 181 <nop,nop,timestamp 4070418386 22509939>
> 18:32:01.505249 IP client.gbif.org > server.gbif.org: icmp 556: host
> client.gbif.org unreachable - admin prohibited
>
> To me it looks like the essence of the problem is the fact that the
> restore session has a long "network idle" period and somehow the RELATED
> mechanism of the firewall no longer works. WHY would this happen? And
> more important, isn't this what HeartBeat was supposed to prevent in the
> first place? One more detail: if the client is RHEL5 everything works
> perfectly.
>
> Has anyone seen something like this before? Any ideas will be
> appreciated! :-|
>
I'm not 100% sure, but this looks a bit like a TCP TTL issue.
I don't think the firewall will wait that long, and it has nothing to do
with the heartbeat. My guess is that your firewall treats the session as
closed or timed out because of the idle time.
Check whether you can adjust the TTL for TCP on the firewall.
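
A rough way to test the idle-timeout guess is to read the connection-tracking timeout for established TCP sessions on the client and compare it with the 60s heartbeat. This is only a sketch: the two sysctl paths are the usual locations for the older ip_conntrack and the newer nf_conntrack modules, but which one exists on your kernel is an assumption, and the helper names read_timeout and heartbeat_keeps_alive are made up for illustration.

```python
"""Compare the conntrack idle timeout for established TCP with a heartbeat interval."""

# Assumed sysctl locations; which one exists depends on the kernel and loaded modules.
CANDIDATE_PATHS = (
    "/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_established",
    "/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established",
)


def read_timeout(paths=CANDIDATE_PATHS):
    """Return the first readable timeout in seconds, or None if none is found."""
    for path in paths:
        try:
            with open(path) as handle:
                return int(handle.read().strip())
        except (OSError, ValueError):
            # Missing file or unexpected content: try the next candidate.
            continue
    return None


def heartbeat_keeps_alive(timeout_s, heartbeat_s):
    """True if a packet every heartbeat_s seconds arrives before the entry expires."""
    return heartbeat_s < timeout_s


if __name__ == "__main__":
    timeout = read_timeout()
    if timeout is None:
        print("no conntrack timeout sysctl found (module not loaded?)")
    else:
        # 60 is the Heartbeat Interval from the original post.
        ok = heartbeat_keeps_alive(timeout, 60)
        print("established timeout: %ds, 60s heartbeat sufficient: %s" % (timeout, ok))
```

If the reported timeout is far larger than the heartbeat interval, plain idle expiry of an established session is probably not the whole story and the cause has to be looked for elsewhere in the firewall state handling.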
--
bEsT rEgArDs | "Confidence is what you have before you
tomasz dereszynski | understand the problem." -- Woody Allen
|
Spes confisa Deo | "In theory, theory and practice are much
numquam confusa recedit | the same. In practice they are very
| different." -- Albert Einstein
_______________________________________________
Bacula-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-users