On 12/21/2011 1:10 AM, Shawn Heisey wrote:
On 12/20/2011 10:33 AM, Otis Gospodnetic wrote:
Shawn,
Give httping a try: http://www.vanheusden.com/httping/
It may reveal something about the connection being dropped periodically.
Maybe even a plain ping would show some dropped packets if it's a
general network issue and not a Solr-specific one.
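For reference, the kind of check being suggested would look something like
this (the host and port below are just placeholders for one of the Solr
URLs here, and httping flags can vary a little between versions):

    httping -c 20 -g http://solr-host:8983/solr/
    ping -c 100 solr-host

A run of consecutive httping failures, or any loss in the ping summary,
would point at the network rather than at Solr itself.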
The connections here are gigabit Ethernet on the same VLAN, and
sometimes the problem happens with cores on the same box that's running
the SolrJ code, where the traffic, if all is sane, never actually goes
out the NIC. I see no errors on the interface.
bond0 Link encap:Ethernet HWaddr 00:1C:23:DC:81:53
inet addr:10.100.0.240 Bcast:10.100.1.255 Mask:255.255.254.0
inet6 addr: fe80::21c:23ff:fedc:8153/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:453134140 errors:0 dropped:0 overruns:0 frame:0
TX packets:297893403 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:446857564768 (416.1 GiB)  TX bytes:191134876472 (178.0 GiB)
BONDING_OPTS="mode=1 miimon=100 updelay=200 downdelay=200 primary=eth0"
I realized after sending the ifconfig that errors would probably not
show on the bonded interface. Stats are also clear on the slaves:
eth0 Link encap:Ethernet HWaddr 00:1C:23:DC:81:53
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:454373740 errors:0 dropped:0 overruns:0 frame:0
TX packets:301194576 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:449062687599 (418.2 GiB)  TX bytes:193031706549 (179.7 GiB)
Interrupt:16 Memory:f8000000-f8012800
eth1 Link encap:Ethernet HWaddr 00:1C:23:DC:81:53
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:2261000 errors:0 dropped:0 overruns:0 frame:0
TX packets:5 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:194331296 (185.3 MiB) TX bytes:398 (398.0 b)
Interrupt:16 Memory:f4000000-f4012800
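For what it's worth, per-slave errors can also be checked below the
ifconfig level with the bonding driver's proc file and ethtool (the
counter names in ethtool -S vary by NIC driver, so treat the grep as a
rough filter):

    cat /proc/net/bonding/bond0
    ethtool -S eth0 | grep -iE 'err|drop'
    ethtool -S eth1 | grep -iE 'err|drop'

If the driver or hardware were eating frames, it ought to show up in
those counters.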
The switch interfaces are also very clean, as seen below. They do show
some output drops, but those are an extremely small percentage of the
total packets.
GigabitEthernet0/13 is up, line protocol is up (connected)
Hardware is Gigabit Ethernet, address is 0024.c3cc.ad0d (bia 0024.c3cc.ad0d)
Description: bigindy0 nic1
MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
input flow-control is on, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 1y45w, output 00:00:01, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 74219
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 378000 bits/sec, 81 packets/sec
5 minute output rate 1863000 bits/sec, 210 packets/sec
15993961043 packets input, 18181095872276 bytes, 0 no buffer
Received 31769202 broadcasts (20225268 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 20225268 multicast, 0 pause input
0 input packets with dribble condition detected
21413035341 packets output, 21796346722157 bytes, 0 underruns
0 output errors, 0 collisions, 3 interface resets
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 PAUSE output
0 output buffer failures, 0 output buffers swapped out
switch uptime 2 years, 27 weeks, 4 days, 21 hours, 20 minutes
host uptime 33 days, 16:21
Even if the switch were dropping the occasional packet, the TCP stack in
Linux should immediately retransmit it and everything would be fine,
just delayed slightly. The output drops here are 0.00035 percent of the
total packets output. One of the other machines (on a different switch)
shows ten times as many switchport drops, but even that is 0.0037
percent of the packets on that port. I have cleared the counters on all
the switches, and after twenty minutes and 400000 packets output,
they're running completely clean. I will keep an eye on those stats and
wait for the next exception to see whether there is a spike in output
drops when the problem happens. I don't expect that to be the problem,
though. If it is a networking problem, it is most likely in the CentOS 6
kernel. I'd like for it to be that simple, but I think the chance of
that is small.
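As a sanity check on the retransmission point, the host-side TCP
counters can be watched while indexing runs; something along these
lines (the exact wording of the netstat -s output differs between
kernel versions):

    netstat -s | grep -iE 'retrans|timeout'

A retransmit count that climbs right when the SolrJ exceptions appear
would point back at the network; a flat one would support the
software-problem theory below.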
I think it's more likely that it's a software problem, and that the
error was probably mine, but I need help in tracking it down.
Thanks,
Shawn