On 12/21/2011 1:10 AM, Shawn Heisey wrote:
On 12/20/2011 10:33 AM, Otis Gospodnetic wrote:
Shawn,

Give httping a try: http://www.vanheusden.com/httping/

It may reveal something about connection being dropped periodically.
Maybe even a plain ping would show some dropped packets if it's a general network and not a Solr-specific issue.
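For anyone without httping handy, roughly the same measurement can be sketched with the Python standard library: repeatedly time an HTTP GET and treat failures as "dropped" connections. This is just an approximation of what httping does, and the URL is a placeholder, not an actual host from this thread.

```python
import time
import urllib.request

def time_request(url, timeout=5.0):
    """Time one HTTP GET; return elapsed seconds, or None if it failed."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read(1)  # force at least one byte over the wire
    except OSError:       # URLError, timeouts, connection refused, etc.
        return None
    return time.monotonic() - start

if __name__ == "__main__":
    # Placeholder URL; point it at the Solr instance being tested.
    url = "http://localhost:8983/solr/"
    for _ in range(10):
        elapsed = time_request(url)
        print("lost" if elapsed is None else f"{elapsed * 1000:.1f} ms")
        time.sleep(1)
```

Periodic spikes or "lost" lines against a host on the same box would point away from the physical network.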

The connections here are gigabit ethernet on the same VLAN, and sometimes it happens to cores on the same box that's running the SolrJ code; that traffic, if all things are sane, never actually goes out the NIC. I see no errors on the interface.

bond0     Link encap:Ethernet  HWaddr 00:1C:23:DC:81:53
          inet addr:10.100.0.240  Bcast:10.100.1.255  Mask:255.255.254.0
          inet6 addr: fe80::21c:23ff:fedc:8153/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:453134140 errors:0 dropped:0 overruns:0 frame:0
          TX packets:297893403 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:446857564768 (416.1 GiB)  TX bytes:191134876472 (178.0 GiB)

BONDING_OPTS="mode=1 miimon=100 updelay=200 downdelay=200 primary=eth0"

I realized after sending the ifconfig that errors would probably not show on the bonded interface. Stats are also clear on the slaves:

eth0      Link encap:Ethernet  HWaddr 00:1C:23:DC:81:53
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:454373740 errors:0 dropped:0 overruns:0 frame:0
          TX packets:301194576 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:449062687599 (418.2 GiB)  TX bytes:193031706549 (179.7 GiB)
          Interrupt:16 Memory:f8000000-f8012800

eth1      Link encap:Ethernet  HWaddr 00:1C:23:DC:81:53
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:2261000 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:194331296 (185.3 MiB)  TX bytes:398 (398.0 b)
          Interrupt:16 Memory:f4000000-f4012800

The switch interfaces are also very clean, as seen below. They do show some output drops, but the percentage of packets is extremely low.

GigabitEthernet0/13 is up, line protocol is up (connected)
Hardware is Gigabit Ethernet, address is 0024.c3cc.ad0d (bia 0024.c3cc.ad0d)
  Description: bigindy0 nic1
  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
  input flow-control is on, output flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 1y45w, output 00:00:01, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 74219
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 378000 bits/sec, 81 packets/sec
  5 minute output rate 1863000 bits/sec, 210 packets/sec
     15993961043 packets input, 18181095872276 bytes, 0 no buffer
     Received 31769202 broadcasts (20225268 multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 20225268 multicast, 0 pause input
     0 input packets with dribble condition detected
     21413035341 packets output, 21796346722157 bytes, 0 underruns
     0 output errors, 0 collisions, 3 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out

switch uptime 2 years, 27 weeks, 4 days, 21 hours, 20 minutes
host uptime 33 days, 16:21

Even if the switch were dropping the occasional packet, the TCP stack in Linux should immediately retransmit it and everything would be fine, just slightly delayed. The number of output drops here is 0.00035 percent of the total packets output. One of the other machines (on a different switch) shows ten times as many switchport drops, but even that is 0.0037 percent of the packets on that port. I have cleared the counters on all the switches, and after twenty minutes and 400000 packets output, it's running completely clean. I will keep an eye on those stats and wait for the next exception to see whether there is a spike in output drops when the problem happens. I don't expect that to be the problem, though. If it is a networking problem, it is most likely in the CentOS 6 kernel. I'd like for it to be that simple, but I think the possibility there is small.
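The 0.00035 percent figure follows directly from the counters in the switch output above (74219 total output drops against 21413035341 packets output); a quick check:

```python
# Drop percentage from the "show interface" counters pasted above.
drops = 74_219             # "Total output drops" on Gi0/13
packets_out = 21_413_035_341  # "packets output" on Gi0/13

pct = drops / packets_out * 100
print(f"{pct:.5f}%")  # → 0.00035%
```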

I think it's more likely that it's a software problem, and that the error was probably mine, but I need help in tracking it down.

Thanks,
Shawn
