Hi all,

I'm hoping for some troubleshooting advice. I have an OpenIndiana oi_151a8 virtual machine that worked correctly on vSphere 5.1 but has misbehaved since the upgrade to vSphere 5.5 (ESXi-5.5.0-1331820-standard).
A small corner of my network infrastructure has a vSphere host running two virtual machines:

- ape: "Debian Linux ape 2.6.32-5-amd64 #1 SMP Sun Sep 23 10:07:46 UTC 2012 x86_64 GNU/Linux". It uses USB passthrough to read from an APC UPS and e-mail me when power is lost.
- giraffe: oi_151a8. It serves virtual machine images over NFS.

Since the upgrade of vSphere from 5.1 to 5.5, virtual machines on other hosts whose VMDKs are on this NFS mount have become very slow. PuTTY sessions to the oi_151a8 VM also 'stutter', and I see a repeating pattern in ping, with roughly every third reply taking about 1.4 seconds while the rest come back in under 1 ms:

    Reply from 192.168.0.13: bytes=32 time=1367ms TTL=255
    Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
    Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
    Reply from 192.168.0.13: bytes=32 time=1369ms TTL=255
    Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
    Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
    Reply from 192.168.0.13: bytes=32 time=1356ms TTL=255
    Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
    Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
    Reply from 192.168.0.13: bytes=32 time=1376ms TTL=255
    Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
    Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
    Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
    Request timed out.

At the same time, pings to the neighbouring VM (ape), to the host, and to other random machines on the network all follow the normal "time<1ms" pattern. I've therefore ruled out the switch infrastructure, and possibly also the vSwitch inside this vSphere host, given that the giraffe VM exhibits the problem whereas ape does not.

Interestingly, if I power down the VMs whose storage lives on giraffe, the pings return to sub-1ms. I'm drawing the conclusion that this is a symptom of the combination of OI, vSphere 5.5 and network load, although I'm not sure where to turn next.

Tried:

- "zpool scrub rpool", to induce high read load on the SSD in the vSphere host.
  This may look like a strange thing to test, but in the past I've seen odd effects on Windows machines whose storage was struggling.
- Created a test pool on the SSD and induced write load with "cat /dev/zero > /testpool/zerofile".
- "zpool scrub giraffepool", to induce high read load on the spinning drives.

None of these three tests had any effect, which further hints that network load is the trigger.

I also checked that ipfilter is disabled, as shown below, yet dmesg still contains the message "IP Filter: v4.1.9, running.":

    chris@giraffe:~# svcs -xv ipfilter
    svc:/network/ipfilter:default (IP Filter)
     State: disabled since October 20, 2013 12:17:02 PM UTC
    Reason: Disabled by an administrator.
       See: http://illumos.org/msg/SMF-8000-05
       See: man -M /usr/share/man -s 5 ipfilter
    Impact: This service is not running.

Haven't tried yet: installing OI again in another VM to see whether the problem is localised to giraffe, since I'd also have to induce load there to be confident of whether the issue exists or not.

I'm using the e1000 NIC in vSphere and don't have VMware Tools installed.

Any troubleshooting advice to help me focus somewhere would be appreciated.

Many thanks,
Chris
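To put a number on the ping pattern described above while re-running the load tests, a small script can classify each line of Windows ping output and report the spacing between slow replies. This is a minimal sketch, assuming the Windows "Reply from ... time=NNNms" format shown in the post; the 100 ms spike threshold and the names `classify` and `spike_gaps` are hypothetical choices for illustration, not part of any existing tool.

```python
import re

# Matches Windows ping reply lines such as
# "Reply from 192.168.0.13: bytes=32 time=1367ms TTL=255" or "... time<1ms ...".
REPLY_RE = re.compile(r"Reply from \S+: bytes=\d+ time([=<])(\d+)ms")


def classify(lines, spike_ms=100):
    """Label each ping result as 'fast', 'spike' or 'timeout'.

    spike_ms is a hypothetical threshold: replies at or above it count as spikes.
    Lines that are neither replies nor timeouts (headers, statistics) are skipped.
    """
    labels = []
    for line in lines:
        line = line.strip()
        if line.startswith("Request timed out"):
            labels.append("timeout")
            continue
        match = REPLY_RE.search(line)
        if match is None:
            continue
        op, value = match.group(1), int(match.group(2))
        # "time<1ms" always means under 1 ms, so only "=" can be a spike.
        if op == "=" and value >= spike_ms:
            labels.append("spike")
        else:
            labels.append("fast")
    return labels


def spike_gaps(labels):
    """Distance, in ping sequence positions, between consecutive spikes."""
    positions = [i for i, label in enumerate(labels) if label == "spike"]
    return [b - a for a, b in zip(positions, positions[1:])]


SAMPLE = """\
Reply from 192.168.0.13: bytes=32 time=1367ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time=1369ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time=1356ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time=1376ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Request timed out.
"""

if __name__ == "__main__":
    labels = classify(SAMPLE.splitlines())
    print(spike_gaps(labels))  # [3, 3, 3]
    print(labels.count("timeout"))  # 1
```

Applied to the sample from the post, the spikes are exactly three pings apart, i.e. roughly every three seconds at the default one-second interval. A stable period like this, appearing only under NFS load, is easier to compare across runs (for example with the test VMs powered off, or with a different virtual NIC) than eyeballing raw ping output.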
