Hi Shawn,

Thanks for pointing out that it may be due to a network/VM issue. I looked at the ZK logs in detail, and I see the socket timeout below, after which the ZK follower shutdown is called.
Is that good enough to confirm this is a VM/network issue rather than a ZK/Solr issue? I am also including the dmesg output from the timestamps when we had the issues.

...
...
2018-10-22 06:03:56,022 [myid:2] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2182:NIOServerCnxn@1044] - Closed socket connection for client /192.3.101.219:55704 which had sessionid 0x5665c67cb0d0000
2018-10-22 06:03:56,022 [myid:2] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2182:Learner@108] - Revalidating client: 0x5665c67cb0d0000
2018-10-22 06:03:56,265 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2182:Follower@89] - Exception when following the leader
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
        at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
2018-10-22 06:03:56,266 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2182:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
        at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
2018-10-22 06:03:56,266 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2182:NIOServerCnxn@1044] - Closed socket connection for client /192.72.5.213:57834 which had sessionid 0x46591d67d0c0024
2018-10-22 06:03:56,266 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2182:NIOServerCnxn@1044] - Closed socket connection for client /192.3.95.181:55192 which had sessionid 0x3665c676caf0004
2018-10-22 06:03:56,266 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2182:NIOServerCnxn@1044] - Closed socket connection for client /192.3.224.15:38712 which had sessionid 0x2668d42319e0012
...
...
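One note on the timeout itself, as I understand it (please correct me if I'm wrong): while a follower is following the leader, its socket read timeout is tickTime * syncLimit from zoo.cfg, so any stall longer than that, whether on the network or because the VM/disk froze, would produce exactly this SocketTimeoutException. With example values like the below (illustrative only, not necessarily what we run), that would be a roughly 10 second stall:

# zoo.cfg - example values only, not our actual settings
# While following, the follower's read timeout is tickTime * syncLimit
# (here 2000 ms * 5 = 10 s), so a ~10 s pause on the VM or its disk is
# enough to hit the read timeout above even if the network is healthy.
tickTime=2000
initLimit=10
syncLimit=5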
dmesg

srch0117
[Mon Oct 22 06:04:37 2018] mptscsih: ioc0: attempting task abort! (sc=ffff88081bdf90c0)
[Mon Oct 22 06:04:37 2018] sd 0:0:1:0: [sdb] tag#0 CDB: Write(10) 2a 00 00 50 43 df 00 00 10 00
[Mon Oct 22 06:04:41 2018] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88081bdf90c0)

srch0118
[Mon Oct 22 06:04:41 2018] mptscsih: ioc0: attempting task abort! (sc=ffff8807b7c7b200)
[Mon Oct 22 06:04:41 2018] sd 0:0:1:0: [sdb] tag#3 CDB: Write(10) 2a 00 00 33 da 80 00 00 08 00
[Mon Oct 22 06:04:49 2018] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff8807b7c7b200)
[Mon Oct 22 06:04:49 2018] mptscsih: ioc0: attempting task abort! (sc=ffff88081c09a680)
[Mon Oct 22 06:04:49 2018] sd 0:0:1:0: [sdb] tag#4 CDB: Write(10) 2a 00 00 50 13 e8 00 00 0f 00
[Mon Oct 22 06:04:50 2018] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88081c09a680)

srch0119
[Mon Oct 22 06:04:30 2018] mptscsih: ioc0: attempting task abort! (sc=ffff8807777e63c0)
[Mon Oct 22 06:04:30 2018] sd 0:0:1:0: [sdb] tag#0 CDB: Write(10) 2a 00 00 38 06 b0 00 00 18 00
[Mon Oct 22 06:04:38 2018] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff8807777e63c0)

srch0120
Nothing around 6

srch0121
[Mon Oct 22 06:00:01 2018] BTRFS info (device sda1): relocating block group 1273285836800 flags 1
[Mon Oct 22 06:00:02 2018] BTRFS info (device sda1): found 8 extents
[Mon Oct 22 06:00:05 2018] BTRFS info (device sda1): found 8 extents
[Mon Oct 22 06:00:05 2018] BTRFS info (device sda1): relocating block group 1274527350784 flags 1
[Mon Oct 22 06:00:05 2018] BTRFS info (device sda1): found 8 extents
[Mon Oct 22 06:00:07 2018] BTRFS info (device sda1): found 8 extents
[Mon Oct 22 06:00:07 2018] BTRFS info (device sda1): relocating block group 1275601092608 flags 1
[Mon Oct 22 06:00:08 2018] BTRFS info (device sda1): found 8 extents
[Mon Oct 22 06:00:10 2018] BTRFS info (device sda1): found 8 extents
[Mon Oct 22 06:00:10 2018] BTRFS info (device sda1): relocating block group 1274493796352 flags 34
[Mon Oct 22 06:00:10 2018] BTRFS info (device sda1): relocating block group 1277748576256 flags 34
[Mon Oct 22 06:00:10 2018] BTRFS info (device sda1): relocating block group 1277782130688 flags 34
[Mon Oct 22 06:00:10 2018] BTRFS info (device sda1): relocating block group 1277815685120 flags 34
[Mon Oct 22 06:00:10 2018] BTRFS info (device sda1): relocating block group 1277849239552 flags 34
[Fri Oct 19 12:58:00 2018] mptscsih: ioc0: attempting task abort! (sc=ffff88081c17ce00)
[Fri Oct 19 12:58:00 2018] sd 0:0:0:0: [sda] tag#2 CDB: Write(10) 2a 00 00 4f 50 00 00 00 80 00
[Fri Oct 19 12:58:11 2018] mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!! doorbell=0x24000000
[Fri Oct 19 12:58:37 2018] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [splunkd:2753]
[Fri Oct 19 12:58:37 2018] Modules linked in: tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag falcon_lsm_serviceable(PEN) falcon_nf_netcontain(PEN) falcon_lsm_pinned_5607(EN) binfmt_misc af_packet iscsi_ibft iscsi_boot_sysfs vmw_vsock_vmci_transport vsock falcon_lsm_pinned_5305(EN) xfs libcrc32c sb_edac edac_core coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drbg ansi_cprng aesni_intel aes_x86_64 ppdev joydev pcspkr vmxnet3 vmw_balloon lrw gf128mul glue_helper i2c_piix4 ablk_helper cryptd vmw_vmci parport_pc shpchp mptctl parport fjes processor ac arc4 ecb ppp_mppe btrfs sr_mod cdrom ata_generic xor hid_generic usbhid raid6_pq sd_mod ata_piix ahci libahci vmwgfx crc32c_intel serio_raw drm_kms_helper uhci_hcd syscopyarea sysfillrect ehci_pci sysimgblt fb_sys_fops ehci_hcd ttm mptspi
[Fri Oct 19 12:58:37 2018] libata usbcore scsi_transport_spi drm mptscsih mptbase usb_common button sg ppp_generic slhc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod ecryptfs autofs4 [last unloaded: falcon_lsm_serviceable]
[Fri Oct 19 12:58:37 2018] Supported: No, Proprietary and Unsupported modules are loaded
[Fri Oct 19 12:58:37 2018] CPU: 5 PID: 2753 Comm: splunkd Tainted: P E N 4.4.132-94.33-default #1
[Fri Oct 19 12:58:37 2018] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[Fri Oct 19 12:58:37 2018] task: ffff88081b3b56c0 ti: ffff8808017f0000 task.ti: ffff8808017f0000
[Fri Oct 19 12:58:37 2018] RIP: 0010:[<ffffffff8110acda>] [<ffffffff8110acda>] smp_call_function_single+0xba/0x110
[Fri Oct 19 12:58:37 2018] RSP: 0018:ffff8808017f3d28 EFLAGS: 00000202
[Fri Oct 19 12:58:37 2018] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff81e3ca00
[Fri Oct 19 12:58:37 2018] RDX: 0000000000000001 RSI: 00000000000000fb RDI: 0000000000000286
[Fri Oct 19 12:58:37 2018] RBP: ffffffff8106e9d0 R08: ffffffff81f639a0 R09: 0000000000000000
[Fri Oct 19 12:58:37 2018] R10: 00007f8112149d50 R11: ffff8800bbae3e20 R12: 00007f8111f4c000
[Fri Oct 19 12:58:37 2018] R13: ffff88081dbb3800 R14: ffff88081dbb3b70 R15: ffff88081e575400
[Fri Oct 19 12:58:37 2018] FS: 00007f8102dff700(0000) GS:ffff88083fd40000(0000) knlGS:0000000000000000
[Fri Oct 19 12:58:37 2018] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Oct 19 12:58:37 2018] CR2: 00007f810239a000 CR3: 00000000bb0e8000 CR4: 0000000000040670
[Fri Oct 19 12:58:37 2018] Stack:
[Fri Oct 19 12:58:37 2018] 00000001bf1ea867 0000000000000005 0000000000000000 ffffffff8106e9d0
[Fri Oct 19 12:58:37 2018] ffff8808017f3d70 0000000000000003 00007f8111f4d000 00007f8111f4d000
[Fri Oct 19 12:58:37 2018] ffffffff8106ee8b ffff88081dbb3800 00007f8111f4c000 00007f8111f4d000
[Fri Oct 19 12:58:37 2018] Call Trace:
[Fri Oct 19 12:58:37 2018] [<ffffffff8106ee8b>] native_flush_tlb_others+0x10b/0x190
[Fri Oct 19 12:58:37 2018] [<ffffffff8106efab>] flush_tlb_mm_range+0x9b/0x170
[Fri Oct 19 12:58:37 2018] [<ffffffff811be766>] tlb_flush_mmu_tlbonly+0x66/0xc0
[Fri Oct 19 12:58:37 2018] [<ffffffff811bf59f>] tlb_finish_mmu+0xf/0x40
[Fri Oct 19 12:58:37 2018] [<ffffffff811c6ce3>] unmap_region+0xc3/0xf0
[Fri Oct 19 12:58:37 2018] [<ffffffff811c8dbb>] do_munmap+0x20b/0x390
[Fri Oct 19 12:58:37 2018] [<ffffffff811c8f7b>] vm_munmap+0x3b/0x50
[Fri Oct 19 12:58:37 2018] [<ffffffff811c9fdd>] SyS_munmap+0x1d/0x30
[Fri Oct 19 12:58:37 2018] [<ffffffff816173c3>] entry_SYSCALL_64_fastpath+0x1e/0xca
[Fri Oct 19 12:58:37 2018] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xca
[Fri Oct 19 12:58:37 2018] Leftover inexact backtrace:
[Fri Oct 19 12:58:37 2018] Code: 89 ea e8 5a fe ff ff 48 83 c4 30 5b 5d c3 48 8d 74 24 10 48 89 d1 89 df 48 89 ea e8 41 fe ff ff 8b 54 24 28 83 e2 01 74 0b f3 90 <8b> 54 24 28 83 e2 01 75 f5 48 83 c4 30 5b 5d c3 8b 05 08 39 13
[Fri Oct 19 12:58:57 2018] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88081c17ce00)
[Fri Oct 19 12:58:57 2018] mptscsih: ioc0: attempting task abort! (sc=ffff88081c17cc80)
[Fri Oct 19 12:58:57 2018] sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 5c d7 78 00 02 00 00
[Fri Oct 19 12:58:57 2018] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88081c17cc80)
[Fri Oct 19 12:58:57 2018] mptscsih: ioc0: attempting task abort! (sc=ffff88081d44e940)
[Fri Oct 19 12:58:57 2018] sd 0:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 00 54 8f 70 00 02 00 00
[Fri Oct 19 12:58:57 2018] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88081d44e940)
[Fri Oct 19 12:58:57 2018] scsi target0:0:0: Beginning Domain Validation
[Fri Oct 19 12:58:57 2018] scsi target0:0:0: Domain Validation skipping write tests
[Fri Oct 19 12:58:57 2018] scsi target0:0:0: Ending Domain Validation
[Fri Oct 19 12:58:57 2018] scsi target0:0:0: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
[Fri Oct 19 12:58:58 2018] scsi target0:0:1: Beginning Domain Validation
[Fri Oct 19 12:58:58 2018] scsi target0:0:1: Domain Validation skipping write tests
[Fri Oct 19 12:58:58 2018] scsi target0:0:1: Ending Domain Validation
[Fri Oct 19 12:58:58 2018] scsi target0:0:1: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
[Fri Oct 19 12:58:58 2018] scsi target0:0:2: Beginning Domain Validation
[Fri Oct 19 12:58:58 2018] scsi target0:0:2: Domain Validation skipping write tests
[Fri Oct 19 12:58:58 2018] scsi target0:0:2: Ending Domain Validation
[Fri Oct 19 12:58:58 2018] scsi target0:0:2: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
[Fri Oct 19 12:58:58 2018] scsi target0:0:3: Beginning Domain Validation
[Fri Oct 19 12:58:58 2018] scsi target0:0:3: Domain Validation skipping write tests
[Fri Oct 19 12:58:58 2018] scsi target0:0:3: Ending Domain Validation
[Fri Oct 19 12:58:58 2018] scsi target0:0:3: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
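The timing does seem to line up: the follower read timeout on 10/22 was at 06:03:56, and the SCSI task aborts on srch0117-srch0119 fall between 06:04:30 and 06:04:50 the same morning. If it helps anyone reproduce the comparison, something like the below should pull both sides for that window on each host (the ZK log file name/path here is an assumption, adjust for our setup):

grep '2018-10-22 06:0' zookeeper.out
dmesg -T | grep 'Oct 22 06:0'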
On Mon, Oct 22, 2018 at 10:15 PM Susheel Kumar <susheel2...@gmail.com> wrote:

> Hi Shawn,
>
> Here is the link for Solr GC log and it doesn't look Solr GC problem. The
> total GC is 12 GB. The GC log is from yesterday and the issue happened
> this morning i.e. 10/22.
>
>
> https://www.dropbox.com/s/zdlu9sk8kc469ls/Screen%20Shot%202018-10-22%20at%2010.08.37%20PM.png?dl=0
>
>
> It may be network issue but just looking the message "ZookeeperSolr server
> not running" and later it instantiate doesn't give any clue.
>
> Thnx
>
> On Mon, Oct 22, 2018 at 9:54 PM Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 10/22/2018 7:32 PM, Susheel Kumar wrote:
>> > Hi Shawn, you meant ZK GC log correct?
>>
>> There was another potential cause I was thinking of, but when I got to
>> where I was going to list them in the previous message, I could not for
>> the life of me remember what the other one was.
>>
>> I just remembered: This problem could be caused by severe network
>> connectivity issues between your servers. A few dropped packets
>> probably isn't enough ... I think it would have to be a severe problem.
>>
>> Thanks,
>> Shawn
>>