Some information I forgot to include:

Solr version: 7.2.1
ZooKeeper version: 3.4.10

-----Original Message-----
From: Zarski, Jacek <zars...@dnb.com.INVALID>
Sent: Monday, July 30, 2018 4:06 PM
To: solr-user@lucene.apache.org
Subject: Zookeeper / Solr interaction

Hi,

We have the following environment setup for ZooKeeper/SolrCloud:

3-node ZooKeeper ensemble
2 SolrCloud servers

I am writing to inquire further about the interaction between Solr and ZooKeeper, in particular relating to transactions in the transaction logs. I have a script running that logs the number of transactions. I am matching this log against snapshot timing and new log creation.

After a problem arose in our PROD environment, I tracked it to an unrecommended configuration where logs and data were kept on the same drive. Since then we have configured separate drives for logs and data in that environment. The behavior that caused the problem was that, while a snapshot was happening, a Solr instance reported that it was unable to establish a ZK leader. Following that failure, during recovery, 4 more snapshots happened in short succession (10 minutes) on all 3 ZooKeeper servers, leaving the whole environment unresponsive for 1.5 hours, until a restart.

I am currently working to recreate the problem and gather more information on the cause and impact of snapshots. I have configured a DEV environment with the same number of servers, and I have changed the ZooKeeper configuration to again keep the logs and data on the same drive and in the same directory. I am seeing that snapshots cause a degradation in performance due to IO block, but I would like more information on transactions and snapshots to confirm this behavior and our suspicions.

Here are the scenarios I would like more information about:

1. When the Solr server is restarted, I see a huge influx of transactions in the ZooKeeper transaction log. What Solr behavior is causing this, and is it normal?

2. There are scenarios where snapshots are being created without reaching "snapCount" (snapCount=100000) transactions. I have documented snapshots at 17k and 45k transactions. In what scenarios would a snapshot be created other than reaching "snapCount" transactions?

3. Since ZooKeeper won't respond before writing to the transaction log, is it possible that at snapshot time (IO block) the Solr server waits for a response from ZooKeeper, causing all other writes to be buffered, resulting in a full heap and therefore an out-of-memory failure on the Solr node?

   a. Now referencing question #1: when a Solr node recovers, the influx of transactions plus the continuing writes seems to be enough to trigger another snapshot, resulting in further downtime. Is this case plausible?

Thanks,
Jacek
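
A minimal sketch of one way to collect the transaction counts mentioned above, assuming the `srvr` four-letter command is enabled on the ZooKeeper servers and that the difference between the low 32 bits of two `Zxid` samples is an acceptable approximation of the number of transactions in between. The helper names (`four_letter_word`, `parse_zxid`, `transactions_between`) are illustrative and are not taken from the actual logging script.

```python
import socket

# A zxid is a 64-bit value: high 32 bits are the leader epoch,
# low 32 bits are a counter that restarts when a new epoch begins.
COUNTER_MASK = 0xFFFFFFFF


def four_letter_word(host, port, cmd="srvr", timeout=5.0):
    """Send a four-letter command to a ZooKeeper server and return the reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_zxid(srvr_output):
    """Return (epoch, counter) parsed from the 'Zxid: 0x...' line of srvr output."""
    for line in srvr_output.splitlines():
        if line.startswith("Zxid:"):
            zxid = int(line.split(":", 1)[1].strip(), 16)
            return zxid >> 32, zxid & COUNTER_MASK
    raise ValueError("no Zxid line found in srvr output")


def transactions_between(prev, curr):
    """Counter difference between two (epoch, counter) samples.

    Returns None when the epoch changed, because the counter restarted
    and the two samples are not directly comparable.
    """
    if prev[0] != curr[0]:
        return None
    return curr[1] - prev[1]
```

Calling `parse_zxid(four_letter_word("localhost", 2181))` at a fixed interval (host and port are placeholders) and logging `transactions_between` for consecutive samples gives a timeline that can be lined up with snapshot and transaction log file timestamps.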