1> Yeah, the interactions with ZK are quite chatty. Basically, each
replica may go through several state changes in ZooKeeper:
down -> recovering -> active. How many replicas do you have on a node?
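
If it helps to answer the replica question, here is a minimal sketch that
counts replicas per node from the JSON returned by the Collections API
CLUSTERSTATUS action. The sample data, collection name and node names are
made up for illustration, and the function assumes the usual
cluster -> collections -> shards -> replicas layout of that response.

```python
from collections import Counter


def replicas_per_node(cluster_status):
    """Count replicas hosted on each node in a parsed CLUSTERSTATUS response."""
    counts = Counter()
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for collection in collections.values():
        for shard in collection.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                # node_name identifies the Solr instance hosting this replica
                counts[replica.get("node_name", "unknown")] += 1
    return counts


# Hypothetical response fragment: one collection, two shards, two nodes.
sample = {
    "cluster": {
        "collections": {
            "example_collection": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {"node_name": "host1:8983_solr", "state": "active"},
                            "core_node2": {"node_name": "host2:8983_solr", "state": "active"},
                        }
                    },
                    "shard2": {
                        "replicas": {
                            "core_node3": {"node_name": "host1:8983_solr", "state": "recovering"},
                        }
                    },
                }
            }
        }
    }
}

for node, count in sorted(replicas_per_node(sample).items()):
    print(node, count)
```

In practice the input would be the parsed body of
/solr/admin/collections?action=CLUSTERSTATUS&wt=json from one of the Solr nodes.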

2> Unfortunately I don't have much info on this point.

3> I would not expect OOMs on the Solr node while waiting for ZK to
respond. How much heap are you allocating to Solr? How many replicas
do you have?

a> Plausible, yes, but that many transactions seems quite high for
only two Solr instances. One scenario here is that you have thousands
of replicas. Is your ZK ensemble an external one, and is it running on
separate hardware? I'm curious about a few more details of your setup:
how many collections, shards and replicas are we talking about here?
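
One rough way to put a number on "that many transactions" is to compare
the Zxid reported by a ZK server before and after a Solr restart (the
"srvr" four-letter command prints a "Zxid:" line). The sketch below only
parses such output; the sample text and function names are made up for
illustration. It relies on the zxid being a 64-bit value whose high 32
bits are the leader epoch and whose low 32 bits are a counter that starts
over after each leader election, so a difference is only meaningful
within one epoch.

```python
def zxid_from_srvr(srvr_output):
    """Extract the hex zxid string from the text returned by 'srvr'."""
    for line in srvr_output.splitlines():
        if line.startswith("Zxid:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no Zxid line found")


def parse_zxid(zxid_hex):
    """Split a zxid into (epoch, counter)."""
    zxid = int(zxid_hex, 16)  # accepts an optional 0x prefix
    return zxid >> 32, zxid & 0xFFFFFFFF


def transactions_between(before_hex, after_hex):
    """Counter difference between two zxids, or None if the epoch changed."""
    epoch_before, count_before = parse_zxid(before_hex)
    epoch_after, count_after = parse_zxid(after_hex)
    if epoch_before != epoch_after:
        return None  # counter restarted with the new epoch
    return count_after - count_before


# Hypothetical 'srvr' output fragments captured before and after a restart.
before = "Zookeeper version: 3.4.10\nZxid: 0x100000010\nMode: follower\n"
after = "Zookeeper version: 3.4.10\nZxid: 0x100000110\nMode: follower\n"

delta = transactions_between(zxid_from_srvr(before), zxid_from_srvr(after))
print(delta)
```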

Best,
Erick


On Mon, Jul 30, 2018 at 1:10 PM, Zarski, Jacek <zars...@dnb.com.invalid> wrote:
> Some information I forgot to include:
> Solr version : 7.2.1
> Zk version : 3.4.10
>
> -----Original Message-----
> From: Zarski, Jacek <zars...@dnb.com.INVALID>
> Sent: Monday, July 30, 2018 4:06 PM
> To: solr-user@lucene.apache.org
> Subject: Zookeeper / Solr interaction
>
> Hi,
>
> We have the following environment setup for ZooKeeper/SolrCloud:
>
> 3 zookeeper ensemble
> 2 Solr cloud servers
>
> I am writing to inquire further about the interaction of solr and
> zookeeper, in particular relating to transactions in the transaction logs. I
> have a script running that logs the number of transactions. I am matching
> this log with snapshot timing and new log creation.
>
> After a problem arose in our PROD environment, I have tracked it to an
> unrecommended configuration where logs and data were kept on the same drive.
> Since then we have configured separate drives for logs and data in that
> environment. The behavior that caused the problem was that, while a snapshot
> was happening, a solr instance reported that it was unable to establish a ZK
> leader. Following that failure, during recovery, 4 more snapshots happened
> in short succession (10 minutes) on all 3 zk servers, causing the whole
> environment to be unresponsive for 1.5 hours, until restart.
>
> I am currently working to recreate the problem and gather more information on 
> the cause and impact of snapshots. I have configured a DEV environment with 
> the same number of servers. I have changed the zk configuration to again have 
> the logs and data in the same drive and directory. I am seeing that snapshots 
> cause a degradation in performance due to IO blocking, but I would like more
> information on transactions and snapshots to confirm this behavior and our 
> suspicions.
>
> Here are the scenarios I would like more information about:
>
> 1.       When the solr server is restarted, I see a huge influx of 
> transactions on the zookeeper transaction log. What is the solr behavior that 
> is causing this and is this normal?
>
> 2.       There are scenarios where snapshots are being created without
> reaching "snapCount" (snapCount=100000) transactions. I have documented 
> snapshots at 17k and 45k transactions. In what scenarios would a snapshot be 
> created other than reaching "snapCount" transactions?
>
> 3.       Since zk won't respond before writing to the transaction log... at
> snapshot time (IO block), is it possible for the solr server to wait for a
> response from zk causing all other writes to be buffered resulting in a full 
> heap and therefore an out of memory failure on the solr node?
>
> a.       Now referencing question #1... When a solr node recovers, the influx 
> of transactions plus the continuing writes seems to be enough to trigger 
> another snapshot resulting in further downtime. Is this case plausible?
>
> Thanks,
> Jacek
