Re: Solr 7 Nodes Stuck in "Gone" State

Shawn Heisey Mon, 29 Apr 2019 22:04:01 -0700

On 4/29/2019 10:55 AM, Marko Babic wrote:

Thanks Shawn.


Yes, all Solr nodes know about all three ZK servers (i.e., the zk host string 
is of the form zk_a_ip:2181,zk_b_ip:2181,zk_c_ip:2181).

Sorry for the dense description of things: I erred on the side of oversharing 
because I didn't want to leave out something useful but I know it makes for an 
investment to read so I really appreciate that you took the time. I'm obviously 
happy to clarify whatever I can.


Trying to trace everything is making my head hurt. :)

Part of the problem is that I do not really know all that much about ZK's internal operation.

I do know that ZK clients maintain continuous connections (as long as they are able) to all of the servers in the zkhost string. I'm guessing that if one of the servers it can reach has been elected leader on the ensemble, it will be preferred to all others for that client to talk to.

My reading says that ephemeral nodes should be deleted whenever the client-server connection is lost for any reason. If I read your writeup correctly, somehow the network partition is interfering with this process... the ephemeral node probably is deleted on A (the leader when the partition begins) but it is not deleted on the new leader. This does sound like ZOOKEEPER-2348.

We probably need to take a look at how Solr handles its /live_nodes entries. I have not looked at this code, and have no idea how it works, but here is what I can think of:

Perhaps each Solr node should update its ephemeral node on a timed interval, say every 5 seconds. Longer if the update operation creates a lot of I/O. If a "node exists" exception is encountered when trying to create the node, the Solr node should check the last updated timestamp, and once that timestamp reaches an age of 30 or 60 seconds (definitely configurable), the Solr node should assume that it's safe to delete and recreate. The log message for this ought to be at WARN or ERROR (probably WARN) so it is visible in the admin UI. If some of the other devs who live in the SolrCloud code could offer a review of this idea, I would appreciate it.
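
To make the idea concrete, here is a rough sketch of the create / check age / delete-and-recreate logic. It is Python against an in-memory stand-in rather than a real ZooKeeper client, and it is not how Solr currently works. FakeZk, NodeExistsError, register_live_node and the stale_after parameter are all names made up for illustration; a real version would use the ZK client's own create, delete and stat calls.

```python
import logging
from dataclasses import dataclass, field

log = logging.getLogger("live_nodes_sketch")


class NodeExistsError(Exception):
    """Raised when create() hits a path that is already present."""


@dataclass
class FakeZk:
    """Hypothetical in-memory stand-in for ZooKeeper: path -> last-updated time (seconds)."""
    nodes: dict = field(default_factory=dict)

    def create(self, path, now):
        if path in self.nodes:
            raise NodeExistsError(path)
        self.nodes[path] = now

    def touch(self, path, now):
        # The periodic "I'm still alive" update (e.g. every 5 seconds).
        self.nodes[path] = now

    def delete(self, path):
        self.nodes.pop(path, None)

    def last_updated(self, path):
        return self.nodes[path]


def register_live_node(zk, path, now, stale_after=30.0):
    """Try to claim `path`; return True if the caller now owns a fresh node."""
    try:
        zk.create(path, now)
        return True
    except NodeExistsError:
        age = now - zk.last_updated(path)
        if age < stale_after:
            # Someone refreshed the node recently; leave it alone and retry later.
            return False
        # Nobody has refreshed the node for too long: assume it is a leftover.
        log.warning("live node %s not updated for %.0fs; deleting and recreating", path, age)
        zk.delete(path)
        zk.create(path, now)
        return True
```

The caller would retry on a False return until either the old entry goes away or it ages past the threshold. The sketch ignores races between the age check and the delete, which a real implementation would have to handle.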

In theory, two different Solr instances should never be trying to create the same ephemeral znode. In environments where servers are automatically provisioned and started, I suppose it could happen.

I've updated the ZK issue with info from this thread. I hope they can comment on that.


Thanks,
Shawn
