I also saw an overseer queue get clogged by a bad message in the queue.
Unfortunately this went unnoticed until there were 130K messages in the
overseer queue. Since it was a production system, we could not simply stop
everything and delete all ZooKeeper data, so we deleted the messages
manually by issuing commands directly through the zkCli.sh tool. After all
the messages had been cleared, some nodes were in the wrong state (e.g.
'down' when they should have been 'active'). Restarting the 'down' or
'recovery failed' nodes brought the whole cluster back to a stable and
healthy state.
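
For anyone facing the same cleanup, here is a rough sketch of what the manual
deletion amounts to. It is not what we ran (we used zkCli.sh by hand). It
assumes the queue lives at /overseer/queue and that `zk` is a client exposing
get_children() and delete() the way the kazoo Python client does. The function
names, the dry_run flag and the localhost:2181 address are hypothetical.

```python
def clear_queue(zk, path="/overseer/queue", dry_run=True):
    """Delete every child znode directly under `path`.

    Returns the child names that were (or, in a dry run, would be) deleted.
    `path` is an assumption; check where your Solr version keeps the queue.
    """
    children = sorted(zk.get_children(path))
    for name in children:
        if not dry_run:
            # Children of the queue are leaf znodes, so a plain delete is enough.
            zk.delete(f"{path}/{name}")
    return children


def clear_queue_on(hosts="localhost:2181", dry_run=True):
    """Hypothetical wrapper: connect with kazoo, clear the queue, disconnect."""
    from kazoo.client import KazooClient  # third-party: pip install kazoo

    zk = KazooClient(hosts=hosts)
    zk.start()
    try:
        return clear_queue(zk, dry_run=dry_run)
    finally:
        zk.stop()
```

Running it with dry_run=True first shows what would be removed before anything
is touched. As in our case, expect to check node states afterwards and restart
any that are stuck.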

Since it can take some digging to determine whether there is a backlog in
the overseer queue, here are some of the symptoms we saw:
- The Overseer throwing an exception like "Path must not end with /
  character"
- Random nodes throwing an exception like "ClusterState says we are the
  leader, but locally we don't think so"
- New replicas timing out on startup when attempting to fetch their shard id
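
A quicker check than waiting for these symptoms is to look at the queue size
directly. The sketch below is only an illustration: it assumes the queue is at
/overseer/queue and that `zk` behaves like a kazoo client, whose exists() call
returns a stat with a numChildren field (or None if the path is missing). The
function names and the threshold of 1000 are made up for the example.

```python
def queue_backlog(zk, path="/overseer/queue"):
    """Return the number of pending messages (child znodes) under `path`.

    Reads the child count from the znode's stat instead of listing every
    child, which stays cheap even when the queue is very large.
    """
    stat = zk.exists(path)
    return 0 if stat is None else stat.numChildren


def is_backlogged(zk, threshold=1000, path="/overseer/queue"):
    """True when the queue holds more than `threshold` messages.

    The default threshold is arbitrary; pick one that suits your cluster.
    """
    return queue_backlog(zk, path) > threshold
```

Polling something like this from a monitoring job would have flagged our
backlog long before it reached 130K messages.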


