Hi Nitin,

There's already an issue for breaking up the clusterstate.json. Here's the link:
https://issues.apache.org/jira/browse/SOLR-5473

A lot of work has already been done on that one, and hopefully it should be in trunk soon.

On Wed, Aug 13, 2014 at 3:13 PM, KNitin <nitin.t...@gmail.com> wrote:

> Thanks, Mark. Yes, I keep track of the overseer and restart it at the end.
> The only thing that I observe is that as the ZooKeeper cluster state file
> grows, this behavior gets worse. I notice the following issues:
>
> 1. Two nodes (different replicas for the same shard) get stuck in the
> recovering state without either becoming a leader. I thought ZK was meant
> to break ties, but it doesn't help.
> 2. If the recovery fails on a replica, it gets stuck retrying for a very
> long time (on the order of tens of minutes) before finally giving
> up/recovering.
> 3. There have been cases where 1000 collections restart successfully but
> take over 2 hours (because of #2).
>
> The cluster state JSON file is continuously being updated as the cluster
> restarts (to update core status). Has anyone seen this being a big
> bottleneck? Does ZooKeeper locking files for writes cause a huge issue
> while restarting Solr?
>
> Also a side question: why do we need to have a global cluster state JSON?
> Is it better to break it down into a per-collection state JSON file?
>
> Thanks for all your help!
> Nitin
>
> On Wed, Aug 13, 2014 at 9:15 AM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> That is good testing :) We should track down what is up with that 30%.
>> Might open a JIRA with some logs.
>>
>> It can help if you restart the overseer node last.
>>
>> There are likely some improvements around this post 4.6.
>>
>> --
>> Mark Miller
>> about.me/markrmiller
>>
>> On August 13, 2014 at 12:05:27 PM, KNitin (nitin.t...@gmail.com) wrote:
>> > Thank you all! Yes, I want to disable it for testing purposes.
>> >
>> > The main issue is that rolling restart of SolrCloud for 1000 collections is
>> > extremely unreliable and slow. More than 30% of the collections fail to
>> > recover.
>> >
>> > What are some good guidelines to follow while restarting a massive cluster
>> > like this?
>> >
>> > Are there any new improvements (post 4.6) in Solr that help restarts
>> > be more robust?
>> >
>> > Thanks
>> >
>> > On Sunday, August 10, 2014, rulinma wrote:
>> >
>> > > good.
>> > >
>> > > --
>> > > View this message in context:
>> > > http://lucene.472066.n3.nabble.com/Disabling-transaction-logs-tp4151721p4152222.html
>> > > Sent from the Solr - User mailing list archive at Nabble.com.
>> > >
>> >
>>

--
Anshum Gupta
http://www.anshumgupta.net