Hi Nitin,

There's already an issue for splitting up clusterstate.json. Here's the link:
https://issues.apache.org/jira/browse/SOLR-5473

A lot of work has already been done on that one, and hopefully it
should be in trunk soon.
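
In the meantime, the suggestion quoted below (restart the overseer node last)
can be sketched as a small ordering helper. This is only an illustration, not
anything shipped with Solr: the function name, the node-name strings, and the
assumption that you already know which node currently holds the overseer role
are all hypothetical.

```python
def restart_order(nodes, overseer_node):
    """Return the nodes in rolling-restart order, with the overseer last.

    `nodes` is a list of node names and `overseer_node` is the name of the
    node currently acting as overseer. Both are assumed to be known already;
    this helper does not talk to Solr or ZooKeeper.
    """
    if overseer_node not in nodes:
        raise ValueError("overseer node is not in the node list")
    # Keep the original relative order for every non-overseer node.
    others = [n for n in nodes if n != overseer_node]
    # The overseer is restarted only after all other nodes are done.
    return others + [overseer_node]


if __name__ == "__main__":
    # Hypothetical node names, in the style host:port_context.
    cluster = ["solr1:8983_solr", "solr2:8983_solr", "solr3:8983_solr"]
    for node in restart_order(cluster, "solr2:8983_solr"):
        print("restart", node)
```

A restart script would walk this list one node at a time and wait for that
node's replicas to become active before moving on to the next one.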


On Wed, Aug 13, 2014 at 3:13 PM, KNitin <nitin.t...@gmail.com> wrote:
> Thanks, Mark. Yes, I keep track of the overseer and restart it at the end.
> The only thing that I observe is that as the ZooKeeper cluster state file
> grows, this behavior gets worse. I notice the following issues:
>
>    1. Two nodes (different replicas for the same shard) get stuck in the
>    recovering state without either becoming the leader. I thought ZooKeeper
>    was meant to break ties, but it doesn't help.
>    2. If the recovery fails on a replica, it gets stuck retrying for a very
>    long time (on the order of tens of minutes) before it finally gives
>    up or recovers.
>    3. There have been cases where 1000 collections restart successfully but
>    it takes over 2 hours (because of #2).
>
> The cluster state JSON file is continuously being updated as the cluster
> restarts (to update core status). Has anyone seen this become a big
> bottleneck? Does ZooKeeper locking files for writes cause a huge issue
> while restarting Solr?
>
> Also a side question: why do we need to have a global cluster state JSON?
> Would it be better to break it down into a per-collection state JSON file?
>
> Thanks for all your help!
> Nitin
>
>
>
>
> On Wed, Aug 13, 2014 at 9:15 AM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> That is good testing :) We should track down what is up with that 30%.
>> Might open a JIRA with some logs.
>>
>> It can help if you restart the overseer node last.
>>
>> There are likely some improvements around this post 4.6.
>>
>> --
>> Mark Miller
>> about.me/markrmiller
>>
>> On August 13, 2014 at 12:05:27 PM, KNitin (nitin.t...@gmail.com) wrote:
>> > Thank you all! Yes, I want to disable it for testing purposes.
>> >
>> > The main issue is that a rolling restart of SolrCloud for 1000
>> > collections is extremely unreliable and slow. More than 30% of the
>> > collections fail to recover.
>> >
>> > What are some good guidelines to follow while restarting a massive
>> > cluster like this?
>> >
>> > Are there any new improvements (post 4.6) in Solr that help restarts
>> > be more robust?
>> >
>> > Thanks
>> >
>> > On Sunday, August 10, 2014, rulinma wrote:
>> >
>> > > good.
>> > >
>> > >
>> > >
>> > > --
>> > > View this message in context:
>> > >
>> http://lucene.472066.n3.nabble.com/Disabling-transaction-logs-tp4151721p4152222.html
>> > > Sent from the Solr - User mailing list archive at Nabble.com.
>> > >
>> >
>>
>>



-- 

Anshum Gupta
http://www.anshumgupta.net
