Sure.  Here is our SolrCloud cluster:

   + Three (3) instances of Zookeeper on three separate (physical) servers.  
The ZK servers are beefy and fairly recently built, with 2x10 GigE (bonded) 
Ethernet connectivity to the rest of the data center.  We recognize the 
importance of ZK stability and responsiveness to the stability of SolrCloud as 
a whole.

   + 364 collections, each with a single shard and a replication factor of 3, 
created roughly as sketched just after this list.  Currently housing only 
100,000,000 documents in aggregate, but expected to grow to 25 billion+.  A 
single document would be considered “large” by the standards of what I’ve seen 
posted elsewhere on this mailing list.
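
For reference, each collection is created the same way; a rough sketch of the 
Collections API call we use (the host and collection name below are 
placeholders, not our real ones):

    # placeholder host and collection name
    SOLR=http://solr-host:8983/solr
    curl "$SOLR/admin/collections?action=CREATE&name=coll_001&numShards=1&replicationFactor=3"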

We are always open to ZK recommendations from you or anyone else, particularly 
for running a SolrCloud cluster of this size.
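
In case the specifics help, our zoo.cfg is essentially the stock three-node 
ensemble layout, roughly like this (hostnames and the data dir are 
placeholders):

    # zoo.cfg -- placeholder hostnames and paths
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888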

Kind Regards,

David



On 1/27/16, 12:46 PM, "Jeff Wartes" <jwar...@whitepages.com> wrote:

>
>If you can identify the problem documents, you can just re-index those after 
>forcing a sync. Might save a full rebuild and downtime.
>
>You might describe your cluster setup, including ZK.  It sounds like you’ve 
>done your research, but improper ZK node distribution could certainly 
>invalidate some of Solr’s assumptions.
>
>
>
>
>On 1/27/16, 7:59 AM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:
>
>>Jeff, again, I very much appreciate your feedback.  
>>
>>It is interesting — the article you linked to by Shalin is exactly why we 
>>picked SolrCloud over ES, because (eventual) consistency is critical for our 
>>application and we will sacrifice availability for it.  To be clear, after 
>>the outage, NONE of our three replicas are correct or complete.
>>
>>So we definitely don’t have CP yet — our very first network outage resulted 
>>in multiple overlapping lost updates.  As a result, I can’t pick one replica 
>>and make it the new “master”.  I must rebuild this collection from scratch, 
>>which I can do, but that requires downtime, which is a problem for our app 
>>(24/7 High Availability with few maintenance windows).
>>
>>
>>So, I definitely need to “fix” this somehow.  I wish I could outline a 
>>reproducible test case, but as the root cause is likely very tight timing 
>>issues and complicated interactions with Zookeeper, that is not really an 
>>option.  I’m happy to share the full logs of all 3 replicas though if that 
>>helps.
>>
>>I am curious, though, whether thinking has changed since 
>>https://issues.apache.org/jira/browse/SOLR-5468 about seriously considering a 
>>“majority quorum” model with rollback.  Done properly, this should be free 
>>of all lost-update problems, at the cost of availability.  Some SolrCloud 
>>users (like us!!!) would gladly accept that tradeoff.  
>>
>>Regards
>>
>>David
>>
>>
