That case related to consistency after a ZK outage or network connectivity 
issue. Your case is standard operation, so I'm not sure it's really the same 
thing. I'm aware of a few issues that can happen if ZK connectivity goes wonky, 
which I hope are fixed in SOLR-8697.
This one might be a closer match to your problem though: 
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201604.mbox/%3CCAOWq+=iePCJjnQiSqxgDVEPv42Pi7RUtw0X0=9f67mpcm99...@mail.gmail.com%3E




On 5/19/16, 9:10 AM, "Aleksey Mezhva" <aleksey.mez...@wgsn.com> wrote:

>Bump.
>
>This thread is from someone having a similar issue:
>
>https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201601.mbox/%3c09fdab82-7600-49e0-b639-9cb9db937...@yahoo.com%3E
>
>It seems like this is not really fixed in 5.4/6.0?
>
>
>Aleksey
>
>From: Steve Weiss <steve.we...@wgsn.com>
>Date: Tuesday, May 17, 2016 at 7:25 PM
>To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>Cc: Aleksey Mezhva <aleksey.mez...@wgsn.com>, Hans Zhou <hans.z...@wgsn.com>
>Subject: Re: SolrCloud replicas consistently out of sync
>
>Gotcha - well that's nice.  Still, we seem to be permanently out of sync.
>
>I see this thread with someone having a similar issue:
>
>https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201601.mbox/%3c09fdab82-7600-49e0-b639-9cb9db937...@yahoo.com%3E
>
>It seems like this is not really fixed in 5.4/6.0?  Is there any version of 
>SolrCloud where this wasn't yet a problem that we could downgrade to?
>
>--
>Steve
>
>On Tue, May 17, 2016 at 6:23 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>Hi, that's a known issue and unrelated:
>https://issues.apache.org/jira/browse/SOLR-9120
>
>M.
>
>
>-----Original message-----
>> From: Stephen Weiss <steve.we...@wgsn.com>
>> Sent: Tuesday 17th May 2016 23:10
>> To: solr-user@lucene.apache.org; Aleksey Mezhva <aleksey.mez...@wgsn.com>; 
>> Hans Zhou <hans.z...@wgsn.com>
>> Subject: Re: SolrCloud replicas consistently out of sync
>>
>> I should add - looking back through the logs, we're seeing frequent errors 
>> like this now:
>>
>> 78819692 WARN  (qtp110456297-1145) [   ] o.a.s.h.a.LukeRequestHandler Error 
>> getting file length for [segments_4o]
>> java.nio.file.NoSuchFileException: 
>> /var/solr/data/instock_shard5_replica1/data/index.20160516230059221/segments_4o
>>
>> --
>> Steve
>>
>>
>> On Tue, May 17, 2016 at 5:07 PM, Stephen Weiss <steve.we...@wgsn.com> wrote:
>> OK, so we did as you suggested, read through that article, and reconfigured 
>> the autocommit to:
>>
>> <autoCommit>
>> <maxTime>${solr.autoCommit.maxTime:30000}</maxTime>
>> <openSearcher>false</openSearcher>
>> </autoCommit>
>>
>> <autoSoftCommit>
>> <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
>> </autoSoftCommit>
>>
>> However, we see no change, aside from the fact that it's clearly committing 
>> more frequently.  I will say that on our end we clearly misunderstood the 
>> difference between soft and hard commit, but even with it configured this 
>> way, we are still totally out of sync, long after all indexing has 
>> completed (it's been about 30 minutes now).  We manually pushed through a 
>> commit on the whole collection as suggested; however, all we get back for 
>> that is o.a.s.u.DirectUpdateHandler2 No uncommitted changes. Skipping 
>> IW.commit., which makes sense, because it was all committed already anyway.
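>>
>> For completeness, this is roughly how we trigger that collection-wide commit and
>> inspect the response (a minimal Python sketch; the host name is a placeholder for
>> our environment):
>>
>> import json
>> import urllib.request
>>
>> # Hard commit across the whole collection; Solr routes it to every shard.
>> url = "http://solr-host:8983/solr/instock/update?commit=true&wt=json"
>> with urllib.request.urlopen(url) as resp:
>>     print(json.dumps(json.load(resp), indent=2))  # expect status 0 in responseHeader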
>>
>> We still currently have all shards mismatched:
>>
>> instock_shard1   replica 1: 30788491 replica 2: 30778865
>> instock_shard10   replica 1: 30973059 replica 2: 30971874
>> instock_shard11   replica 2: 31036815 replica 1: 31034715
>> instock_shard12   replica 2: 30177084 replica 1: 30170511
>> instock_shard13   replica 2: 30608225 replica 1: 30603923
>> instock_shard14   replica 2: 30755739 replica 1: 30753191
>> instock_shard15   replica 2: 30891713 replica 1: 30891528
>> instock_shard16   replica 1: 30818567 replica 2: 30817152
>> instock_shard17   replica 1: 30423877 replica 2: 30422742
>> instock_shard18   replica 2: 30874979 replica 1: 30872223
>> instock_shard19   replica 2: 30917208 replica 1: 30909999
>> instock_shard2   replica 1: 31062339 replica 2: 31060575
>> instock_shard20   replica 1: 30192046 replica 2: 30190893
>> instock_shard21   replica 2: 30793817 replica 1: 30791135
>> instock_shard22   replica 2: 30821521 replica 1: 30818836
>> instock_shard23   replica 2: 30553773 replica 1: 30547336
>> instock_shard24   replica 1: 30975564 replica 2: 30971170
>> instock_shard25   replica 1: 30734696 replica 2: 30731682
>> instock_shard26   replica 1: 31465696 replica 2: 31464738
>> instock_shard27   replica 1: 30844884 replica 2: 30842445
>> instock_shard28   replica 2: 30549826 replica 1: 30547405
>> instock_shard29   replica 2: 30637777 replica 1: 30634091
>> instock_shard3   replica 1: 30930723 replica 2: 30926483
>> instock_shard30   replica 2: 30904528 replica 1: 30902649
>> instock_shard31   replica 2: 31175813 replica 1: 31174921
>> instock_shard32   replica 2: 30932837 replica 1: 30926456
>> instock_shard4   replica 2: 30758100 replica 1: 30754129
>> instock_shard5   replica 2: 31008893 replica 1: 31002581
>> instock_shard6   replica 2: 31008679 replica 1: 31005380
>> instock_shard7   replica 2: 30738468 replica 1: 30737795
>> instock_shard8   replica 2: 30620929 replica 1: 30616715
>> instock_shard9   replica 1: 31071386 replica 2: 31066956
>>
>> The fact that the min_rf numbers aren't coming back as 2 seems to indicate 
>> to me that documents simply aren't making it to both replicas - why would 
>> that have anything to do with committing anyway?
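>>
>> To make the min_rf behavior concrete, here is a minimal sketch (Python, standard
>> library only) of the kind of update we send with min_rf=2 and how we read back the
>> achieved replication factor; the host and the document fields are placeholders, and
>> exactly where "rf" appears in the response may vary by Solr version:
>>
>> import json
>> import urllib.request
>>
>> docs = [{"id": "12345", "type": "product"}]  # placeholder document
>> url = "http://solr-host:8983/solr/instock/update?min_rf=2&wt=json"
>> req = urllib.request.Request(url, data=json.dumps(docs).encode("utf-8"),
>>                              headers={"Content-Type": "application/json"})
>> with urllib.request.urlopen(req) as resp:
>>     header = json.load(resp).get("responseHeader", {})
>>     print("achieved rf:", header.get("rf"))  # we ask for 2 but almost always see 1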
>>
>> Something else is amiss here.  Too bad, committing sounded like an easy 
>> answer!
>>
>> --
>> Steve
>>
>>
>> On Tue, May 17, 2016 at 11:39 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>> OK, these autocommit settings need revisiting.
>>
>> First off, I'd remove the maxDocs setting entirely, although with the value
>> you're using it probably doesn't matter.
>>
>> The maxTime of 1,200,000 ms is 20 minutes, which means that if you ever
>> un-gracefully kill your shards you'll have up to 20 minutes' worth of
>> data to replay from the tlog... or resync from the leader. Make this
>> much shorter (60000 or less) and be sure to shut your Solrs down
>> gracefully; no "kill -9", for instance.
>>
>> To be sure, before you bounce servers, try either waiting 20 minutes
>> after indexing stops or issuing a manual commit before shutting
>> down your servers with
>> http://..../solr/collection/update?commit=true
>>
>> I have a personal annoyance with the bin/solr script in that it forcefully
>> (ungracefully) kills Solr after 5 seconds. I think this is much too short,
>> so you might consider making it longer in prod; it's a shell script, so
>> it's easy to change.
>>
>> <autoCommit>
>> <maxTime>${solr.autoCommit.maxTime:1200000}</maxTime>
>> <maxDocs>${solr.autoCommit.maxDocs:1000000000}</maxDocs>
>> <openSearcher>false</openSearcher>
>> </autoCommit>
>>
>>
>> This is probably the crux of "shards being out of sync". They're _not_
>> out of sync; it's just that some of them have docs visible to searches
>> and some do not, since the wall-clock times at which these are triggered
>> are _not_ the same. So you have a 10-minute window where two or more
>> replicas for a single shard appear out of sync.
>>
>>
>> <autoSoftCommit>
>> <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
>> </autoSoftCommit>
>>
>> You can test all this in one of two ways:
>> 1> if you have a timestamp recording when the docs were indexed, check whether
>> the replicas of each shard match when you run a query like
>> q=*:*&fq=timestamp:[* TO NOW-15MINUTES]
>> 2> or, if indexing is _not_ occurring, issue a manual commit like
>> .../solr/collection/update?commit=true
>> and see if all the replicas match for each shard.
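>>
>> If you want to script that check, something along these lines works (an untested
>> sketch, stdlib Python; the hosts, core names, and "timestamp" field are whatever
>> your layout actually uses). It hits each replica core directly with distrib=false
>> and compares counts for the same time-bounded query:
>>
>> import json
>> import urllib.parse
>> import urllib.request
>>
>> def count(base_url, core):
>>     # distrib=false keeps the query on this one core instead of fanning out
>>     params = urllib.parse.urlencode({
>>         "q": "*:*",
>>         "fq": "timestamp:[* TO NOW-15MINUTES]",
>>         "rows": 0,
>>         "distrib": "false",
>>         "wt": "json",
>>     })
>>     url = "%s/%s/select?%s" % (base_url, core, params)
>>     return json.load(urllib.request.urlopen(url))["response"]["numFound"]
>>
>> r1 = count("http://host1:8983/solr", "instock_shard1_replica1")
>> r2 = count("http://host2:8983/solr", "instock_shard1_replica2")
>> print("match" if r1 == r2 else "mismatch: %d vs %d" % (r1, r2))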
>>
>> Here's a long blog on commits:
>> https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>
>> Best,
>> Erick
>>
>> On Tue, May 17, 2016 at 8:18 AM, Stephen Weiss <steve.we...@wgsn.com> wrote:
>> > Yes, after startup there was a recovery process, you are right.  It's just 
>> > that this process doesn't seem to happen unless we do a full restart.
>> >
>> > These are our autocommit settings - to be honest, we did not really use 
>> > autocommit until we switched to SolrCloud, so it's entirely possible they 
>> > are not very good settings.  We wanted to minimize the frequency of 
>> > commits because they seem to create a performance drag during 
>> > indexing.  Perhaps we've gone overboard?
>> >
>> > <autoCommit>
>> > <maxTime>${solr.autoCommit.maxTime:1200000}</maxTime>
>> > <maxDocs>${solr.autoCommit.maxDocs:1000000000}</maxDocs>
>> > <openSearcher>false</openSearcher>
>> > </autoCommit>
>> > <autoSoftCommit>
>> > <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
>> > </autoSoftCommit>
>> >
>> > By nodes, I am indeed referring to machines.  There are 8 shards per 
>> > machine (2 replicas of each), all in one JVM apiece.  We haven't 
>> > configured anything special for the log timestamps - they are just 
>> > whatever happens by default.
>> >
>> > --
>> > Steve
>> >
>> > On Mon, May 16, 2016 at 11:50 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> > OK, this is very strange. There's no _good_ reason that
>> > restarting the servers should make a difference. The fact
>> > that it took 1/2 hour leads me to believe, though, that your
>> > shards are somehow "incomplete", especially since you
>> > are indexing to the system and may not have, say,
>> > your autocommit settings tuned very well. The long startup
>> > implies (guessing) that you have pretty big tlogs that
>> > are replayed upon startup. While these were coming up,
>> > did you see any of the shards in the "recovering" state? That's
>> > the only way I can imagine that Solr "healed" itself.
>> >
>> > I've got to point back to the Solr logs. Are they showing
>> > any anomalies? Are any nodes in recovery when you restart?
>> >
>> > Best,
>> > Erick
>> >
>> >
>> >
>> > On Mon, May 16, 2016 at 4:14 PM, Stephen Weiss <steve.we...@wgsn.com> wrote:
>> >> Just one more note - while experimenting, I found that if I stop all 
>> >> nodes (full cluster shutdown) and then start them all back up, they do in 
>> >> fact seem to repair themselves.  We have a script to monitor the 
>> >> differences between replicas (just looking at numDocs), and before the 
>> >> full shutdown / restart, we had:
>> >>
>> >> wks53104:Downloads sweiss$ php testReplication.php
>> >> Found 32 mismatched shard counts.
>> >> instock_shard1   replica 1: 30785553 replica 2: 30777568
>> >> instock_shard10   replica 1: 30972662 replica 2: 30966215
>> >> instock_shard11   replica 2: 31036718 replica 1: 31033547
>> >> instock_shard12   replica 1: 30179823 replica 2: 30176067
>> >> instock_shard13   replica 2: 30604638 replica 1: 30599219
>> >> instock_shard14   replica 2: 30755117 replica 1: 30753469
>> >> instock_shard15   replica 2: 30891325 replica 1: 30888771
>> >> instock_shard16   replica 1: 30818260 replica 2: 30811728
>> >> instock_shard17   replica 1: 30422080 replica 2: 30414666
>> >> instock_shard18   replica 2: 30874530 replica 1: 30869977
>> >> instock_shard19   replica 2: 30917008 replica 1: 30913715
>> >> instock_shard2   replica 1: 31062073 replica 2: 31057583
>> >> instock_shard20   replica 1: 30188774 replica 2: 30186565
>> >> instock_shard21   replica 2: 30789012 replica 1: 30784160
>> >> instock_shard22   replica 2: 30820473 replica 1: 30814822
>> >> instock_shard23   replica 2: 30552105 replica 1: 30545802
>> >> instock_shard24   replica 1: 30973906 replica 2: 30971314
>> >> instock_shard25   replica 1: 30732287 replica 2: 30724988
>> >> instock_shard26   replica 1: 31465543 replica 2: 31463414
>> >> instock_shard27   replica 2: 30845514 replica 1: 30842665
>> >> instock_shard28   replica 2: 30549151 replica 1: 30543070
>> >> instock_shard29   replica 2: 30635711 replica 1: 30629240
>> >> instock_shard3   replica 1: 30930400 replica 2: 30928438
>> >> instock_shard30   replica 2: 30902221 replica 1: 30895176
>> >> instock_shard31   replica 2: 31174246 replica 1: 31169998
>> >> instock_shard32   replica 2: 30931550 replica 1: 30926256
>> >> instock_shard4   replica 2: 30755525 replica 1: 30748922
>> >> instock_shard5   replica 2: 31006601 replica 1: 30994316
>> >> instock_shard6   replica 2: 31006531 replica 1: 31003444
>> >> instock_shard7   replica 2: 30737098 replica 1: 30727509
>> >> instock_shard8   replica 2: 30619869 replica 1: 30609084
>> >> instock_shard9   replica 1: 31067833 replica 2: 31061238
>> >>
>> >>
>> >> This stayed consistent for several hours.
>> >>
>> >> After restart:
>> >>
>> >> wks53104:Downloads sweiss$ php testReplication.php
>> >> Found 3 mismatched shard counts.
>> >> instock_shard19   replica 2: 30917008 replica 1: 30913715
>> >> instock_shard22   replica 2: 30820473 replica 1: 30814822
>> >> instock_shard26   replica 1: 31465543 replica 2: 31463414
>> >> wks53104:Downloads sweiss$ php testReplication.php
>> >> Found 2 mismatched shard counts.
>> >> instock_shard19   replica 2: 30917008 replica 1: 30913715
>> >> instock_shard26   replica 1: 31465543 replica 2: 31463414
>> >> wks53104:Downloads sweiss$ php testReplication.php
>> >> Everything looks peachy
>> >>
>> >> Took about a half hour to get there.
>> >>
>> >> Maybe the question should be - is there any way to get SolrCloud to trigger 
>> >> this *without* having to shut down / restart all nodes?  Even if we had to 
>> >> trigger it manually after indexing, that would be fine.  It's a very 
>> >> controlled indexing workflow that only happens once a day.
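>> >>
>> >> For reference, a rough Python equivalent of what our testReplication.php does
>> >> (the host is a placeholder): pull CLUSTERSTATUS, query every replica core with
>> >> distrib=false, and report the shards whose replicas disagree on numFound:
>> >>
>> >> import json
>> >> import urllib.request
>> >>
>> >> SOLR = "http://solr-host:8983/solr"
>> >> COLLECTION = "instock"
>> >>
>> >> def get_json(url):
>> >>     return json.load(urllib.request.urlopen(url))
>> >>
>> >> status = get_json(SOLR + "/admin/collections?action=CLUSTERSTATUS"
>> >>                   "&collection=" + COLLECTION + "&wt=json")
>> >> shards = status["cluster"]["collections"][COLLECTION]["shards"]
>> >> mismatched = 0
>> >> for shard, info in sorted(shards.items()):
>> >>     counts = {}
>> >>     for replica in info["replicas"].values():
>> >>         # query this core only (distrib=false) and record its doc count
>> >>         url = ("%s/%s/select?q=*:*&rows=0&distrib=false&wt=json"
>> >>                % (replica["base_url"].rstrip("/"), replica["core"]))
>> >>         counts[replica["core"]] = get_json(url)["response"]["numFound"]
>> >>     if len(set(counts.values())) > 1:
>> >>         mismatched += 1
>> >>         print(shard, counts)
>> >> print("Found %d mismatched shard counts." % mismatched)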
>> >>
>> >> --
>> >> Steve
>> >>
>> >> On Mon, May 16, 2016 at 6:52 PM, Stephen Weiss <steve.we...@wgsn.com> wrote:
>> >> Each node has one JVM with 16GB of RAM.  Are you suggesting we would put 
>> >> each shard into a separate JVM (something like 32 nodes)?
>> >>
>> >> We aren't encountering any OOMs.  We are testing this in a separate cloud 
>> >> which no one is even using; the only activity is this very small amount 
>> >> of indexing, and still we see this problem.  In the logs, there are no 
>> >> errors at all.  It's almost like none of the recovery features that 
>> >> people say are in Solr are actually there at all.  I can't find any 
>> >> evidence that Solr is even attempting to keep the replicas in sync.
>> >>
>> >> There are no real errors in the Solr log.  I do see some warnings at 
>> >> system startup:
>> >>
>> >> http://pastie.org/private/thz0fbzcxgdreeeune8w
>> >>
>> >> These lines in particular look interesting:
>> >>
>> >> 16925 INFO  
>> >> (recoveryExecutor-3-thread-4-processing-n:172.20.140.173:8983_solr 
>> >> x:instock_shard15_replica1 s:shard15 c:instock r:core_node31) [c:instock 
>> >> s:shard15 r:core_node31 x:instock_shard15_replica1] o.a.s.u.PeerSync 
>> >> PeerSync: core=instock_shard15_replica1 
>> >> url=http://172.20.140.173:8983/solr  Received 0 versions from 
>> >> http://172.20.140.172:8983/solr/instock_shard15_replica2/ 
>> >> fingerprint:{maxVersionSpecified=9223372036854775807, 
>> >> maxVersionEncountered=1534492620385943552, maxInHash=1534492620385943552, 
>> >> versionsHash=-6845461210912808581, numVersions=30888332, 
>> >> numDocs=30888332, maxDoc=37699007}
>> >> 16925 INFO  
>> >> (recoveryExecutor-3-thread-4-processing-n:172.20.140.173:8983_solr 
>> >> x:instock_shard15_replica1 s:shard15 c:instock r:core_node31) [c:instock 
>> >> s:shard15 r:core_node31 x:instock_shard15_replica1] o.a.s.u.PeerSync 
>> >> PeerSync: core=instock_shard15_replica1 
>> >> url=http://172.20.140.173:8983/solr DONE. sync failed
>> >> 16925 INFO  
>> >> (recoveryExecutor-3-thread-4-processing-n:172.20.140.173:8983_solr 
>> >> x:instock_shard15_replica1 s:shard15 c:instock r:core_node31) [c:instock 
>> >> s:shard15 r:core_node31 x:instock_shard15_replica1] 
>> >> o.a.s.c.RecoveryStrategy PeerSync Recovery was not successful - trying 
>> >> replication.
>> >>
>> >> This is the first node to start up, so most of the other shards are not 
>> >> there yet.
>> >>
>> >> On another node (the last node to start up), it looks similar but a 
>> >> little different:
>> >>
>> >> http://pastie.org/private/xjw0ruljcurdt4xpzqk6da
>> >>
>> >> 74090 INFO  
>> >> (recoveryExecutor-3-thread-1-processing-n:172.20.140.177:8983_solr 
>> >> x:instock_shard25_replica2 s:shard25 c:instock r:core_node60) [c:instock 
>> >> s:shard25 r:core_node60 x:instock_shard25_replica2] 
>> >> o.a.s.c.RecoveryStrategy Attempting to PeerSync from 
>> >> [http://172.20.140.170:8983/solr/instock_shard25_replica1/] - 
>> >> recoveringAfterStartup=[true]
>> >> 74091 INFO  
>> >> (recoveryExecutor-3-thread-1-processing-n:172.20.140.177:8983_solr 
>> >> x:instock_shard25_replica2 s:shard25 c:instock r:core_node60) [c:instock 
>> >> s:shard25 r:core_node60 x:instock_shard25_replica2] o.a.s.u.PeerSync 
>> >> PeerSync: core=instock_shard25_replica2 
>> >> url=http://172.20.140.177:8983/solr START 
>> >> replicas=[http://172.20.140.170:8983/solr/instock_shard25_replica1/] 
>> >> nUpdates=100
>> >> 74091 WARN  
>> >> (recoveryExecutor-3-thread-1-processing-n:172.20.140.177:8983_solr 
>> >> x:instock_shard25_replica2 s:shard25 c:instock r:core_node60) [c:instock 
>> >> s:shard25 r:core_node60 x:instock_shard25_replica2] o.a.s.u.PeerSync no 
>> >> frame of reference to tell if we've missed updates
>> >> 74091 INFO  
>> >> (recoveryExecutor-3-thread-1-processing-n:172.20.140.177:8983_solr 
>> >> x:instock_shard25_replica2 s:shard25 c:instock r:core_node60) [c:instock 
>> >> s:shard25 r:core_node60 x:instock_shard25_replica2] 
>> >> o.a.s.c.RecoveryStrategy PeerSync Recovery was not successful - trying 
>> >> replication.
>> >>
>> >> Every single replica shows errors like this (either one or the other).
>> >>
>> >> I should add that, beyond the block joins / nested children & grandchildren, 
>> >> there's really nothing unusual about this cloud at all.  It's a very 
>> >> basic collection (simple enough that it can be created in the GUI) on a stock 
>> >> installation of Solr 6.  There are 3 independent ZooKeeper servers 
>> >> (again, vanilla from dist), and there don't appear to be any ZooKeeper 
>> >> issues.
>> >>
>> >> --
>> >> Steve
>> >>
>> >> On Mon, May 16, 2016 at 12:02 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >> 8 nodes, 4 shards apiece? All in the same JVM? People have gotten past
>> >> the GC pain by running in separate JVMs with less Java memory each on
>> >> big, beefy machines... That's not a recommendation so much as an
>> >> observation.
>> >>
>> >> That aside, unless you have some very strange stuff going on, this is
>> >> totally weird. Are you hitting OOM errors at any time when you have this
>> >> problem? Once you hit an OOM error, all bets are off about how Java
>> >> behaves. If you are hitting those, you can't hope for stability until
>> >> you fix that issue. In your writeup there's some evidence for this,
>> >> when you say that if you index multiple docs at a time you get
>> >> failures.
>> >>
>> >> Do your Solr logs show any anomalies? My guess is that you'll see
>> >> exceptions in your Solr logs that will shed light on the issue.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Mon, May 16, 2016 at 8:03 AM, Stephen Weiss <steve.we...@wgsn.com> wrote:
>> >>> Hi everyone,
>> >>>
>> >>> I'm running into a problem with SolrCloud replicas and thought I would 
>> >>> ask the list to see if anyone else has seen this / gotten past it.
>> >>>
>> >>> Right now, we are running with only one replica per shard.  This is 
>> >>> obviously a problem because if one node goes down anywhere, the whole 
>> >>> collection goes offline, and due to garbage collection issues, this 
>> >>> happens about once or twice a week, causing a great deal of instability. 
>> >>>  If we try to increase to 2 replicas per shard, once we index new 
>> >>> documents and the shards autocommit, the shards all get out of sync with 
>> >>> each other, with different numbers of documents, different numbers of 
>> >>> documents deleted, different facet counts - pretty much totally 
>> >>> divergent indexes.  Shards always show green and available, and never go 
>> >>> into recovery or any other state that would indicate there's a mismatch.  
>> >>> There are also no errors in the logs to indicate anything is going 
>> >>> wrong.  Even long after indexing has finished, the replicas never come 
>> >>> back into sync.  The only way to get consistency again is to delete one 
>> >>> set of replicas and then add them back in.  Unfortunately, when we do 
>> >>> this, we invariably discover that many documents (2-3%) are missing from 
>> >>> the index.
>> >>>
>> >>> We have tried setting the min_rf parameter, and have found that when 
>> >>> setting min_rf=2, we almost never get back rf=2.  We almost always get 
>> >>> rf=1, resend the request, and it basically just goes into an infinite 
>> >>> loop.  The only way to get rf=2 to come back is to only index one 
>> >>> document at a time.  Unfortunately, we have to update millions of 
>> >>> documents a day and it isn't really feasible to index this way, and even 
>> >>> when indexing one document at a time, we still occasionally find 
>> >>> ourselves in an infinite loop.  This doesn't appear to be related to the 
>> >>> documents we are indexing - if we stop the index process and bounce 
>> >>> Solr, the exact same document will go through fine the next time, until 
>> >>> indexing gets stuck on another random document.
>> >>>
>> >>> We have 8 nodes, with 4 shards apiece, all running one collection with 
>> >>> about 900M documents.  An important note is that we have a block join 
>> >>> system with 3 tiers of documents (products -> skus -> sku_history).  
>> >>> During indexing, we are forced to delete all documents for a product 
>> >>> prior to adding the product back into the index, in order to avoid 
>> >>> orphaned children / grandchildren.  All documents are consistently 
>> >>> indexed with the top-level product ID so that we can delete all 
>> >>> child/grandchild documents prior to updating the document.  So, for each 
>> >>> updated document, we are sending through a delete call followed by an 
>> >>> add call.  We have tried putting both the delete and add in the same 
>> >>> update request with the same results.
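>> >>>
>> >>> Concretely, the single-request form we tried is a JSON body along these lines
>> >>> (a sketch only; the host, field names, and child structure here are
>> >>> illustrative, not our exact schema), with the delete-by-query and the re-add
>> >>> of the parent block in one update call:
>> >>>
>> >>> import json
>> >>> import urllib.request
>> >>>
>> >>> body = {
>> >>>     # drop the old parent plus children/grandchildren tagged with the product ID
>> >>>     "delete": {"query": "product_id:12345"},
>> >>>     "add": {"doc": {"id": "12345", "product_id": "12345", "type": "product",
>> >>>                     "_childDocuments_": [
>> >>>                         {"id": "12345-sku1", "product_id": "12345", "type": "sku"}]}},
>> >>> }
>> >>> req = urllib.request.Request(
>> >>>     "http://solr-host:8983/solr/instock/update?wt=json",
>> >>>     data=json.dumps(body).encode("utf-8"),
>> >>>     headers={"Content-Type": "application/json"})
>> >>> print(urllib.request.urlopen(req).read().decode("utf-8"))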
>> >>>
>> >>> All we see out there on Google is that none of what we're seeing should 
>> >>> be happening.
>> >>>
>> >>> We are currently running Solr 6.0 with Zookeeper 3.4.6.  We experienced 
>> >>> the same behavior on 5.4 as well.
>> >>>
>> >>> --
>> >>> Steve
>> >>>