Erick Erickson - I don't have much time to chase this down. Do you think this is a blocker for 7.6? It seems pretty serious.
Jeremy - This would be a good JIRA to create - we can move the conversation
there to try to get the right people involved.

Kevin Risden

On Fri, Nov 2, 2018 at 7:57 AM Jeremy Smith <jas2...@cornell.edu> wrote:

> Hi Susheel,
>
> Yes, it appears that under certain conditions, if a follower is down
> when the leader gets an update, the follower will not receive that
> update when it comes back (or maybe it receives the update and it's then
> overwritten by its own transaction logs, I'm not sure). Furthermore, if
> that follower then becomes the leader, it will replicate its own
> out-of-date value back to the former leader, even though the version
> number is lower.
>
> -Jeremy
>
> ________________________________
> From: Susheel Kumar <susheel2...@gmail.com>
> Sent: Thursday, November 1, 2018 2:57:00 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud Replication Failure
>
> Are we saying it has something to do with stopping and restarting
> replicas? Otherwise I haven't seen/heard of any issues with document
> updates and forwarding to replicas...
>
> Thanks,
> Susheel
>
> On Thu, Nov 1, 2018 at 12:58 PM Erick Erickson <erickerick...@gmail.com> wrote:
>
> > So this seems like it absolutely needs a JIRA....
> > On Thu, Nov 1, 2018 at 9:39 AM Kevin Risden <kris...@apache.org> wrote:
> > >
> > > I pushed 3 branches that modify test.sh to test 5.5, 6.6, and 7.5
> > > locally without Docker. I still see the same behavior where the
> > > latest updates aren't on the replicas. I still don't know what is
> > > happening but it happens without Docker :(
> > >
> > > https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches
> > >
> > > Kevin Risden
> > >
> > > On Thu, Nov 1, 2018 at 11:41 AM Kevin Risden <kris...@apache.org> wrote:
> > >
> > > > Erick - Yeah, that's a fair point. Would be interesting to see if
> > > > this fails without Docker.
> > > >
> > > > Kevin Risden
> > > >
> > > > On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson <erickerick...@gmail.com> wrote:
> > > >
> > > >> Kevin:
> > > >>
> > > >> You're also using Docker, right? Docker is not "officially"
> > > >> supported, although there's some movement in that direction, and
> > > >> if this is only reproducible in Docker then it's a clue where to
> > > >> look....
> > > >>
> > > >> Erick
> > > >> On Wed, Oct 31, 2018 at 7:24 PM Kevin Risden <kris...@apache.org> wrote:
> > > >> >
> > > >> > I haven't dug into why this is happening but it definitely
> > > >> > reproduces. I removed the local requirements (port mapping and
> > > >> > such) from the gist you posted (very helpful). I confirmed this
> > > >> > fails locally and on Travis CI.
> > > >> >
> > > >> > https://github.com/risdenk/test-solr-start-stop-replica-consistency
> > > >> >
> > > >> > I don't even see the first update getting applied from num 10 -> 20.
> > > >> > After the first update there is no more change.
> > > >> >
> > > >> > Kevin Risden
> > > >> >
> > > >> > On Wed, Oct 31, 2018 at 8:26 PM Jeremy Smith <jas2...@cornell.edu> wrote:
> > > >> >
> > > >> > > Thanks Erick, this is 7.5.0.
> > > >> > > ________________________________
> > > >> > > From: Erick Erickson <erickerick...@gmail.com>
> > > >> > > Sent: Wednesday, October 31, 2018 8:20:18 PM
> > > >> > > To: solr-user
> > > >> > > Subject: Re: SolrCloud Replication Failure
> > > >> > >
> > > >> > > What version of solr? This code was pretty much rewritten in 7.3 IIRC
> > > >> > >
> > > >> > > On Wed, Oct 31, 2018, 10:47 Jeremy Smith <jas2...@cornell.edu> wrote:
> > > >> > >
> > > >> > > > Hi all,
> > > >> > > >
> > > >> > > > We are currently running a moderately large instance of
> > > >> > > > standalone solr and are preparing to switch to solr cloud to
> > > >> > > > help us scale up.
> > > >> > > > I have been running a number of tests using docker locally
> > > >> > > > and ran into an issue where replication is consistently
> > > >> > > > failing. I have pared down the test case as far as I could.
> > > >> > > > Here's a link for the docker-compose.yml (I put it in a
> > > >> > > > directory called solrcloud_simple) and a script to run the
> > > >> > > > test:
> > > >> > > >
> > > >> > > > https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489
> > > >> > > >
> > > >> > > > Here's the basic idea behind the test:
> > > >> > > >
> > > >> > > > 1) Create a cluster with 2 nodes (solr-1 and solr-2), 1 shard,
> > > >> > > > and 2 replicas (each node gets a replica). Just use the
> > > >> > > > default schema, although I've also tried our schema and got
> > > >> > > > the same result.
> > > >> > > >
> > > >> > > > 2) Shut down solr-2
> > > >> > > >
> > > >> > > > 3) Add 100 simple docs, just id and a field called num.
> > > >> > > >
> > > >> > > > 4) Start solr-2 and check that it received the documents. It did!
> > > >> > > >
> > > >> > > > 5) Update a document, commit, and check that solr-2 received
> > > >> > > > the update. It did!
> > > >> > > >
> > > >> > > > 6) Stop solr-2, update the same document, start solr-2, and
> > > >> > > > make sure that it received the update. It did!
> > > >> > > >
> > > >> > > > 7) Repeat step 6 with a new value. This time solr-2 reverts
> > > >> > > > back to what it had in step 5.
> > > >> > > >
> > > >> > > > I believe the main issue comes from this in the logs:
> > > >> > > >
> > > >> > > > solr-2_1 | 2018-10-31 17:04:26.135 INFO (recoveryExecutor-4-thread-1-processing-n:solr-2:8082_solr x:test_shard1_replica_n2 c:test s:shard1 r:core_node4) [c:test s:shard1 r:core_node4 x:test_shard1_replica_n2] o.a.s.u.PeerSync PeerSync: core=test_shard1_replica_n2 url=http://solr-2:8082/solr Our versions are newer. ourHighThreshold=1615861330901729280 otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280 otherHighest=1615861335081353216
> > > >> > > >
> > > >> > > > PeerSync thinks the versions on solr-2 are newer for some
> > > >> > > > reason, so it doesn't try to sync from solr-1. In the final
> > > >> > > > state, solr-2 will always have a lower version for the
> > > >> > > > updated doc than solr-1. I've tried this with different
> > > >> > > > commit strategies, both auto and manual, and it doesn't seem
> > > >> > > > to make any difference.
> > > >> > > >
> > > >> > > > Is this a bug with solr, an issue with using docker, or am I
> > > >> > > > just expecting too much from solr?
> > > >> > > >
> > > >> > > > Thanks for any insights you may have,
> > > >> > > >
> > > >> > > > Jeremy
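
For anyone trying to read the PeerSync log line quoted above, here is a minimal Python sketch that pulls the four version numbers out of the message and compares them. It only does arithmetic on the logged values; it is not Solr's own decision logic. The helper names (parse_versions, approx_time) are made up for illustration, and the timestamp decoding rests on an assumption that is not stated anywhere in this thread: that a version is a millisecond clock value shifted left by 20 bits.

```python
import re
from datetime import datetime, timezone

# Message text copied from the log line in the thread (after "o.a.s.u.PeerSync").
LOG_LINE = (
    "PeerSync: core=test_shard1_replica_n2 url=http://solr-2:8082/solr "
    "Our versions are newer. ourHighThreshold=1615861330901729280 "
    "otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280 "
    "otherHighest=1615861335081353216"
)


def parse_versions(line):
    """Collect every name=<integer> pair from the log message (hypothetical helper)."""
    return {name: int(value) for name, value in re.findall(r"(\w+)=(\d+)", line)}


def approx_time(version):
    """Decode a version to a UTC time, truncated to whole seconds.

    ASSUMPTION: the version is epoch milliseconds shifted left by 20 bits.
    """
    millis = version >> 20
    return datetime.fromtimestamp(millis // 1000, tz=timezone.utc)


versions = parse_versions(LOG_LINE)

# The two comparisons that matter when reading the message.
ours_overlap = versions["ourHighThreshold"] > versions["otherLowThreshold"]
other_is_ahead = versions["otherHighest"] > versions["ourHighest"]

print("ourHighThreshold > otherLowThreshold:", ours_overlap)
print("otherHighest > ourHighest:", other_is_ahead)
for name, value in versions.items():
    print(name, approx_time(value).isoformat())
```

On the logged values both comparisons come out true: solr-2's high threshold is above solr-1's low threshold, yet solr-1's highest version is still above solr-2's highest. That matches the observation in the original message that solr-2 ends up with a lower version for the updated doc. If the shift-by-20 assumption holds, the decoded times fall on the same day as the log entry, a few seconds before it.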