Re: REBALANCELEADERS is not reliable

Erick Erickson Fri, 21 Dec 2018 09:01:14 -0800

I looked at the test last night and it's...disturbing. It succeeds
100% of the time. Manual testing seems to fail very often.
Of course it was late and I was a bit cross-eyed, so maybe
I wasn't looking at the manual tests correctly. Or maybe the
test is buggy.


I beasted the test 100x last night and all of them succeeded.

This was with all NRT replicas.

Today I'm going to modify the test into a stand-alone program
to see if it's something in the test environment that causes
it to succeed. I've got to get this to fail as a unit test before I
have confidence in any fixes, and also confidence that things
like this will be caught going forward.

Erick

On Fri, Dec 21, 2018 at 3:59 AM Bernd Fehling
<bernd.fehl...@uni-bielefeld.de> wrote:
>
> As far as I could see with the debugger, there is still a problem in requeueing.
>
> There is a watcher, and it is recognized that the watcher is not a
> preferredleader.
> So it tries to locate a preferredleader, and succeeds.
> It then calls makeReplicaFirstWatcher and gets a new sequence number for
> the preferredleader replica. But now we have two replicas with the same
> sequence number: one replica which already owns that sequence number, and
> the preferredleader replica, which was given the same number as its new
> sequence number.
> It now tries to solve this with queueNodesWithSameSequence.
> The problem might be something in rejoinElection.
> At least the call to rejoinElection seems right: for the preferredleader,
> rejoinAtHead is true, and for the other replica with the same sequence
> number, rejoinAtHead is false.
>
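
The collision described above can be illustrated with a minimal sketch. It assumes election nodes are named `<sessionId>-<coreNodeName>-n_<sequence>`; the node names in the example queue are made up, and `nodes_sharing_sequence` is a hypothetical helper, not part of Solr. It only groups queue entries by sequence number and reports any number held by more than one replica:

```python
import re
from collections import defaultdict

# Assumed name layout: <sessionId>-<coreNodeName>-n_<sequence>.
_SEQ_SUFFIX = re.compile(r"-n_(\d+)$")


def nodes_sharing_sequence(election_nodes):
    """Return {sequence: [node names]} for sequences held by more than one node."""
    by_sequence = defaultdict(list)
    for name in election_nodes:
        match = _SEQ_SUFFIX.search(name)
        if match is None:
            continue  # not an election node in the assumed format
        by_sequence[int(match.group(1))].append(name)
    return {seq: names for seq, names in by_sequence.items() if len(names) > 1}


# Hypothetical queue: core_node3 was requeued and now shares sequence 1
# with core_node2, which already owned that number.
queue = [
    "1001-core_node1-n_0000000000",
    "1002-core_node2-n_0000000001",
    "1003-core_node3-n_0000000001",
]
collisions = nodes_sharing_sequence(queue)
```

Running a check like this against the election queue before and after the requeue step would show whether the duplicate is ever resolved.
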
> A test case should have 3 shards with 3 cores per shard and should try to
> set preferredleader to different replicas at random. And then try to
> rebalance and check the results.
>
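
The test plan above could be sketched roughly as follows. This is only an outline under assumptions: the shard and replica names are made up, the helper functions are hypothetical, and the Collections API parameter names (`ADDREPLICAPROP`, `REBALANCELEADERS`, `property`, `property.value`) are written from memory and should be checked against the reference guide for the Solr version in use. The sketch builds the request URLs and checks a leader map; it does not talk to a cluster:

```python
import random
from urllib.parse import urlencode


def pick_preferred(shards, rng):
    """Choose one replica per shard at random to carry preferredLeader."""
    return {shard: rng.choice(sorted(replicas)) for shard, replicas in shards.items()}


def add_replica_prop_url(base_url, collection, shard, replica):
    """Build the ADDREPLICAPROP request marking a replica as preferredLeader."""
    params = {
        "action": "ADDREPLICAPROP",
        "collection": collection,
        "shard": shard,
        "replica": replica,
        "property": "preferredLeader",
        "property.value": "true",
    }
    return base_url + "/admin/collections?" + urlencode(params)


def rebalance_url(base_url, collection):
    """Build the REBALANCELEADERS request for a collection."""
    params = {"action": "REBALANCELEADERS", "collection": collection}
    return base_url + "/admin/collections?" + urlencode(params)


def shards_with_wrong_leader(leaders, preferred):
    """Return the shards whose current leader is not the preferred replica."""
    return sorted(s for s, want in preferred.items() if leaders.get(s) != want)


# Hypothetical layout: 3 shards with 3 cores each.
layout = {
    "shard1": ["core_node1", "core_node2", "core_node3"],
    "shard2": ["core_node4", "core_node5", "core_node6"],
    "shard3": ["core_node7", "core_node8", "core_node9"],
}
preferred = pick_preferred(layout, random.Random(42))

# Simulated outcome in which the rebalance missed shard2.
leaders = dict(preferred)
leaders["shard2"] = next(r for r in layout["shard2"] if r != preferred["shard2"])
failed = shards_with_wrong_leader(leaders, preferred)
```

In a real test the `leaders` map would come from the cluster state after the rebalance, and the whole cycle would be repeated with fresh random choices.
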
> So far, regards, Bernd
>
>
> On 21.12.18 at 07:11, Erick Erickson wrote:
> > I'm reworking the test case, so hold off on doing that. If you want to
> > raise a JIRA, though, please do and attach your patch...
> >
> > On Thu, Dec 20, 2018 at 10:53 AM Erick Erickson <erickerick...@gmail.com> 
> > wrote:
> >>
> >> Nothing that I know of was _intentionally_ changed with this between
> >> 6x and 7x. That said, nothing that I know of was done to verify that
> >> TLOG and PULL replicas (added in 7x) were handled correctly. There's a
> >> test "TestRebalanceLeaders" for this functionality that has run since
> >> the feature was put in, but it has _not_ been modified to create TLOG
> >> and PULL replicas and test with those.
> >>
> >> For this patch to be complete, we should either extend that test or
> >> make another that fails without this patch and succeeds with it.
> >>
> >> I'd probably recommend modifying TestRebalanceLeaders to randomly
> >> create TLOG and (maybe) PULL replicas so we'd keep covering the
> >> various cases.
> >>
> >> Best,
> >> Erick
> >>
> >>
> >> On Thu, Dec 20, 2018 at 8:06 AM Bernd Fehling
> >> <bernd.fehl...@uni-bielefeld.de> wrote:
> >>>
> >>> Hi Vadim,
> >>> I just tried it with 6.6.5.
> >>> In my test cloud with 5 shards, 5 nodes, 3 cores per node, it failed to
> >>> switch the leader on one shard, but it did notice that on one shard the
> >>> replica was already leader. No errors or exceptions in the logs.
> >>> Maybe I should enable debug logging and try again to see all the logging
> >>> messages from the patch.
> >>>
> >>> Maybe they also changed other parts between 6.6.5 and 7.6.0, so that
> >>> it works for you.
> >>>
> >>> I also just changed from ZooKeeper 3.4.10 to 3.4.13, which works fine,
> >>> even with the 3.4.10 dataDir. No errors, no complaints. Seems to be compatible.
> >>>
> >>> Regards, Bernd
> >>>
> >>>
> >>> On 20.12.18 at 12:31, Vadim Ivanov wrote:
> >>>> Yes! It works!
> >>>> I have tested RebalanceLeaders today with the patch provided by Endika
> >>>> Posadas
> >>>> (http://lucene.472066.n3.nabble.com/Rebalance-Leaders-Leader-node-deleted-when-rebalancing-leaders-td4417040.html),
> >>>> and at last it works as expected on my collection with 5 nodes and about
> >>>> 400 shards.
> >>>> The original patch was slightly incompatible with 7.6.0.
> >>>> I hope this patch will help others try this feature with 7.6:
> >>>> https://drive.google.com/file/d/19z_MPjxItGyghTjXr6zTCVsiSJg1tN20
> >>>>
> >>>> RebalanceLeaders was not a very useful feature before 7.0 (as all replicas
> >>>> were NRT),
> >>>> but the new replica types made it very helpful for keeping big clusters in
> >>>> order...
> >>>>
> >>>> I wonder why there is no JIRA about this case (or maybe I missed it)?
> >>>> Anyone who cares, please help create a JIRA and improve this feature
> >>>> in the next release.
> >>>>
