I looked at the test last night and it's...disturbing. It succeeds 100% of the time. Manual testing seems to fail very often. Of course it was late and I was a bit cross-eyed, so maybe I wasn't looking at the manual tests correctly. Or maybe the test is buggy.
I beasted the test 100x last night and all of them succeeded. This was with all NRT replicas. Today I'm going to modify the test into a stand-alone program to see if it's something in the test environment that causes it to succeed. I've got to get this to fail as a unit test before I have confidence in any fixes, and also confidence that things like this will be caught going forward.

Erick

On Fri, Dec 21, 2018 at 3:59 AM Bernd Fehling <bernd.fehl...@uni-bielefeld.de> wrote:
>
> As far as I could see with the debugger, there is still a problem in
> requeuing.
>
> There is a watcher and it is recognized that the watcher is not a
> preferredleader.
> So it tries to locate a preferredleader, with success.
> It then calls makeReplicaFirstWatcher and gets a new sequence number for
> the preferredleader replica. But now we have two replicas with the same
> sequence number: one replica which already owns that sequence number and
> the replica which got the new (and the same) number as new sequence number.
> It now tries to solve this with queueNodesWithSameSequence.
> Might be something in rejoinElection.
> At least the call to rejoinElection seems right. For preferredleader it
> is true for rejoinAtHead and for the other replica with the same sequence
> number it is false for rejoinAtHead.
>
> A test case should have 3 shards with 3 cores per shard and should try to
> set preferredleader to different replicas at random. And then try to
> rebalance and check the results.
>
> So far, regards, Bernd
>
>
> On 21.12.18 at 07:11, Erick Erickson wrote:
> > I'm reworking the test case, so hold off on doing that. If you want to
> > raise a JIRA, though, please do and attach your patch...
> >
> > On Thu, Dec 20, 2018 at 10:53 AM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >>
> >> Nothing that I know of was _intentionally_ changed with this between
> >> 6x and 7x. That said, nothing that I know of was done to verify that
> >> TLOG and PULL replicas (added in 7x) were handled correctly. There's a
> >> test "TestRebalanceLeaders" for this functionality that has run since
> >> the feature was put in, but it has _not_ been modified to create TLOG
> >> and PULL replicas and test with those.
> >>
> >> For this patch to be complete, we should either extend that test or
> >> make another that fails without this patch and succeeds with it.
> >>
> >> I'd probably recommend modifying TestRebalanceLeaders to randomly
> >> create TLOG and (maybe) PULL replicas so we'd keep covering the
> >> various cases.
> >>
> >> Best,
> >> Erick
> >>
> >>
> >> On Thu, Dec 20, 2018 at 8:06 AM Bernd Fehling
> >> <bernd.fehl...@uni-bielefeld.de> wrote:
> >>>
> >>> Hi Vadim,
> >>> I just tried it with 6.6.5.
> >>> In my test cloud with 5 shards, 5 nodes, 3 cores per node it missed
> >>> one shard to become leader. But noticed that one shard already was
> >>> leader. No errors or exceptions in logs.
> >>> Maybe I should enable debug logging and try again to see all logging
> >>> messages from the patch.
> >>>
> >>> Maybe they also changed other parts between 6.6.5 and 7.6.0 so that
> >>> it works for you.
> >>>
> >>> I also just changed from zookeeper 3.4.10 to 3.4.13 which works fine,
> >>> even with the 3.4.10 dataDir. No errors, no complaints. Seems to be
> >>> compatible.
> >>>
> >>> Regards, Bernd
> >>>
> >>>
> >>> On 20.12.18 at 12:31, Vadim Ivanov wrote:
> >>>> Yes! It works!
> >>>> I have tested RebalanceLeaders today with the patch provided by Endika
> >>>> Posadas.
> >>>> (http://lucene.472066.n3.nabble.com/Rebalance-Leaders-Leader-node-deleted-when-rebalancing-leaders-td4417040.html)
> >>>> And at last it works as expected on my collection with 5 nodes and about
> >>>> 400 shards.
> >>>> Original patch was slightly incompatible with 7.6.0
> >>>> I hope this patch will help to try this feature with 7.6
> >>>> https://drive.google.com/file/d/19z_MPjxItGyghTjXr6zTCVsiSJg1tN20
> >>>>
> >>>> RebalanceLeaders was not a very useful feature before 7.0 (as all
> >>>> replicas were NRT)
> >>>> But new replica types made it very helpful to keep big clusters in
> >>>> order...
> >>>>
> >>>> I wonder why there is no JIRA about this case (or maybe I missed
> >>>> it)?
> >>>> Anyone who cares, please help to create a JIRA and improve this feature
> >>>> in the nearest release
> >>>>
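
For readers following the requeuing discussion, the steps Bernd describes in his Dec 21 message (the preferred replica takes the first watcher's sequence number, and the replica that shared that number rejoins at the tail) and the test shape he proposes (3 shards, 3 cores per shard, random preferredleader, rebalance, check) can be sketched as a toy model. This is not Solr code: the class `ElectionQueue`, its methods, the replica names, and the assumption that the old leader simply rejoins at the tail are all hypothetical and exist only to illustrate the invariant a test might check (the preferred replica ends up leader and no two replicas keep the same sequence number).

```python
import random


class ElectionQueue:
    """Toy election queue: the replica with the lowest sequence number leads.

    Hypothetical model for illustration only; not Solr's implementation.
    """

    def __init__(self, replicas):
        self.seq = {name: i for i, name in enumerate(replicas)}
        self.next_seq = len(replicas)

    def order(self):
        # Replicas sorted by sequence number; index 0 is the leader.
        return sorted(self.seq, key=lambda r: self.seq[r])

    def leader(self):
        return self.order()[0]

    def rejoin_at_tail(self, replica):
        # Analogue of rejoining with rejoinAtHead == false: fresh, highest number.
        self.seq[replica] = self.next_seq
        self.next_seq += 1

    def duplicates(self):
        # Groups of replicas that currently share a sequence number.
        by_seq = {}
        for replica, s in self.seq.items():
            by_seq.setdefault(s, []).append(replica)
        return [group for group in by_seq.values() if len(group) > 1]

    def make_preferred_leader(self, preferred):
        if self.leader() == preferred:
            return
        order = self.order()
        old_leader, first_watcher = order[0], order[1]
        if first_watcher != preferred:
            # Preferred replica takes over the first watcher's number
            # (analogue of rejoinAtHead == true); two replicas briefly share it.
            self.seq[preferred] = self.seq[first_watcher]
            # The displaced replica with the same number requeues at the tail.
            self.rejoin_at_tail(first_watcher)
        # Assumption: the old leader steps down by rejoining at the tail.
        self.rejoin_at_tail(old_leader)


def run_trial(rng, shards=3, replicas_per_shard=3, rounds=10):
    """Randomly pick preferred leaders per shard and check the outcome."""
    for s in range(shards):
        names = [f"shard{s}_replica{r}" for r in range(replicas_per_shard)]
        queue = ElectionQueue(names)
        for _ in range(rounds):
            preferred = rng.choice(names)
            queue.make_preferred_leader(preferred)
            if queue.leader() != preferred or queue.duplicates():
                return False
    return True
```

Calling `run_trial(random.Random(seed))` for a range of seeds mirrors the "beasting" idea: a real test would have to fail on some seed without the fix and pass on all of them with it.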