Re: REBALANCELEADERS is not reliable

2019-01-21 Thread Bernd Fehling
Hi Erick, patches and the new comments look good. Unfortunately I'm at 6.6.5 and can't test this with my cloud. Replica (o.a.s.common.cloud.Replica) at 6.6.5 is too far away from 7.6 and up. And a backport for 6.6.5 is too much rework, if possible at all. Thanks for solving this issue. Regards, Be

Re: REBALANCELEADERS is not reliable

2019-01-20 Thread Erick Erickson
Bernd: I just committed fixes on SOLR-13091 and SOLR-10935 to the repo, if you wanted to give it a whirl it's ready. By tonight (Sunday) I expect to change the response format a bit and update the ref guide, although you'll have to look at the doc changes in the format. There's a new summary secti

Re: REBALANCELEADERS is not reliable

2019-01-13 Thread Erick Erickson
Bernd: I just attached a patch to https://issues.apache.org/jira/browse/SOLR-13091. It's still rough, the response from REBALANCELEADERS needs quite a bit of work (lots of extra stuff in it now, and no overall verification). I haven't run all the tests, nor precommit. I wanted to get something up

Re: REBALANCELEADERS is not reliable

2019-01-11 Thread Erick Erickson
bq: You have to check if the cores, participating in leadership election, are _really_ in sync. And this must be done before starting any rebalance. Sounds ugly... :-( This _should_ not be necessary. I'll add parenthetically that leader election has been extensively re-worked in Solr 7.3+ though b

Re: REBALANCELEADERS is not reliable

2019-01-11 Thread Bernd Fehling
Hi Erick, yes, I would be happy to test any patches. Good news, I got rebalance working. After running the rebalance about 50 times with debugger and watching the behavior of my problem shard and its core_nodes within my test cloud I came to the point of failure. I solved it and now it works. Bad

Re: REBALANCELEADERS is not reliable

2019-01-10 Thread Erick Erickson
Bernd: Don't feel bad about missing it, I wrote the silly stuff and it took me some time to remember. Those are the rules. It's always humbling to look back at my own code and say "that idiot should have put some comments in here..." ;) yeah, I agree there are a lot of moving parts here. I

Re: REBALANCELEADERS is not reliable

2019-01-10 Thread Bernd Fehling
Hi Erick, that is very valuable info I missed. Shouldn't that go into an issue about reworking REBALANCELEADERS? With your explanation the use of a queue makes sense and now I see some of the logic behind it. - there is the leader and the firstWatcher - if firstWatcher goes down or is inactive t

Re: REBALANCELEADERS is not reliable

2019-01-09 Thread Erick Erickson
Executive summary: The central problem is "how can I insert an ephemeral node in a specific place in a ZK queue". The code could be much, much simpler if there were a reliable way to do just that. I haven't looked at more recent ZKs to see if it's possible, I'd love it if there were a better way.
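
A toy model may help show why "insert at a specific place" is awkward. It relies only on the documented ZooKeeper behavior that sequential nodes receive ever-increasing suffixes, so a newcomer always lands at the tail. The `ElectionQueue` class and `move_behind_leader` helper are hypothetical names invented for this sketch; it is not Solr's implementation, and the requeue-everyone workaround is just one possible approach under these assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ElectionQueue:
    """Toy stand-in for a ZooKeeper election path (hypothetical, not Solr code)."""
    next_seq: int = 0
    nodes: dict = field(default_factory=dict)  # sequence number -> replica name

    def join(self, replica: str) -> int:
        # Like a sequential znode: the suffix only ever grows,
        # so a newly joined replica always lands at the tail.
        seq = self.next_seq
        self.next_seq += 1
        self.nodes[seq] = replica
        return seq

    def leave(self, replica: str) -> None:
        # Like deleting the replica's ephemeral node.
        self.nodes = {s: r for s, r in self.nodes.items() if r != replica}

    def order(self) -> list:
        # Lowest sequence number is the leader, the next one is the first watcher.
        return [self.nodes[s] for s in sorted(self.nodes)]


def move_behind_leader(queue: ElectionQueue, preferred: str) -> list:
    """Make `preferred` the first watcher by requeueing every other waiting replica."""
    leader, *waiting = queue.order()
    if preferred == leader:
        return queue.order()
    # There is no "insert at position 1": the preferred replica rejoins at the
    # tail, then everyone who was waiting ahead of it rejoins behind it.
    queue.leave(preferred)
    queue.join(preferred)
    for replica in waiting:
        if replica != preferred:
            queue.leave(replica)
            queue.join(replica)
    return queue.order()


queue = ElectionQueue()
for name in ["core_node1", "core_node2", "core_node3", "core_node4"]:
    queue.join(name)
new_order = move_behind_leader(queue, "core_node3")
print(new_order)
```

In this model a single reordering touches every waiting replica, which gives a feel for how many moving parts a real rebalance has to coordinate.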

Re: REBALANCELEADERS is not reliable

2019-01-09 Thread Bernd Fehling
Yes, your findings are also very strange. I wonder if we can discover the "inventor" of all this and ask him how it should work or better how he originally wanted it to work. Comments in the code (RebalanceLeaders.java) state that it is possible to have more than one electionNode with the same se

Re: REBALANCELEADERS is not reliable

2019-01-08 Thread Erick Erickson
It's weirder than that. In the current test on master, the assumption is that the node recorded as leader in ZK is actually the leader, see TestRebalanceLeaders.checkZkLeadersAgree(). The theory is that the identified leader node in ZK is actually the leader after the rebalance command. But you're

Re: REBALANCELEADERS is not reliable

2019-01-08 Thread Bernd Fehling
Hi Erick, after some more hours of debugging the rough result is, whoever invented this leader election did not check if an action returns the expected result. There are only checks for exceptions, true/false, new sequence numbers and so on, but never if a leader election to the preferredleader
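
The kind of after-the-fact check described here can be sketched outside Solr. The function below walks a CLUSTERSTATUS-style dictionary and reports every replica that is flagged as preferred leader but is not the leader. The function name and the sample data are hypothetical, and the key names (`property.preferredleader`, `leader`) are an assumption about how the cluster state is laid out; compare them against the output of your own version before relying on this.

```python
def preferred_leaders_not_leading(cluster_status: dict) -> list:
    """Return (collection, shard, replica) for each preferred leader that is not leader.

    Assumes a CLUSTERSTATUS-like layout; key names are assumptions, not verified
    against any particular Solr release.
    """
    problems = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                preferred = str(replica.get("property.preferredleader", "false")).lower() == "true"
                leader = str(replica.get("leader", "false")).lower() == "true"
                if preferred and not leader:
                    problems.append((coll_name, shard_name, replica_name))
    return problems


# Hypothetical sample data: shard1 is fine, shard2 has a preferred leader that lost.
sample_status = {
    "cluster": {
        "collections": {
            "test": {
                "shards": {
                    "shard1": {"replicas": {
                        "core_node1": {"leader": "true", "property.preferredleader": "true"},
                        "core_node2": {},
                    }},
                    "shard2": {"replicas": {
                        "core_node3": {"leader": "true"},
                        "core_node4": {"property.preferredleader": "true"},
                    }},
                }
            }
        }
    }
}

mismatches = preferred_leaders_not_leading(sample_status)
print(mismatches)
```

An empty list would mean every preferred leader actually holds leadership; anything else names the shards where the rebalance did not take effect.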

Re: REBALANCELEADERS is not reliable

2018-12-21 Thread Erick Erickson
I looked at the test last night and it's...disturbing. It succeeds 100% of the time. Manual testing seems to fail very often. Of course it was late and I was a bit cross-eyed, so maybe I wasn't looking at the manual tests correctly. Or maybe the test is buggy. I beasted the test 100x last night an

Re: REBALANCELEADERS is not reliable

2018-12-21 Thread Bernd Fehling
As far as I could see with debugger there is still a problem in requeuing. There is a watcher and it is recognized that the watcher is not a preferredleader. So it tries to locate a preferredleader with success. It then calls makeReplicaFirstWatcher and gets a new sequence number for the preferre

Re: REBALANCELEADERS is not reliable

2018-12-20 Thread Erick Erickson
I'm reworking the test case, so hold off on doing that. If you want to raise a JIRA, though, please do and attach your patch... On Thu, Dec 20, 2018 at 10:53 AM Erick Erickson wrote: > > Nothing that I know of was _intentionally_ changed with this between > 6x and 7x. That said, nothing that I kn

Re: REBALANCELEADERS is not reliable

2018-12-20 Thread Erick Erickson
Nothing that I know of was _intentionally_ changed with this between 6x and 7x. That said, nothing that I know of was done to verify that TLOG and PULL replicas (added in 7x) were handled correctly. There's a test "TestRebalanceLeaders" for this functionality that has run since the feature was put

Re: REBALANCELEADERS is not reliable

2018-12-20 Thread Bernd Fehling
Hi Vadim, I just tried it with 6.6.5. In my test cloud with 5 shards, 5 nodes, 3 cores per node it missed one shard to become leader. But noticed that one shard already was leader. No errors or exceptions in logs. Maybe I should enable debug logging and try again to see all logging messages from

Re: REBALANCELEADERS is not reliable

2018-12-20 Thread Erick Erickson
; nearest release > -- > Vadim > > > -Original Message- > > From: Vadim Ivanov [mailto:vadim.iva...@spb.ntk-intourist.ru] > > Sent: Friday, December 07, 2018 6:13 PM > > To: solr-user@lucene.apache.org > > Subject: RE: REBALANCELEADERS is not reliable >

RE: REBALANCELEADERS is not reliable

2018-12-20 Thread Vadim Ivanov
pb.ntk-intourist.ru] > Sent: Friday, December 07, 2018 6:13 PM > To: solr-user@lucene.apache.org > Subject: RE: REBALANCELEADERS is not reliable > > I'm waiting for 7.6 or 7.5.1 and plan to apply patch from Endika Posadas to > it. > Then test again and hope it'll help

RE: REBALANCELEADERS is not reliable

2018-12-07 Thread Vadim Ivanov
user@lucene.apache.org > Subject: Re: REBALANCELEADERS is not reliable > > Thanks for looking this up. > It could be a hint where to jump into the code. > I wonder why they rejected a jira ticket about this problem? > > Regards, Bernd > > Am 06.12.18 um 16:31 schrieb

Re: REBALANCELEADERS is not reliable

2018-12-07 Thread Bernd Fehling
-Leader-node-deleted-when-rebalancing-leaders-td4417040.html Maybe it will shed some light? -Original Message- From: Atita Arora [mailto:atitaar...@gmail.com] Sent: Thursday, November 29, 2018 11:03 PM To: solr-user@lucene.apache.org Subject: Re: REBALANCELEADERS is not reliable Indeed, I

RE: REBALANCELEADERS is not reliable

2018-12-06 Thread Vadim Ivanov
Thursday, November 29, 2018 11:03 PM > To: solr-user@lucene.apache.org > Subject: Re: REBALANCELEADERS is not reliable > > Indeed, I tried that on 7.4 & 7.5 too, indeed did not work for me as well, > even with the preferredLeader property as recommended in the > documentation.

Re: REBALANCELEADERS is not reliable

2018-11-29 Thread Atita Arora
Indeed, I tried that on 7.4 & 7.5 too, and it did not work for me either, even with the preferredLeader property as recommended in the documentation. I handled it with a little hack but certainly this didn't work as expected. I can provide more details if there's a ticket. On Thu, Nov 29, 2018 at 8

Re: REBALANCELEADERS is not reliable

2018-11-29 Thread Aman Tandon
++ correction On Fri, Nov 30, 2018, 01:10 Aman Tandon For me today, I deleted the leader replica of one of the two shard > collection. Then other replicas of that shard wasn't getting elected for > leader. > > After waiting for long tried the setting addreplicaprop preferred leader > on one of th

Re: REBALANCELEADERS is not reliable

2018-11-29 Thread Aman Tandon
For me today, I deleted the leader replica of one of the two shard collection. Then other replica of that shard was getting elected for leader. After waiting for long tried the setting addreplicaprop preferred leader on one of the replica then tried FORCELEADER but no luck. Then also tried rebalan
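
For readers who want to retry the sequence mentioned here, the two Collections API calls can be sketched as URL builders. ADDREPLICAPROP and REBALANCELEADERS and the parameters shown are taken from the Solr reference guide; the base URL, the collection, shard and replica names, and the helper function names are hypothetical placeholders. The sketch only builds the URLs and does not contact a server.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host, port and context path for your own cloud.
SOLR_BASE = "http://localhost:8983/solr"


def collections_api_url(params: dict) -> str:
    """Build a Collections API URL from a parameter dict (insertion order is kept)."""
    return f"{SOLR_BASE}/admin/collections?{urlencode(params)}"


def add_preferred_leader_url(collection: str, shard: str, replica: str) -> str:
    # Marks one replica as the preferred leader of its shard.
    return collections_api_url({
        "action": "ADDREPLICAPROP",
        "collection": collection,
        "shard": shard,
        "replica": replica,
        "property": "preferredLeader",
        "property.value": "true",
    })


def rebalance_leaders_url(collection: str, max_at_once: int = 1, max_wait_seconds: int = 60) -> str:
    # Asks Solr to make the preferred leaders the actual leaders.
    return collections_api_url({
        "action": "REBALANCELEADERS",
        "collection": collection,
        "maxAtOnce": max_at_once,
        "maxWaitSeconds": max_wait_seconds,
    })


prop_url = add_preferred_leader_url("test", "shard1", "core_node1")
rebalance_url = rebalance_leaders_url("test")
print(prop_url)
print(rebalance_url)
```

As the thread shows, a successful response from these calls does not by itself prove that leadership moved, so checking the cluster state afterwards is still advisable.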

Re: REBALANCELEADERS is not reliable

2018-11-27 Thread Bernd Fehling
Hi Vadim, thanks for confirming. So it seems to be a general problem with Solr 6.x, 7.x and might still be there in the most recent versions. But where to start to debug this problem, is it something not correctly stored in ZooKeeper or is overseer the problem? I was also reading something abou

RE: REBALANCELEADERS is not reliable

2018-11-27 Thread Vadim Ivanov
Hi Bernd, I have tried REBALANCELEADERS with Solr 6.3 and 7.5. I had very similar results and the notion that it's not reliable :( -- Br, Vadim > -Original Message- > From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de] > Sent: Tuesday, November 27, 2018 5:13 PM > To: solr-user@lucene.