Bernd: I just attached a patch to https://issues.apache.org/jira/browse/SOLR-13091. It's still rough, the response from REBALANCELEADERS needs quite a bit of work (lots of extra stuff in it now, and no overall verification). I haven't run all the tests, nor precommit.
I wanted to get something up so if you have a test environment that you can easily test it in, you'd have an early chance to play with it.

It's against master; I also haven't tried to backport to 8.0 or 7x yet. I doubt it'll be a problem, but if it doesn't apply cleanly let me know.

Best,
Erick

On Fri, Jan 11, 2019 at 8:33 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
> bq: You have to check if the cores, participating in leadership
> election, are _really_ in sync. And this must be done before starting
> any rebalance. Sounds ugly... :-(
>
> This _should_ not be necessary. I'll add parenthetically that leader
> election has been extensively re-worked in Solr 7.3+, though, because
> "interesting" things could happen.
>
> Manipulating the leader election queue is really no different than
> having to deal with, say, someone killing the leader un-gracefully.
> It should "just work". That said, if you're seeing evidence to the
> contrary, that's reality.
>
> What do you mean by "stats", though? It's perfectly ordinary for there
> to be different numbers of _deleted_ documents on various replicas,
> and consequently things like term frequencies and doc frequencies
> being different. What's emphatically _not_ expected is for there to be
> different numbers of "live" docs.
>
> "Making sure nodes are in sync" is certainly an option. That should
> all be automatic if you pause indexing and issue a commit, _then_
> do a rebalance.
>
> I certainly agree that the code is broken and needs to be fixed, but I
> also have to ask how many shards we are talking about here. The code
> was originally written for the case where 100s of leaders could be on
> the same node; until you get to a significant number of leaders on a
> single node (10s at least) there haven't been reliable stats showing
> that it's a performance issue. If you have threshold numbers where
> you've seen it make a material difference, it'd be great to share them.
>
> And I won't be getting back to this until the weekend, other urgent
> stuff has come up...
>
> Best,
> Erick
>
> On Fri, Jan 11, 2019 at 12:58 AM Bernd Fehling
> <bernd.fehl...@uni-bielefeld.de> wrote:
> >
> > Hi Erick,
> > yes, I would be happy to test any patches.
> >
> > Good news, I got rebalance working. After running the rebalance
> > about 50 times with the debugger and watching the behavior of my
> > problem shard and its core_nodes within my test cloud, I came to
> > the point of failure. I solved it and now it works.
> >
> > Bad news, rebalance is still not reliable and there are many more
> > problems and points of failure initiated by rebalanceLeaders, or
> > rather by the re-queueing of the watch list.
> >
> > How I located _my_ problem:
> > The test cloud is 5 servers (VMs), 5 shards, 3 replicas per shard,
> > 1 Java instance per server, plus 3 separate ZooKeepers.
> > My problem: shard2 wasn't willing to rebalance to a specific
> > core_node. The core_nodes involved were core_node1, core_node2 and
> > core_node10; core_node10 was the preferredLeader. It was just moving
> > the leadership between core_node1 and core_node2, back and forth,
> > whenever I called rebalanceLeaders.
> > First step: I stopped the server holding core_node2. Result: the
> > leadership stayed at core_node1 whenever I called rebalanceLeaders.
> > Second step: from the debugger I _forced_ the system, during
> > rebalanceLeaders, to give the leadership to core_node10. Result:
> > there was no leader anymore for that shard. Yes, it can happen:
> > you can end up with a shard having no leader but active core_nodes!!!
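(For concreteness, the operations being exercised above, together with Erick's advice to pause indexing and commit before rebalancing, come down to a handful of Update/Collections API calls. Below is a rough Java sketch; the base URL, collection, shard and replica names are placeholders and not taken from the thread, and, as noted later in the discussion, the REBALANCELEADERS response should still be verified against the cluster state.)

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Sketch of the "commit, mark preferredLeader, rebalance" sequence.
 * Base URL, collection, shard and replica names are placeholders.
 */
public class RebalanceSketch {
    private static final String SOLR = "http://localhost:8983/solr";
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    static String get(String url) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).build();
        return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        String collection = "testcloud";   // placeholder collection name

        // 1. With indexing paused, issue a hard commit so the replicas of a
        //    shard see the same index before leadership is moved.
        get(SOLR + "/" + collection + "/update?commit=true&waitSearcher=true");

        // 2. Mark the replica that should become leader (core_node10 in the
        //    discussion above) with the preferredLeader property.
        get(SOLR + "/admin/collections?action=ADDREPLICAPROP"
                + "&collection=" + collection
                + "&shard=shard2"
                + "&replica=core_node10"
                + "&property=preferredLeader"
                + "&property.value=true");

        // 3. Ask Solr to move leadership to the preferred leaders. As the
        //    thread points out, a "success" status in this response does not
        //    by itself guarantee the leadership actually changed.
        System.out.println(get(SOLR + "/admin/collections?action=REBALANCELEADERS"
                + "&collection=" + collection
                + "&maxWaitSeconds=60"));

        // 4. Check the cluster state rather than trusting the response alone.
        System.out.println(get(SOLR + "/admin/collections?action=CLUSTERSTATUS"
                + "&collection=" + collection));
    }
}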
> > To fix this I was giving preferredLeader to core_node1 and called
> > rebalanceLeaders. After that, preferredLeader was set back to
> > core_node10 and I was back at the point where I started: all calls
> > to rebalanceLeaders kept the leader at core_node1.
> >
> > From the debug logs I got the hint about PeerSync of cores and
> > IndexFingerprint. The stats from my problem core_node10 showed that
> > they differ from leader core_node1. The system notices the
> > difference, starts a PeerSync and ends with success. But actually
> > the PeerSync seems to fail, because the stats of core_node1 and
> > core_node10 still differ afterwards.
> > Solution: I also stopped the server holding my problem core_node10,
> > wiped all data directories and started that server again. The
> > core_nodes were rebuilt from the leader and now they are really in
> > sync. Calling rebalanceLeaders now ended with success and the
> > preferredLeader became leader.
> >
> > My guess: you have to check if the cores participating in leadership
> > election are _really_ in sync. And this must be done before starting
> > any rebalance. Sounds ugly... :-(
> >
> > Next question: why is PeerSync not reporting an error? There is an
> > info about "PeerSync START", "PeerSync Received 0 versions from ...
> > fingerprint:null" and "PeerSync DONE. sync succeeded", but the cores
> > are not really in sync.
> >
> > Another test I did (with my new knowledge about synced cores):
> > - Removing all preferredLeader properties
> > - Stopping, wiping the data directory, and starting all servers one
> >   by one to get all cores of all shards in sync
> > - Setting one preferredLeader for each shard, but different from the
> >   actual leader
> > - Calling rebalanceLeaders succeeded at only 2 shards on the first
> >   run, not for all 5 shards (even with really all cores in sync)
> > - After calling rebalanceLeaders again, the other shards succeeded
> >   as well
> > Result: rebalanceLeaders is still not reliable.
> >
> > I have to mention that I have about 520,000 docs per core in my test
> > cloud and that there might also be a timing issue between calling
> > rebalanceLeaders, detecting that the cores to become leader are not
> > in sync with the actual leader, and resyncing while waiting for the
> > new leader election.
> >
> > So far,
> > Bernd
> >
> >
> > On 10.01.19 at 17:02, Erick Erickson wrote:
> > > Bernd:
> > >
> > > Don't feel bad about missing it, I wrote the silly stuff and it
> > > took me some time to remember.....
> > >
> > > Those are the rules.
> > >
> > > It's always humbling to look back at my own code and say "that
> > > idiot should have put some comments in here..." ;)
> > >
> > > Yeah, I agree there are a lot of moving parts here. I have a note
> > > to myself to provide better feedback in the response. You're
> > > absolutely right that we fire all these commands and hope they all
> > > work. Just returning a "success" status doesn't guarantee a
> > > leadership change.
> > >
> > > I'll be on another task the rest of this week, but I should be
> > > able to dress things up over the weekend. That'll give you a patch
> > > to test if you're willing.
> > >
> > > The actual code changes are pretty minimal, the bulk of the patch
> > > will be the reworked test.
> > >
> > > Best,
> > > Erick
> > >
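(Bernd's "are the cores really in sync?" check can be approximated from outside Solr by asking each replica of the shard for its live-doc count directly, bypassing the distributed query path. A sketch follows; the core URLs are made up, and equal counts are only a necessary condition, since fingerprints and version lists can still disagree, which is what PeerSync is meant to reconcile. In a real check the replica core URLs would come from CLUSTERSTATUS rather than being hard-coded.)

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: compare live-doc counts across the replicas of one shard. */
public class ShardSyncCheck {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // Placeholder core URLs for the three replicas of the problem shard.
    private static final List<String> REPLICA_CORES = List.of(
            "http://host1:8983/solr/testcloud_shard2_replica_n1",
            "http://host2:8983/solr/testcloud_shard2_replica_n2",
            "http://host3:8983/solr/testcloud_shard2_replica_n10");

    private static final Pattern NUM_FOUND = Pattern.compile("\"numFound\":(\\d+)");

    static long liveDocs(String coreUrl) throws Exception {
        // distrib=false keeps the query on this single core instead of
        // fanning out across the collection.
        String url = coreUrl + "/select?q=*:*&rows=0&distrib=false&wt=json";
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).build();
        String body = HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
        Matcher m = NUM_FOUND.matcher(body);
        if (!m.find()) throw new IllegalStateException("no numFound in: " + body);
        return Long.parseLong(m.group(1));
    }

    public static void main(String[] args) throws Exception {
        long first = -1;
        boolean countsMatch = true;
        for (String core : REPLICA_CORES) {
            long docs = liveDocs(core);
            System.out.println(core + " -> " + docs + " live docs");
            if (first < 0) first = docs;
            else if (docs != first) countsMatch = false;
        }
        // Matching counts are necessary but not sufficient for "really in
        // sync"; differing counts are a clear signal not to rebalance yet.
        System.out.println(countsMatch
                ? "live-doc counts match"
                : "replicas differ; do not rebalance yet");
    }
}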