Hi Erick,

yes, I would be happy to test any patches.

Good news: I got rebalance working. After running the rebalance about 50 times with the debugger and watching the behavior of my problem shard and its core_nodes within my test cloud, I found the point of failure. I solved it, and now it works.
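(Just to make clear what "calling rebalanceLeaders" means in the steps below: I drive the documented Collections API actions ADDREPLICAPROP and REBALANCELEADERS from a small test client, roughly like the sketch here. Host, port, collection and replica names are placeholders for my test cloud, not anything you need to match.)

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Minimal sketch of how the tests are driven via the documented
 * Collections API. Host, port, collection and replica names below
 * are placeholders for my test cloud.
 */
public class RebalanceDriver {

    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final String BASE = "http://localhost:8983/solr/admin/collections";

    // Mark one replica of a shard as preferredLeader (ADDREPLICAPROP).
    static String setPreferredLeader(String collection, String shard, String replica) throws Exception {
        String url = BASE + "?action=ADDREPLICAPROP"
                + "&collection=" + collection
                + "&shard=" + shard
                + "&replica=" + replica
                + "&property=preferredLeader"
                + "&property.value=true";
        return get(url);
    }

    // Ask Solr to move leadership to the preferredLeader replicas (REBALANCELEADERS).
    static String rebalanceLeaders(String collection) throws Exception {
        return get(BASE + "?action=REBALANCELEADERS&collection=" + collection + "&wt=json");
    }

    private static String get(String url) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Example run against my test collection; the names are made up here.
        System.out.println(setPreferredLeader("testcloud", "shard2", "core_node10"));
        System.out.println(rebalanceLeaders("testcloud"));
    }
}
```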
Bad news: rebalance is still not reliable, and there are many more problems and points of failure triggered by rebalanceLeaders, or rather by the re-queueing of the watch list.

How I located _my_ problem: the test cloud is 5 servers (VMs), 5 shards, 3 replicas per shard, 1 Java instance per server, and 3 separate ZooKeepers. My problem: shard2 was not willing to rebalance to a specific core_node. The core_nodes involved were core_node1, core_node2 and core_node10, with core_node10 as preferredLeader. Leadership just changed back and forth between core_node1 and core_node2 whenever I called rebalanceLeaders.

First step: I stopped the server holding core_node2. Result: leadership stayed at core_node1 whenever I called rebalanceLeaders.

Second step: from the debugger I _forced_ the system, during rebalanceLeaders, to give leadership to core_node10. Result: there was no leader anymore for that shard. Yes, it can happen: you can end up with a shard that has no leader but active core_nodes!!!

To fix this I gave preferredLeader to core_node1 and called rebalanceLeaders. After that, preferredLeader was set back to core_node10 and I was back where I started: all calls to rebalanceLeaders kept the leader at core_node1.

From the debug logs I got the hint about PeerSync of cores and IndexFingerprint. The stats of my problem core_node10 showed that it differs from leader core_node1. The system notices the difference, starts a PeerSync and reports success, but the PeerSync actually seems to fail, because the stats of core_node1 and core_node10 still differ afterwards.

Solution: I also stopped the server holding my problem core_node10, wiped all data directories and started that server again. The core_nodes were rebuilt from the leader and are now really in sync. Calling rebalanceLeaders now succeeded and moved leadership to the preferredLeader.

My guess: you have to check whether the cores participating in the leader election are _really_ in sync, and this must be done before starting any rebalance. Sounds ugly... :-( (A rough sketch of the kind of check I mean is further below.)

Next question: why is PeerSync not reporting an error? There is an INFO about "PeerSync START", "PeerSync Received 0 versions from ... fingerprint:null" and "PeerSync DONE. sync succeeded", but the cores are not really in sync.

Another test I did (with my new knowledge about synced cores):
- removing all preferredLeader properties
- stopping, wiping the data directory, and starting all servers one by one to get all cores of all shards in sync
- setting one preferredLeader per shard, but different from the actual leader
- calling rebalanceLeaders: on the first run it succeeded for only 2 of the 5 shards (even with really all cores in sync)
- after calling rebalanceLeaders again, the other shards succeeded as well

Result: rebalanceLeaders is still not reliable.

I have to mention that I have about 520,000 docs per core in my test cloud and that there might also be a timing issue between calling rebalanceLeaders, detecting that the core which should become leader is not in sync with the actual leader, and resyncing while waiting for the new leader election.
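The kind of pre-check I have in mind could be as simple as comparing the non-distributed document counts of all replicas of a shard before rebalancing. This is only a crude heuristic, not the IndexFingerprint comparison that PeerSync does internally, and equal counts of course do not prove the cores are identical; the core URLs are placeholders I would normally take from CLUSTERSTATUS.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Rough sketch of a pre-rebalance sanity check: compare the
 * non-distributed document counts of all replicas of one shard.
 * This is only a crude heuristic, not the IndexFingerprint check
 * PeerSync uses; the core base URLs are placeholders.
 */
public class ShardSyncCheck {

    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // numFound of a single core, queried with distrib=false so only that core answers.
    static long numDocs(String coreBaseUrl) throws Exception {
        String url = coreBaseUrl + "/select?q=*:*&rows=0&distrib=false&wt=json";
        String body = HTTP.send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString()).body();
        Matcher m = Pattern.compile("\"numFound\"\\s*:\\s*(\\d+)").matcher(body);
        if (!m.find()) throw new IllegalStateException("no numFound in response from " + coreBaseUrl);
        return Long.parseLong(m.group(1));
    }

    public static void main(String[] args) throws Exception {
        // Replica cores of the problem shard (placeholder URLs, normally read from CLUSTERSTATUS).
        List<String> replicas = List.of(
                "http://server1:8983/solr/testcloud_shard2_replica_n1",
                "http://server2:8983/solr/testcloud_shard2_replica_n2",
                "http://server5:8983/solr/testcloud_shard2_replica_n10");

        long reference = numDocs(replicas.get(0));
        for (String core : replicas) {
            long n = numDocs(core);
            System.out.printf("%s -> %d docs%s%n", core, n, n == reference ? "" : "  <-- differs!");
        }
    }
}
```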
So far,
Bernd

On 10.01.19 at 17:02, Erick Erickson wrote:
> Bernd:
>
> Don't feel bad about missing it, I wrote the silly stuff and it took me some time to remember... Those are the rules. It's always humbling to look back at my own code and say "that idiot should have put some comments in here..." ;)
>
> Yeah, I agree there are a lot of moving parts here. I have a note to myself to provide better feedback in the response. You're absolutely right that we fire all these commands and hope they all work; just returning a "success" status doesn't guarantee the leadership change.
>
> I'll be on another task the rest of this week, but I should be able to dress things up over the weekend. That'll give you a patch to test if you're willing. The actual code changes are pretty minimal; the bulk of the patch will be the reworked test.
>
> Best,
> Erick