Thanks Shawn, clean way to do it, indeed. And going your route, one
could even copy the existing shards into the new collection and then
delete the data which is getting reindexed on the new nodes. That would
spare reindexing everything.
But in my case, I add boxes after a noticeable performance degradation
due to data volume increase. So the old boxes cannot afford reindexing
data (or deleting if using the proposed variation) in the new collection
while serving searches with the old collection. Unless there is a way to
aggressively bound the RAM consumption of the new collection (disabling
MMAP?), given that it's not being used for search during the transition?
That said, even if that were possible, both collections would compete for
disk I/O.
Thanks,
Damien
On 07/07/2014 12:26 PM, Shawn Heisey wrote:
On 7/7/2014 12:41 PM, Damien Dykman wrote:
I have a cluster of N boxes/nodes and I'd like to add M boxes/nodes
and rebalance data accordingly.
Let's add the following constraints:
- 1. boxes have different characteristics (RAM, CPU, disks)
- 2. different number of shards per box/node (let's pretend we have
found the sweet spot for each box)
- 3. once rebalancing is over, the layout of the cluster should be
the same as if it had been bootstrapped from N+M boxes
Because of the above constraints, shard splitting or moving shards
around is not an option. And to keep the discussion simple, let's
ignore shard replicas.
So far, the best scenario I could think of is the following:
- a. 1 collection on the N nodes using implicit routing
- b. add shards on the M new nodes as part of that collection
- c. reindex a portion of the data on the shards of the M new nodes,
while restricting them from search
- d. in 1 transaction, delete the old data and immediately issue a
soft commit and remove search restrictions
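
For illustration only, steps b and c above might be sketched as follows.
This only builds the request URL and query parameters; it does not talk to
a cluster. It assumes the Collections API CREATESHARD action (available for
collections using implicit routing) with its createNodeSet parameter, and
the `shards` query parameter to keep searches on the old shards during the
transition. The base URL, collection name, shard names, and node name are
all hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical base URL of any node in the cluster.
SOLR_BASE = "http://localhost:8983/solr"


def create_shard_url(collection, shard, node):
    """Step b: URL that adds a named shard on one specific new node."""
    params = {
        "action": "CREATESHARD",
        "collection": collection,
        "shard": shard,
        "createNodeSet": node,  # pin the new shard to this node
    }
    return f"{SOLR_BASE}/admin/collections?{urlencode(params)}"


def search_params(query, searchable_shards):
    """Step c: query parameters that restrict a search to the listed shards."""
    return {"q": query, "shards": ",".join(searchable_shards)}


# Hypothetical layout: shard1 and shard2 are old, shard3 is on a new box.
url = create_shard_url("test", "shard3", "newbox1:8983_solr")
params = search_params("*:*", ["shard1", "shard2"])
```

Once the new shards are populated, step d would amount to deleting the
moved documents from the old shards and dropping the `shards` restriction
from the search requests.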
You may not like this answer, but here's a fairly clean way to do this,
assuming you have enough disk space on the existing machines:
1. Add the new boxes to the cluster.
2. Create a new collection across all the boxes.
2a. If your current collection is named "test" then name the new one
"test0" or something else that's related, but different.
3. Index all data into the new collection.
4. As quickly as possible, do the following actions:
4a. Stop indexing.
4b. Do a synchronization pass on the new collection so it's current.
4c. Delete the original collection.
4d. Create a collection alias so that you can access the new collection
with the original collection name.
4e. Restart indexing.
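
As a rough sketch, the Collections API calls behind steps 2, 4c and 4d could
be built like this. It only constructs the URLs and sends nothing. The
CREATE, DELETE and CREATEALIAS actions are standard Collections API
actions; the base URL and the shard count are assumptions, and parameters
such as replicationFactor or collection.configName are left out for
brevity.

```python
from urllib.parse import urlencode

# Hypothetical base URL of any node in the cluster.
SOLR_BASE = "http://localhost:8983/solr"


def collections_api_url(action, **params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, "wt": "json", **params})
    return f"{SOLR_BASE}/admin/collections?{query}"


def migration_urls(old="test", new="test0", num_shards=4):
    """Return the URLs for steps 2, 4c and 4d, in that order."""
    return [
        # Step 2: create the new collection across all the boxes.
        collections_api_url("CREATE", name=new, numShards=num_shards),
        # Step 4c: delete the original collection.
        collections_api_url("DELETE", name=old),
        # Step 4d: alias the original name to the new collection.
        collections_api_url("CREATEALIAS", name=old, collections=new),
    ]


urls = migration_urls()
```

The alias has to be created after the delete, because the alias takes over
the name of the original collection.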
Thanks,
Shawn