You can't just add a new core to an existing collection.  You can add the new 
node to the cloud, but it won't be part of any collection.  You're not going to 
be able to just slide it in as a 4th shard of an established collection of 3 
shards.

The root of that comes from routing (I'll assume you use default routing 
rather than any custom routing).  When you index a document into the cloud, 
its unique id is hashed to a number.  If you have 3 shards, then each shard 
owns 1/3 of the range of possible hash values.  Inserts and/or updates for 
the same document have the same id, hash to the same value, and are routed to 
the same shard.
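
To make that concrete, here's a rough Python sketch of hash-range routing in 
the spirit of the default compositeId router.  Solr actually hashes the id 
with MurmurHash3 and keeps each shard's slice of the 32-bit hash space in the 
cluster state; the md5 stand-in and names below are just for illustration.

    import hashlib

    NUM_SHARDS = 3

    def shard_for(doc_id):
        # Map the id to a stable 32-bit number.  (Solr uses MurmurHash3;
        # md5 stands in so this sketch runs with only the stdlib.)
        h = int.from_bytes(hashlib.md5(doc_id.encode("utf-8")).digest()[:4],
                           "big")
        # Bucket the 32-bit space into NUM_SHARDS equal ranges.  The same
        # id always hashes the same way, so an update lands on the shard
        # that holds the original document.
        return h * NUM_SHARDS // (1 << 32)

    for doc_id in ("tweet-1", "tweet-2", "tweet-3"):
        print(doc_id, "-> shard%d" % (shard_for(doc_id) + 1))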

Shard splitting just divides the hash range of the shard in half and copies 
documents to the 2 new shards based upon where their ids' hashes now fall in 
the new ranges.  That's a little easier to manage than the more complex 
process of adding one shard, then adjusting the ranges on all the other 
shards, and then copying the entries that have to move -- all the while 
ensuring that new adds/updates/deletes are routed to the correct location 
based upon whether the original has been copied over to its new range yet, 
yada, yada, yada.  I believe there have been some discussions about adding a 
capability like that to Solr (i.e. adjust shard ranges and have documents 
moved and handled correctly), but I don't think it's even in 5.0.

Now, if you feel the need to go down this path of adding a single shard to a 3 
shard collection, here's something similar.  Add your new Solr node to the 
cloud.  Then create a 1 shard, 2 replica collection called "collectionPart2".  
Also add a query alias "TotalCollection" that points to "collectionPart1" and 
"collectionPart2".  That way a query will get processed by all 4 of your 
shards.  Now this will make indexing more difficult, because you'll have to 
send your new documents to "collectionPart2" until that collection's shard gets 
about as big as the shards in your 3 shard collection.  But some source data 
can be split up like that fairly easily, especially sequential data sources.
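
The setup is two more Collections API calls.  Again a hedged sketch with 
Python "requests" -- the host and config name are placeholders for whatever 
your cloud uses:

    import requests

    base = "http://localhost:8983/solr/admin/collections"

    # New 1 shard, 2 replica collection for the new slice of the index.
    requests.get(base, params={"action": "CREATE",
                               "name": "collectionPart2",
                               "numShards": 1,
                               "replicationFactor": 2,
                               "collection.configName": "myConf",  # placeholder
                               "wt": "json"})

    # Alias spanning both collections, so one query hits all 4 shards.
    requests.get(base, params={"action": "CREATEALIAS",
                               "name": "TotalCollection",
                               "collections": "collectionPart1,collectionPart2",
                               "wt": "json"})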

For example, if indexing twitter or email feeds, you can create a new 
collection with an appropriate shard/replica configuration and feed in a day 
(or month, or whatever) of data.  Then repeat with a new collection for the 
next set.  Keep the query alias updated to span the collections you're 
interested in.
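
Re-issuing CREATEALIAS with the same alias name just redefines it, so rolling 
the window forward is a single call (the collection names here are made up):

    import requests

    # Point the alias at the current window of time-based collections.
    requests.get("http://localhost:8983/solr/admin/collections",
                 params={"action": "CREATEALIAS",
                         "name": "TotalCollection",
                         "collections": "tweets_2015_02,tweets_2015_03",
                         "wt": "json"})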

-----Original Message-----
From: tuxedomoon [mailto:dancolem...@yahoo.com] 
Sent: Friday, February 27, 2015 12:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Does shard splitting double host count

What about adding one new leader/replica pair?  It seems that would entail

a) creating the r3.large instances and volumes
b) adding 2 new Zookeeper hosts?
c) updating my Zookeeper configs (new hosts, new ids, new SOLR config)
d) restarting all ZKs
e) restarting SOLR hosts in the sequence needed for correct shard/replica assignment
f)  start indexing again

So shards 1,2,3 start with 33% of the docs each.  As I start indexing, new 
documents get sharded at 25% per shard.  If I reindex a document that already 
exists in shard2, does it remain in shard2, or could it migrate to another 
shard, thus removing it from shard2?

I'm looking for a migration strategy to achieve 25% of docs per shard.  I 
would also consider deleting docs by date range from shards 1,2,3 and 
reindexing them to redistribute evenly.




