bq: you'll end up with N-1 nearly full boxes and 2 half-full boxes.

True, you'd have to repeat the process N times. At that point, though,
as Shawn mentions, it's often easier to just re-index the whole thing.

Do note that one strategy is to create more shards than you need at
the beginning. Say you determine that 10 shards will work fine, but
you expect your corpus to grow by 2x. _Start_ with 20 shards
(multiple shards can be hosted in the same JVM, no problem; see
maxShardsPerNode in the Collections API CREATE action). Then as your
corpus grows you can move the shards to their own boxes.
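
For illustration, an oversharded CREATE call might look roughly like
this (the host, collection name and config name are placeholders;
maxShardsPerNode=4 just tells Solr it may place up to 4 of the 20
shards on a single node):

  http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=20&replicationFactor=1&maxShardsPerNode=4&collection.configName=myconfig

One common way to do the later move is ADDREPLICA on the new box
followed by DELETEREPLICA on the old box once the new replica is
active.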

This just kicks the can down the road, of course: if your corpus
grows by 5x instead of 2x, you're back to this discussion...

Best,
Erick

On Thu, Jan 8, 2015 at 7:08 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 1/8/2015 4:37 AM, Bram Van Dam wrote:
>> Hmm. That is a good point. I wonder if there's some kind of middle
>> ground here? Something that lets me send an update (or new document) to
>> an arbitrary node/shard but which is still routed according to my
>> specific requirements? Maybe this can already be achieved by messing
>> with the routing?
>
> <snip>
>
>> That's fine. We have a lot of query (pre-)processing outside of Solr.
>> It's no problem for us to send a couple of queries to a couple of shards
>> and aggregate the result ourselves. It would, of course, be nice if
>> everything worked in distributed mode, but at least for us it's not an
>> issue. This is a side effect of our complex reporting requirements -- we
>> do aggregation, filtering and other magic on data that is partially in
>> Solr and partially elsewhere.
>
> SolrCloud, when you do fully automatic document routing, does handle
> everything for you.  You can query any node and send updates to any
> node, and they will end up in the right place.  There is currently a
> strong caveat: Indexing performance sucks when updates are initially
> sent to the wrong node.  The performance hit is far larger than we
> expected it to be, so there is an issue in Jira to try and make that
> better.  No visible work has been done on the issue yet:
>
> https://issues.apache.org/jira/browse/SOLR-6717
>
> The Java client (SolrJ, specifically CloudSolrServer) sends all updates
> to the correct nodes, because it can access the clusterstate and knows
> where updates need to go and where the shard leaders are.
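>
> For illustration, a minimal SolrJ sketch (the ZooKeeper hosts,
> collection name and field names are placeholders):
>
>   import org.apache.solr.client.solrj.impl.CloudSolrServer;
>   import org.apache.solr.common.SolrInputDocument;
>
>   public class IndexExample {
>     public static void main(String[] args) throws Exception {
>       // Reads clusterstate from ZooKeeper, so each add() is sent
>       // directly to the correct shard leader.
>       CloudSolrServer server =
>           new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
>       server.setDefaultCollection("mycollection");
>
>       SolrInputDocument doc = new SolrInputDocument();
>       doc.addField("id", "doc-1");
>       doc.addField("title_t", "an example document");
>       server.add(doc);
>       server.commit();
>       server.shutdown();
>     }
>   }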
>
>> This is a very good point. But I don't think SPLITSHARD is the magical
>> answer here. If you have N shards on N boxes, and they are all getting
>> nearly "full" and you decide to split one and move half to a new box,
>> you'll end up with N-1 nearly full boxes and 2 half-full boxes. What
>> happens if the disks fill up further? Do I have to split each shard?
>> That sounds pretty nightmarish!
>
> Planning ahead for growth is critical with SolrCloud, but there is
> something you can do if you discover that you need to radically
> re-shard:  Create a whole new collection with the number of shards you
> want, likely using the original set of Solr servers plus some new ones.
>  Rebuild the index into that collection.  Delete the old collection, and
> create a collection alias pointing the original name at the new
> collection.  The alias will work for both queries and updates.
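>
> For illustration, the Collections API calls for that might look
> roughly like this (names and sizes are placeholders):
>
>   # create the new, larger collection
>   /admin/collections?action=CREATE&name=mycollection_v2&numShards=20&replicationFactor=2&maxShardsPerNode=2&collection.configName=myconfig
>
>   # after rebuilding the index into mycollection_v2:
>   /admin/collections?action=DELETE&name=mycollection
>   /admin/collections?action=CREATEALIAS&name=mycollection&collections=mycollection_v2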
>
> Thanks,
> Shawn
>
