In another thread, something was said that sparked my interest:

On 12/1/2011 7:17 PM, Ted Dunning wrote:
Of course, resharding is almost never necessary if you use micro-shards. Micro-shards are shards small enough that you can fit 20 or more on a node. If you have that many on each node, then adding a new node consists of moving some shards to the new machine rather than moving lots of little documents.

Much faster. As in thousands of times faster.

My questions are interspersed with information about my index.

Currently I split my data into shards in two ways. The most recent data (between 3.5 and 7 days' worth, trying to keep it below 500,000 records) goes into one shard. The rest of the data is split using the formula crc32(did) % numShards. The value of numShards is currently six. Each of those large shards has nearly 11 million documents in 20GB of disk space.
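In Java terms, the routing boils down to something like this (a simplified sketch for illustration, not my actual indexing code; the class and method names are made up):

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    // Sketch of the hash-based routing: crc32(did) % numShards picks
    // which of the large shards a document lands in.
    public class ShardRouter {
        private final int numShards;  // currently 6

        public ShardRouter(int numShards) {
            this.numShards = numShards;
        }

        public int shardFor(String did) {
            CRC32 crc = new CRC32();
            crc.update(did.getBytes(StandardCharsets.UTF_8));
            // getValue() returns the unsigned 32-bit checksum in a long,
            // so the modulo result is always non-negative.
            return (int) (crc.getValue() % numShards);
        }
    }

Documents newer than the cutoff skip this and go straight into the shard for recent data.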

I am already using the concept of micro-sharding, but certainly not on a grand scale. One copy of the index is served by two hosts with 8 CPU cores, so each host has three of the large shards. Doing some least common multiple calculations, I have determined that 420 shards would allow me to use the shard-moving method to add one host at a time until I am up to 7 hosts. To reach 8, I would need 840 shards, and to make it to 9 or 10, I would need 2520 shards. A mere 60 shards would let me go up to 5 or 6 hosts.
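Those shard counts come straight from the least common multiples of the possible host counts. A quick sketch of that arithmetic (plain Java, nothing Solr-specific):

    // LCM of 1..n is the smallest shard count that divides evenly
    // across every host count up to n.
    public class ShardLcm {
        static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }
        static long lcm(long a, long b) { return a / gcd(a, b) * b; }

        public static void main(String[] args) {
            long shards = 1;
            for (int hosts = 1; hosts <= 10; hosts++) {
                shards = lcm(shards, hosts);
                System.out.println("up to " + hosts + " hosts: " + shards + " shards");
            }
            // 60 covers 5 or 6 hosts, 420 covers 7, 840 covers 8,
            // and 2520 covers 9 or 10.
        }
    }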

I am curious about the amount of overhead that large numbers of shards would introduce. I already know from experience that when an index is optimized from 20-30 largish segments (initial full index) down to one, it shrinks a little bit. I assume that there would be similar overhead involved in having a lot of shards. Does anyone have a way to estimate how much overhead that would be?

Our search results grids currently show 70 items per page. If someone were to page through the results to page 21, they would be asking for a start value of 1400. With 420 shards, the distributed search would have to deal with 588,000 items. That's a lot of results to merge. The overhead is much smaller with 60 shards, but I've seen searches that indicate some dedicated individuals will delve a lot deeper than 20 pages. How much extra memory does it take when a distributed search has to deal with a million or more results? I've got an 8GB heap for Solr, which has been more than enough for everything except a distributed termsComponent request on my largest field. I don't attempt those any more, because they always require a Solr restart before normal queries will resume.
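For concreteness, this is the back-of-the-envelope arithmetic behind those numbers (a rough sketch; it counts roughly start entries per shard at the coordinating node, which is where the 588,000 figure comes from, and says nothing about the per-entry memory cost, which I don't know):

    // Rough count of how many entries the coordinating node has to
    // merge for a deep-paged distributed search.
    public class DeepPagingCost {
        public static void main(String[] args) {
            int rows = 70;                  // items per results page
            int page = 21;
            int start = (page - 1) * rows;  // 1400

            for (int shards : new int[] {6, 60, 420}) {
                long merged = (long) shards * start;
                System.out.println(shards + " shards: about " + merged
                        + " entries for start=" + start);
            }
            // 420 shards: about 588000 entries for start=1400
        }
    }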

I already have a way to deal with resharding, because I can rebuild one copy of my index with an independent new configuration while the other copy stays completely online. It takes a few hours, of course. Micro-sharding does come with overhead: the index would get larger, and the inherent problems with deep paging in distributed search would be amplified by a large increase in shard count. Are the potential benefits worth incurring that overhead?

Thanks,
Shawn
