In another thread, something was said that sparked my interest:
On 12/1/2011 7:17 PM, Ted Dunning wrote:
> Of course, resharding is almost never necessary if you use micro-shards.
> Micro-shards are shards small enough that you can fit 20 or more on a
> node. If you have that many on each node, then adding a new node consists
> of moving some shards to the new machine rather than moving lots of little
> documents.
>
> Much faster. As in thousands of times faster.
My questions are interspersed with information about my index.
Currently I split my data into shards in two ways. The most recent data
(between 3.5 and 7 days' worth, which I try to keep below 500,000
records) goes into one shard. The rest of the data is split using the
formula crc32(did) % numShards. The value of numShards is currently six.
Each of those large shards has nearly 11 million documents in 20GB of
disk space.
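Roughly, the routing looks like the Python sketch below. The crc32(did)
% numShards part is exactly what I do; the timestamp handling and the
3.5-day cutoff constant are simplified placeholders for illustration.

    import time
    import zlib

    NUM_SHARDS = 6     # the six large "cold" shards
    HOT_DAYS = 3.5     # newest data stays in the single hot shard

    def pick_shard(did, doc_timestamp, now=None):
        """Return 'hot' for recent documents, else a cold shard number 0-5."""
        now = time.time() if now is None else now
        if now - doc_timestamp < HOT_DAYS * 86400:
            return "hot"
        # crc32 of the document id, modulo the number of cold shards
        return zlib.crc32(str(did).encode()) % NUM_SHARDS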
I am already using the concept of micro-sharding, but certainly not on a
grand scale. One copy of the index is served by two hosts with 8 CPU
cores each, so each host has three of the large shards. Doing some
least-common-multiple calculations, I have determined that 420 shards
would allow me to use the shard-moving method to add one host at a time
until I am up to 7 hosts. To reach 8, I would need 840 shards, and to
make it to 9 or 10, I would need 2520 shards. A mere 60 shards would let
me go up to 5 or 6 hosts.
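The arithmetic behind those numbers is nothing fancy; it is just the
smallest shard count that divides evenly across every host count I want
to support. A quick sketch:

    from functools import reduce
    from math import gcd

    def lcm(a, b):
        return a * b // gcd(a, b)

    def shards_needed(max_hosts):
        """Smallest shard count that splits evenly across 2..max_hosts hosts."""
        return reduce(lcm, range(2, max_hosts + 1))

    for hosts in (6, 7, 8, 10):
        print(hosts, shards_needed(hosts))
    # prints: 6 -> 60, 7 -> 420, 8 -> 840, 10 -> 2520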
I am curious about the amount of overhead that a large number of shards
would introduce. I already know from experience that when an index is
optimized from 20-30 largish segments (the result of an initial full
index) down to one segment, it shrinks a little bit. Since each shard is
its own index with its own segments, I assume that having a lot of
shards would carry similar overhead. Does anyone have a way to estimate
how much overhead that would be?
Our search results grids currently show 70 items per page. If someone
were to page through the results to page 21, they would be asking for a
start value of 1400. With 420 shards, each shard has to return at least
its top 1,400 entries to the coordinating node, so the distributed
search would have to merge something like 588,000 items. That's a lot
of results to deal with. The overhead is much smaller with 60 shards,
but I've seen searches indicating that some dedicated individuals will
delve a lot deeper than 20 pages. How much extra memory does it take
when a distributed search has to deal with a million or more results?
I've got an 8GB heap for Solr, which has been more than enough for
everything except a distributed termsComponent request on my largest
field. I don't attempt those any more; such a request always requires a
Solr restart before normal queries will resume.
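To put rough numbers on the deep-paging cost, here is the
back-of-the-envelope estimate I am working from, as I understand
distributed search. The bytes-per-entry figure is a pure guess on my
part, not something I have measured, and the per-shard count includes
the rows for the requested page, which is why it comes out a bit above
the 588,000 I mentioned:

    ROWS_PER_PAGE = 70
    GUESSED_BYTES_PER_ENTRY = 50   # wild guess: id, score, shard info per entry

    def merge_estimate(page, shards, rows=ROWS_PER_PAGE):
        start = (page - 1) * rows
        # each shard returns roughly its top start+rows ids/scores to be merged
        entries = shards * (start + rows)
        return entries, entries * GUESSED_BYTES_PER_ENTRY

    for shards in (60, 420):
        entries, approx_bytes = merge_estimate(21, shards)
        print(shards, entries, approx_bytes // (1024 * 1024), "MB (very rough)")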
I already have a way to deal with resharding: I can rebuild one copy of
my index with an entirely new configuration while the other copy stays
completely online, though it takes a few hours. There is real overhead
with micro-sharding, however. The total index would get larger, and the
inherent problems with deep paging in distributed search would be
amplified by a large increase in shard count. Are the potential benefits
worth incurring that overhead?
Thanks,
Shawn