In another thread, something was said that sparked my interest:
On 12/1/2011 7:17 PM, Ted Dunning wrote:
> Of course, resharding is almost never necessary if you use micro-shards.
> Micro-shards are shards small enough that you can fit 20 or more on a
> node. If you have that many on each node, then adding a new node consists
> of moving some shards to the new machine rather than moving lots of little
> documents.
>
> Much faster. As in thousands of times faster.
My questions are interspersed with information about my index.
Currently I split my data into shards in two ways. The most recent data
(between 3.5 and 7 days' worth, which I try to keep below 500,000
records) goes into one shard. The rest of the data is split using the
formula crc32(did) % numShards. The value of numShards is currently six.
Each of those large shards has nearly 11 million documents in 20GB of
disk space.
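Roughly, the routing looks like the Python sketch below. The crc32(did)
% numShards part is exactly what I do; the timestamp handling and the
3.5-day cutoff constant are simplified placeholders for illustration.

    import time
    import zlib

    NUM_SHARDS = 6     # the six large "cold" shards
    HOT_DAYS = 3.5     # newest data stays in the single hot shard

    def pick_shard(did, doc_timestamp, now=None):
        """Return 'hot' for recent documents, else a cold shard number 0-5."""
        now = time.time() if now is None else now
        if now - doc_timestamp < HOT_DAYS * 86400:
            return "hot"
        # crc32 of the document id, modulo the number of cold shards
        return zlib.crc32(str(did).encode()) % NUM_SHARDS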
I am already using the concept of micro-sharding, but certainly not on a
grand scale. One copy of the index is served by two hosts with 8 CPU
cores each, so each host has three of the large shards. Doing some
least-common-multiple calculations, I have determined that 420 shards
would allow me to use the shard-moving method to add one host at a time
until I am up to 7 hosts. To reach 8, I would need 840 shards, and to
make it to 9 or 10, I would need 2520 shards. A mere 60 shards would let
me go up to 5 or 6 hosts.
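The arithmetic behind those numbers is nothing fancy; it is just the
smallest shard count that divides evenly across every host count I want
to support. A quick sketch:

    from functools import reduce
    from math import gcd

    def lcm(a, b):
        return a * b // gcd(a, b)

    def shards_needed(max_hosts):
        """Smallest shard count that splits evenly across 2..max_hosts hosts."""
        return reduce(lcm, range(2, max_hosts + 1))

    for hosts in (6, 7, 8, 10):
        print(hosts, shards_needed(hosts))
    # prints: 6 -> 60, 7 -> 420, 8 -> 840, 10 -> 2520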
I am curious about the amount of overhead that a large number of shards
would introduce. I already know from experience that when an index is
optimized from 20-30 largish segments (the result of an initial full
index) down to one segment, it shrinks a little bit. Since each shard is
its own index with its own segments, I assume that having a lot of
shards would carry similar overhead. Does anyone have a way to estimate
how much overhead that would be?
Our search results grids currently show 70 items per page. If someone
were to page through the results to page 21, they would be asking for a
start value of 1400. With 420 shards, each shard has to return at least
its top 1,400 entries to the coordinating node, so the distributed
search would have to merge something like 588,000 items. That's a lot
of results to deal with. The overhead is much smaller with 60 shards,
but I've seen searches indicating that some dedicated individuals will
delve a lot deeper than 20 pages. How much extra memory does it take
when a distributed search has to deal with a million or more results?
I've got an 8GB heap for Solr, which has been more than enough for
everything except a distributed termsComponent request on my largest
field. I don't attempt those any more; such a request always requires a
Solr restart before normal queries will resume.
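To put rough numbers on the deep-paging cost, here is the
back-of-the-envelope estimate I am working from, as I understand
distributed search. The bytes-per-entry figure is a pure guess on my
part, not something I have measured, and the per-shard count includes
the rows for the requested page, which is why it comes out a bit above
the 588,000 I mentioned:

    ROWS_PER_PAGE = 70
    GUESSED_BYTES_PER_ENTRY = 50   # wild guess: id, score, shard info per entry

    def merge_estimate(page, shards, rows=ROWS_PER_PAGE):
        start = (page - 1) * rows
        # each shard returns roughly its top start+rows ids/scores to be merged
        entries = shards * (start + rows)
        return entries, entries * GUESSED_BYTES_PER_ENTRY

    for shards in (60, 420):
        entries, approx_bytes = merge_estimate(21, shards)
        print(shards, entries, approx_bytes // (1024 * 1024), "MB (very rough)")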
I already have a way to deal with resharding: I can rebuild one copy of
my index with an entirely new configuration while the other copy stays
completely online, though it takes a few hours. There is real overhead
with micro-sharding, however. The total index would get larger, and the
inherent problems with deep paging in distributed search would be
amplified by a large increase in shard count. Are the potential benefits
worth incurring that overhead?
Thanks,
Shawn