On 10/29/2012 7:55 AM, Dmitry Kan wrote:
Hi everyone,
at this year's Berlin Buzz words conference someone (sematext?) have
described a technique of a hot shard. The idea is to have a slim shard to
maximize the update throughput during a day (when millions of docs need to
be posted) and make sure the indexed documents are immediately searchable.
In the end of the day the day's documents are moved to cold shards. If I'm
not mistaken, this was implemented for ElasticSearch. I'm currently
implementing something similar (but pretty tailored to our logical sharding
use case) for Solr (3.x). The feature set looks roughly like this:
1) front end solr (query router) is aware of the hot shard: it directs the
incoming queries to the hot and "cold" shards.
2) new incoming documents are directed first to the hot shard and then
periodically (like once a day or once a week) moved over to the closest in
time cold shard. And for that...
3) hot shard index is being partitioned low level using Lucene's
IndexReader / IndexWriter with the implementation based on [1], [2] and
customized to logical (time-based) sharding.
The question is: is doing index partitioning low-level a good way of
implementing the hot shard concept? That is, is there anything better
operationally-wise from the point of view of disaster recovery / search
cluster support? Am I missing some obvious SOLR-ish solution?
Doing instead the periodical hot shard cleaning and re-posting its source
documents to the closest cold shard is less modular and hence more
complicated operationally for us.
Please let me know, if you need more details or if the problem isn't clear
enough. Thanks.
[1]
http://blog.foofactory.fi/2008/01/regenerating-equally-sized-shards-from.html
[2] https://github.com/HON-Khresmoi/hash-based-index-splitter
This is exactly how I set up my indexing, been that way since early 2010
when we first started using Solr 1.4.0. Now we are on 3.5 and an
upgrade to 4.1 (branch_4x) is in the works. Coming from terminology
used in our previous search product, we call the hot shard an
"incremental" shard. My SolrJ indexing application takes care of all
management of which documents are in the incremental and which documents
are in the large shards. We call the large ones "static" shards because
deletes and the occasional reinsert are the only updates that they
receive, except for the daily distribute process.
We don't do anything to send queries to the hot shard "first" ... it is
simply listed first in the shards parameter on what we call the broker
core. Average response time on the incremental is single digit, the
other shards average at about 30 to 40 milliseconds. Median numbers
(SOLR-1972 patch) are much better.
Thanks,
Shawn