On 5/17/2010 2:40 PM, D C wrote:
We have a large index, separated into multiple shards, that consists of records exported from a database.  One requirement is to support near real-time synchronization with the database.  To accomplish this we are considering creating a "daily" shard where create and update documents (records never get deleted) will be posted and at the end of the day, "empty" the daily shard into the other shards and start afresh the next day.

<snip>
My question is where can I customize the solr code to specify that documents from a particular shard should be given precedence in the search results.  Any pointers would be very much appreciated.


Quick answer: SOLR-1537.  https://issues.apache.org/jira/browse/SOLR-1537

Long answer begins with this: You probably don't need it.

This is exactly how we've got our system arranged, and it has only been in production for a few weeks now. There are six static shards that contain all but the newest content. Another shard, which we call the incremental, holds the most recent data, currently the last three weeks. The incremental shard gets updated every two minutes and optimized once an hour. Deletes are run against all of the shards every ten minutes. To avoid unnecessary cache warming, the delete script checks whether the data to be deleted is actually present in a shard before running the update against it. Once a night, the incremental index is trimmed to three weeks, with the trimmed data being distributed among the other shards, and one static shard gets optimized.
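
To illustrate the check-before-delete idea, here is a minimal sketch in Python. It is not our actual delete script. The Shard class is a hypothetical in-memory stand-in for a Solr shard, and has_any() stands in for a cheap existence query; the commits counter only marks where a real commit (and the cache warming that follows) would happen.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Hypothetical in-memory stand-in for one Solr shard."""
    name: str
    docs: dict = field(default_factory=dict)  # did -> document
    commits: int = 0  # in real Solr, each commit triggers cache warming

    def has_any(self, dids):
        # Stand-in for a cheap existence query against the shard.
        return any(did in self.docs for did in dids)

    def delete(self, dids):
        for did in dids:
            self.docs.pop(did, None)
        self.commits += 1


def run_deletes(shards, dids):
    """Delete dids only from shards that actually hold one of them."""
    touched = []
    for shard in shards:
        if shard.has_any(dids):  # skip shards with nothing to delete
            shard.delete(dids)
            touched.append(shard.name)
    return touched
```

Shards that hold none of the listed documents are never committed, so their caches stay warm.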

We have two unique identifiers in the database for each document. One is an autoincrement field we call did, for document ID. This is the primary key in the database table, but is used only behind the scenes. The other is tag_id, which is the field that a user sees and is the uniqueKey in Solr. When a document is updated, its did will change, but its tag_id will not. Deletes from Solr's perspective are handled by did, not tag_id, and when a document is updated, we treat the old did like any other delete. The new document gets added to our incremental shard very quickly, and a little bit later, the old one is deleted from the static shard that contains it.
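
As a rough sketch of what an update looks like on the wire, the Python below builds the two Solr XML update messages involved: an add for the new version of the document and a delete-by-query on the old did. The field names did and tag_id are ours; the values and the helper names are made up for illustration, and sending the messages to each shard's update handler is left out.

```python
import xml.etree.ElementTree as ET


def add_message(doc):
    """Build an <add> message for the incremental shard."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field_el = ET.SubElement(doc_el, "field", name=name)
        field_el.text = str(value)
    return ET.tostring(add, encoding="unicode")


def delete_by_did_message(old_did):
    """Build a delete-by-query message that removes the superseded did."""
    delete = ET.Element("delete")
    query = ET.SubElement(delete, "query")
    query.text = f"did:{old_did}"
    return ET.tostring(delete, encoding="unicode")


# Hypothetical update: tag_id stays the same, did changes from 1001 to 1002.
new_doc = {"tag_id": "A17", "did": 1002}
print(add_message(new_doc))         # goes to the incremental shard
print(delete_by_did_message(1001))  # goes to the shard holding the old copy
```

Because the delete targets did rather than tag_id, it cannot remove the new version by accident, even if both copies briefly exist.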

The incremental shard is much smaller than the others, so it responds a lot faster. Because it usually answers first, its copy of a document is very likely to take precedence when the same document shows up in more than one shard. For reliability in the event of a hardware problem, we did incorporate the patch from SOLR-1537 into our system. In addition to keeping the index up when a shard goes away, it makes the deduplication order explicit. If you go the route you are planning, it is unlikely you'll need this. I have since added load balancing to my setup, so when we upgrade Solr, this patch will no longer be used.
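
To make "precedence" concrete, here is a simplified model in Python of merging shard responses where the first copy seen for a given uniqueKey wins. This is only a sketch, not Solr's merge code and not the SOLR-1537 patch; it ignores scoring and sorting, and the function name and data layout are assumptions for illustration.

```python
def merge_shard_results(responses):
    """Merge per-shard results, keeping the first copy of each tag_id.

    responses is a list of (shard_name, docs) pairs in precedence order,
    i.e. the order in which the responses are processed.
    """
    seen = {}
    for shard_name, docs in responses:
        for doc in docs:
            key = doc["tag_id"]
            if key not in seen:  # later duplicates are dropped
                seen[key] = dict(doc, shard=shard_name)
    return list(seen.values())


responses = [
    ("incremental", [{"tag_id": "A", "did": 2}]),
    ("static1", [{"tag_id": "A", "did": 1}, {"tag_id": "B", "did": 3}]),
]
merged = merge_shard_results(responses)
```

With the incremental shard processed first, the merged list keeps did 2 for tag_id A and drops the stale did 1 from static1.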

In the absence of a second identifier and SOLR-1537, you could get more deterministic behavior by using the delete mechanism in a slightly different way from mine: add the updated document to your daily/incremental index, then find the old copy in the other shards and delete it. It will mean a cache rewarm when the delete is committed, and I don't know if that will cause problems for your setup.
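
A minimal sketch of that ordering in Python, assuming the uniqueKey is called tag_id as in my setup. The shard names and the holds_doc lookup are hypothetical; in practice the lookup would be a query against each shard, and the steps would be posted to the shards rather than returned.

```python
from xml.sax.saxutils import escape


def plan_update(tag_id, static_shards, holds_doc):
    """Return the ordered steps for one updated document.

    holds_doc(shard, tag_id) is a hypothetical lookup that reports
    whether the given shard currently contains the document.
    """
    # Step 1: the new version goes into the daily/incremental shard first.
    steps = [("daily", "add", tag_id)]
    # Step 2: remove the stale copy from whichever static shard has it.
    delete_xml = f"<delete><id>{escape(tag_id)}</id></delete>"
    for shard in static_shards:
        if holds_doc(shard, tag_id):
            steps.append((shard, "delete", delete_xml))
    return steps


# Hypothetical layout: the old copy of A17 lives in shard2.
locations = {"shard2": {"A17"}}
steps = plan_update(
    "A17",
    ["shard1", "shard2"],
    lambda shard, tag_id: tag_id in locations.get(shard, set()),
)
```

Adding before deleting means the document is never missing from search results; the cost is a short window in which both copies exist.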

Thanks,
Shawn
