On 5/17/2010 2:40 PM, D C wrote:
We have a large index, separated into multiple shards, that consists of
records exported from a database. One requirement is to support near
real-time synchronization with the database. To accomplish this we are
considering creating a "daily" shard where create and update documents
(records never get deleted) will be posted and at the end of the day,
"empty" the daily shard into the other shards and start afresh the next
day.
<snip>
My question is where can I customize the solr code to specify that
documents from a particular shard should be given precedence in the
search results. Any pointers would be very much appreciated.
Quick answer: SOLR-1537. https://issues.apache.org/jira/browse/SOLR-1537
Long answer begins with this: You probably don't need it.
This is exactly how we've got our system arranged, which has only been
in production for a few weeks now. There are six static shards that
contain all but the newest content. Another shard, which we call the
incremental, holds the most recent data, currently the last three weeks. The
incremental shard gets updated every two minutes and optimized once an
hour. Deletes are run against all of the shards every ten minutes. To
avoid unnecessary cache warming, the delete script checks for the
presence of the deleted data before actually running the update. Once a
night, the incremental index is trimmed to three weeks, with that data
being distributed among the other shards, and one static shard gets
optimized.
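As a rough illustration of the check-before-delete idea described above,
here is a minimal Python sketch. It is not our actual script. FakeShard
is a hypothetical in-memory stand-in for a Solr shard, and its count,
delete_by_did and commit methods are assumptions standing in for a
rows=0 query, a delete, and a commit against a real shard.

```python
class FakeShard:
    """Hypothetical in-memory stand-in for one Solr shard (illustration only)."""

    def __init__(self, docs):
        # Index the documents by did so lookups are cheap.
        self.docs = {doc["did"]: doc for doc in docs}
        self.commits = 0

    def count(self, did):
        # Stands in for a cheap existence query against the shard.
        return 1 if did in self.docs else 0

    def delete_by_did(self, did):
        self.docs.pop(did, None)

    def commit(self):
        # On a real shard, a commit opens a new searcher and warms caches.
        self.commits += 1


def apply_deletes(shard, deleted_dids):
    """Delete only the dids this shard holds; commit only if something changed."""
    present = [did for did in deleted_dids if shard.count(did) > 0]
    for did in present:
        shard.delete_by_did(did)
    if present:
        shard.commit()
    return present
```

A shard that holds none of the deleted dids never sees a commit, so its
caches stay warm.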
We have two unique identifiers in the database for each document. One
is an autoincrement field we call did, for document ID. This is the
primary key in the database table, but is used only behind the scenes.
The other is tag_id, which is the field that a user sees and is the
uniqueKey in Solr. When a document is updated, its did will change, but
its tag_id will not. Deletes from Solr's perspective are handled by
did, not tag_id, and when a document is updated, we treat the old did
like any other delete. The new document gets added to our incremental
shard very quickly, and a little bit later, the old one is deleted from
the static shard that contains it.
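The two-identifier flow can be sketched roughly as follows. This is a
simplified model, not our production code: each shard is represented as
a plain dict keyed by tag_id (the uniqueKey), and the function names
add_document and delete_by_did are made up for the example.

```python
def add_document(incremental, doc):
    """Add a document to the incremental shard.

    The shard is keyed by tag_id, so this only replaces an older copy
    that lives in the same shard. Copies in other shards are untouched.
    """
    incremental[doc["tag_id"]] = doc


def delete_by_did(shards, did):
    """Remove any document with the given did from every shard.

    Returns the number of documents removed.
    """
    removed = 0
    for shard in shards:
        stale = [tag for tag, doc in shard.items() if doc["did"] == did]
        for tag in stale:
            del shard[tag]
            removed += 1
    return removed


# Usage: document T1 is updated, so its did changes from 10 to 42.
static = {"T1": {"tag_id": "T1", "did": 10}}
incremental = {}
add_document(incremental, {"tag_id": "T1", "did": 42})
# For a short while both shards hold a T1. The later delete pass
# removes the old did, which only matches the copy in the static shard.
delete_by_did([static, incremental], 10)
```

Because the delete targets the old did rather than the tag_id, it can
be run against every shard without touching the new copy.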
The incremental shard is much smaller than the others, so it responds a
lot faster. That makes it very likely that its response arrives first
and that its copy of a duplicated document is the one that gets kept.
For reliability reasons in the event of a hardware problem, we did
incorporate the patch from SOLR-1537 into our system, which in addition
to keeping the index up when a shard goes away, makes the deduplication
order explicit. If you go the route you are planning, it is unlikely
you'll need this. I have since added load balancing to my setup, so
when we upgrade Solr, this patch will no longer be used.
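To show why response order matters, here is a toy Python model of
duplicate handling when shard responses are merged. It assumes the
first copy of a uniqueKey that is seen wins, ignores scoring and
sorting entirely, and is not Solr's code or the SOLR-1537 patch. The
names merge_results and merge_with_precedence are invented for this
sketch.

```python
def merge_results(responses):
    """Merge shard responses, keeping the first copy of each tag_id.

    responses is a list of (shard_name, docs) pairs in the order the
    responses are processed.
    """
    seen = set()
    merged = []
    for shard_name, docs in responses:
        for doc in docs:
            if doc["tag_id"] in seen:
                continue  # duplicate uniqueKey: the earlier copy wins
            seen.add(doc["tag_id"])
            merged.append(dict(doc, shard=shard_name))
    return merged


def merge_with_precedence(responses, precedence):
    """Process responses in a fixed shard order instead of arrival order."""
    rank = {name: i for i, name in enumerate(precedence)}
    # Shards not named in precedence are processed last.
    ordered = sorted(responses, key=lambda r: rank.get(r[0], len(rank)))
    return merge_results(ordered)
```

With arrival-order merging the winner depends on which shard answers
first. With a fixed order, listing the incremental shard first means
its copy is always the one kept.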
In the absence of a second identifier and SOLR-1537, you could get more
deterministic behavior by using the delete mechanism in a slightly
different way from mine: add the updated document to your
daily/incremental index, then find the old copy in the other shards and
delete it. It will mean a cache rewarm when the delete is committed,
and I don't know if that will cause problems for your setup.
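A minimal Python sketch of that add-then-delete sequence follows. It
only builds the XML update messages and does not send anything. The
shard names, the plan_update helper, and the tag_id key field are
placeholders for illustration. For simplicity it sends a
delete-by-id to every static shard instead of first finding the one
that holds the old copy.

```python
import xml.etree.ElementTree as ET


def add_message(doc):
    """Build an <add> update message for one document."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def delete_message(unique_key):
    """Build a <delete> update message for one uniqueKey value."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(unique_key)
    return ET.tostring(delete, encoding="unicode")


def plan_update(doc, incremental, static_shards, key_field="tag_id"):
    """Return (shard, message) steps: add to incremental, then delete elsewhere."""
    steps = [(incremental, add_message(doc))]
    for shard in static_shards:
        steps.append((shard, delete_message(doc[key_field])))
    return steps
```

Each delete still needs a commit on the shard that receives it, which
is where the cache rewarm mentioned above comes from.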
Thanks,
Shawn