On 10/30/2012 5:05 AM, Dmitry Kan wrote:
> Hi Shawn,
>
> Thanks for sharing your story. Let me see if I got it right:
> How do you keep the incremental shard slim enough over time? Do you
> periodically redistribute the documents from it onto the cold shards?
> If so, how do you do it technically: the low-level Lucene way, or the
> Solr / SolrJ way?
Warning: This email fits nicely into the tl;dr category. I'm including
entirely too much information because I'm not sure which bits you're
really interested in.
My database and Solr index have two fields that contain unique values.
Solr's unique key is what we call the tag_id (alphanumeric), but each
document also has a MySQL autoincrement field called did, for document
id (or possibly delete id), which is a tlong in the Solr schema. The
MySQL primary key is did. I divvy up documents among the six cold
shards by taking MySQL's crc32() hash of the did field, mod six; the
cold shards are numbered 0 through 5. That crc32 hash is not indexed
or stored in Solr, but now that I think about it, perhaps I should add
it to the Solr-specific database view.
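To illustrate the routing rule, here's a minimal sketch in Java. It
assumes MySQL's crc32() and Java's java.util.zip.CRC32 produce the same
value when the hash is computed over the decimal string form of did,
which is how MySQL treats a numeric argument:

import java.util.zip.CRC32;

public class ShardRouter {
    private static final int NUM_COLD_SHARDS = 6;

    // Returns the cold shard number (0-5) for a given did value.
    // MySQL's crc32() hashes the string form of a number, so we
    // hash the same bytes here.
    static int coldShardFor(long did) {
        CRC32 crc = new CRC32();
        crc.update(Long.toString(did).getBytes());
        return (int) (crc.getValue() % NUM_COLD_SHARDS);
    }
}

Documents whose did is above the split point (described next) skip this
routing entirely and go to the hot shard.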
The did field is also where I look for my "split point," which marks
the line between hot and cold. Values less than or equal to the split
point live in the cold shards; values greater than the split point go
in the hot shard.
Once an hour, my SolrJ build system gets MAX(did) from the database and
stores it in a JRobin RRD. Every night, I consult those values and do
document counts against the database to pick a new split point. Then I
index documents between the old split point and the new split point into
the cold shards, and if that succeeds, I delete the same did range from
the hot shard. I wrote all the code that does this using the SolrJ API,
storing persistent values in a MySQL database table. I'm not aware of
any shortcuts I could use.
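In outline, the nightly hand-off looks something like the sketch below.
This is a simplified illustration rather than the production code: the
split-point values and the server URL are stand-ins, and it assumes
SolrJ 4.x's HttpSolrServer:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class NightlySplitMove {
    public static void main(String[] args) throws Exception {
        long oldSplit = 50000000L;  // yesterday's split point (stand-in)
        long newSplit = 50250000L;  // tonight's new split point (stand-in)

        // Step 1 (done elsewhere): index documents with
        // oldSplit < did <= newSplit into the six cold shards.
        // Step 2 runs only if step 1 succeeds.

        // Step 2: delete the same did range from the hot shard.
        HttpSolrServer hot =
            new HttpSolrServer("http://idxserver:8983/solr/hot");
        hot.deleteByQuery("did:[" + (oldSplit + 1) + " TO " + newSplit + "]");
        hot.commit();
        hot.shutdown();
    }
}

Doing the delete only after the cold-shard indexing succeeds means a
failure can briefly leave a document on two shards, but never on none.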
Additional note: Full reindexes are accomplished with the dataimport
handler, using the following SQL query. For the hot shard, I pass in a
modVal of 0,1,2,3,4,5 so that it gets all of the documents in the did range:
SELECT * FROM ${dataimporter.request.dataView}
WHERE (
    (
        did > ${dataimporter.request.minDid}
        AND did <= ${dataimporter.request.maxDid}
    )
    ${dataimporter.request.extraWhere}
) AND (crc32(did) % ${dataimporter.request.numShards})
    IN (${dataimporter.request.modVal})
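For completeness, here's a rough sketch of kicking off such a
full-import from SolrJ. The parameter names match the
${dataimporter.request.*} placeholders above; the view name, did
values, and URL are made up for the example:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class FullReindex {
    public static void main(String[] args) throws Exception {
        ModifiableSolrParams p = new ModifiableSolrParams();
        p.set("command", "full-import");
        // These show up as ${dataimporter.request.*} in the query.
        p.set("dataView", "docView");   // hypothetical view name
        p.set("minDid", "0");           // hypothetical did range
        p.set("maxDid", "50000000");
        p.set("extraWhere", "");
        p.set("numShards", "6");
        p.set("modVal", "2");           // one cold shard; hot gets 0,1,2,3,4,5

        HttpSolrServer server =
            new HttpSolrServer("http://idxserver:8983/solr/shard2");
        QueryRequest req = new QueryRequest(p);
        req.setPath("/dataimport");
        req.process(server);
    }
}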
Back when we first started with Solr 1.4.0, the build system was written
in Perl (LWP::Simple) and did everything but deletes with the dataimport
handler. Deletes were done by query, posting XML to the /update handler.
Thanks,
Shawn