On 10/30/2012 5:05 AM, Dmitry Kan wrote:
> Hi Shawn,
>
> Thanks for sharing your story. Let me get it right:
>
> How do you keep the incremental shard slim enough over time? Do you
> periodically redistribute the documents from it onto the cold shards? If
> so, how do you do it technically: the low-level Lucene way or the Solr /
> SolrJ way?

Warning: This email fits nicely into the tl;dr category. I'm including entirely too much information because I'm not sure which bits you're really interested in.

My database and Solr index have two fields that contain unique values. Solr's unique key is what we call tag_id (alphanumeric), but each document also has a MySQL autoincrement field called did, for document id (or possibly delete id), which is a tlong in the Solr schema. The MySQL primary key is did. I divvy up documents among the six cold shards by taking the crc32 hash (a MySQL function) of the did field, mod 6; the cold shards are numbered 0 through 5. That crc32 hash is not indexed or stored in Solr, but now that I think about it, perhaps I should add it to the Solr-specific database view.
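If it helps, here's a minimal Java sketch of that routing rule (the class and method names are mine, not from the actual build system). It should agree with MySQL's crc32() as long as you hash the decimal string form of did, because MySQL converts a numeric argument to a string before hashing, and both sides use standard CRC-32:

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class ShardRouter {
        private static final int NUM_COLD_SHARDS = 6;

        // crc32(did) % 6, hashing the decimal string form of did so the
        // result matches MySQL's crc32() applied to a numeric column.
        static int coldShardFor(long did) {
            CRC32 crc = new CRC32();
            crc.update(Long.toString(did).getBytes(StandardCharsets.US_ASCII));
            return (int) (crc.getValue() % NUM_COLD_SHARDS);
        }
    }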

The did field is also where I look for my "split point," which marks the line between hot and cold. Values less than or equal to the split point are in the cold shards; values greater than the split point go in the hot shard.

Once an hour, my SolrJ build system gets MAX(did) from the database and stores it in a JRobin RRD. Every night, I consult those values and do document counts against the database to pick a new split point. Then I index the documents between the old split point and the new split point into the cold shards, and if that succeeds, I delete the same did range from the hot shard. I wrote all the code that does this using the SolrJ API, storing persistent values in a MySQL database table. I'm not aware of any shortcuts I could use.
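To make the delete half of that concrete, here's a hedged SolrJ sketch (assuming SolrJ 4.x's HttpSolrServer; the URL and split-point values are placeholders, and in the real system the split points come from the MySQL state table):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class SplitPointMove {
        public static void main(String[] args) throws Exception {
            long oldSplit = 1000000L;  // previous split point (placeholder)
            long newSplit = 1050000L;  // newly chosen split point (placeholder)

            // Step 1 (not shown): index oldSplit < did <= newSplit
            // into the cold shards and confirm it succeeded.

            // Step 2: remove the same did range from the hot shard.
            SolrServer hot = new HttpSolrServer("http://localhost:8983/solr/hot");
            hot.deleteByQuery("did:[" + (oldSplit + 1) + " TO " + newSplit + "]");
            hot.commit();
        }
    }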

Additional note: Full reindexes are accomplished with the dataimport handler, using the following SQL query. For the hot shard, I pass in a modVal of 0,1,2,3,4,5 so that it gets all of the documents in the did range:

        SELECT * FROM ${dataimporter.request.dataView}
        WHERE (
          (
            did > ${dataimporter.request.minDid}
            AND did <= ${dataimporter.request.maxDid}
          )
          ${dataimporter.request.extraWhere}
        ) AND (crc32(did) % ${dataimporter.request.numShards})
          IN (${dataimporter.request.modVal})
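For reference, those ${dataimporter.request.*} values are just extra parameters on the DIH request. A hedged SolrJ sketch of triggering the import looks like this (the URL, view name, and parameter values are examples, not the real ones):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class FullReindex {
        public static void main(String[] args) throws Exception {
            SolrServer shard = new HttpSolrServer("http://localhost:8983/solr/cold0");

            ModifiableSolrParams p = new ModifiableSolrParams();
            p.set("command", "full-import");
            p.set("dataView", "vw_solr");  // example view name
            p.set("minDid", "0");
            p.set("maxDid", "1050000");
            p.set("extraWhere", "");       // empty here; can carry extra conditions
            p.set("numShards", "6");
            p.set("modVal", "0");          // "0,1,2,3,4,5" for the hot shard

            QueryRequest req = new QueryRequest(p);
            req.setPath("/dataimport");
            req.process(shard);
        }
    }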

Back when we first started with Solr 1.4.0, the build system was written in Perl (LWP::Simple) and did everything except deletes with the dataimport handler. Deletes were done by query, using XML posted to the /update handler.
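For anyone unfamiliar with that older approach, the delete-by-query payload is a small XML document (the field and range here are illustrative):

    <delete><query>did:[1000001 TO 1050000]</query></delete>

posted to the core's /update handler, followed by a commit.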

Thanks,
Shawn
