On 10/30/2012 5:05 AM, Dmitry Kan wrote:
> Hi Shawn,
>
> Thanks for sharing your story. Let me get it right:
>
> How do you keep the incremental shard slim enough over time? Do you
> periodically redistribute the documents from it onto the cold shards? If
> so, how do you do it technically: the low-level Lucene way or the Solr /
> SolrJ way?

Warning: This email fits nicely into the tl;dr category. I'm including entirely too much information because I'm not sure which bits you're really interested in.

My database and Solr index have two fields that contain unique values. Solr's unique key is what we call tag_id (alphanumeric), but each document also has a MySQL autoincrement field called did, for document id (or possibly delete id), which is a tlong in the Solr schema. The MySQL primary key is did. I divvy up documents among the six cold shards by taking the crc32 hash (a MySQL function) of the did field, mod 6; the cold shards are numbered 0 through 5. That crc32 hash is not indexed or stored in Solr, but now that I think about it, perhaps I should add it to the Solr-specific database view.
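If it helps, here's a minimal Java sketch of that routing rule (the class and method names are mine, not from the actual build system). It should agree with MySQL's crc32() as long as you hash the decimal string form of did, because MySQL converts a numeric argument to a string before hashing, and both sides use standard CRC-32:

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class ShardRouter {
        private static final int NUM_COLD_SHARDS = 6;

        // crc32(did) % 6, hashing the decimal string form of did so the
        // result matches MySQL's crc32() applied to a numeric column.
        static int coldShardFor(long did) {
            CRC32 crc = new CRC32();
            crc.update(Long.toString(did).getBytes(StandardCharsets.US_ASCII));
            return (int) (crc.getValue() % NUM_COLD_SHARDS);
        }
    }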

The did field is also where I look for my "split point," which marks the line between hot and cold. Values less than or equal to the split point are in the cold shards; values greater than the split point go in the hot shard.

Once an hour, my SolrJ build system gets MAX(did) from the database and stores it in a JRobin RRD. Every night, I consult those values and do document counts against the database to pick a new split point. Then I index the documents between the old split point and the new split point into the cold shards, and if that succeeds, I delete the same did range from the hot shard. I wrote all the code that does this using the SolrJ API, storing persistent values in a MySQL database table. I'm not aware of any shortcuts I could use.
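To make the delete half of that concrete, here's a hedged SolrJ sketch (assuming SolrJ 4.x's HttpSolrServer; the URL and split-point values are placeholders, and in the real system the split points come from the MySQL state table):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class SplitPointMove {
        public static void main(String[] args) throws Exception {
            long oldSplit = 1000000L;  // previous split point (placeholder)
            long newSplit = 1050000L;  // newly chosen split point (placeholder)

            // Step 1 (not shown): index oldSplit < did <= newSplit
            // into the cold shards and confirm it succeeded.

            // Step 2: remove the same did range from the hot shard.
            SolrServer hot = new HttpSolrServer("http://localhost:8983/solr/hot");
            hot.deleteByQuery("did:[" + (oldSplit + 1) + " TO " + newSplit + "]");
            hot.commit();
        }
    }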

Additional note: Full reindexes are accomplished with the dataimport handler, using the following SQL query. For the hot shard, I pass in a modVal of 0,1,2,3,4,5 so that it gets all of the documents in the did range:

        SELECT * FROM ${dataimporter.request.dataView}
        WHERE (
          (
            did > ${dataimporter.request.minDid}
            AND did <= ${dataimporter.request.maxDid}
          )
          ${dataimporter.request.extraWhere}
        ) AND (crc32(did) % ${dataimporter.request.numShards})
          IN (${dataimporter.request.modVal})
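For reference, those ${dataimporter.request.*} values are just extra parameters on the DIH request. A hedged SolrJ sketch of triggering the import looks like this (the URL, view name, and parameter values are examples, not the real ones):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class FullReindex {
        public static void main(String[] args) throws Exception {
            SolrServer shard = new HttpSolrServer("http://localhost:8983/solr/cold0");

            ModifiableSolrParams p = new ModifiableSolrParams();
            p.set("command", "full-import");
            p.set("dataView", "vw_solr");  // example view name
            p.set("minDid", "0");
            p.set("maxDid", "1050000");
            p.set("extraWhere", "");       // empty here; can carry extra conditions
            p.set("numShards", "6");
            p.set("modVal", "0");          // "0,1,2,3,4,5" for the hot shard

            QueryRequest req = new QueryRequest(p);
            req.setPath("/dataimport");
            req.process(shard);
        }
    }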

Back when we first started with Solr 1.4.0, the build system was written in Perl (LWP::Simple) and did everything except deletes with the dataimport handler. Deletes were done by query, using XML posted to the /update handler.
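For anyone unfamiliar with that older approach, the delete-by-query payload is a small XML document (the field and range here are illustrative):

    <delete><query>did:[1000001 TO 1050000]</query></delete>

posted to the core's /update handler, followed by a commit.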

Thanks,
Shawn
