Hi Shawn,

Too much technical detail is often better than too little, to my taste at least :-)

Your approach to sharding is apparently hash based, and that's why you need to maintain the doc did values that come from MySQL in a separate storage and decide on a split point. That's totally legit. And I must say, it is quite an interesting approach (makes me think of trying something like this when time permits).
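Just to check my understanding of the routing itself: if I read your description below correctly, the per-document shard choice boils down to something like this sketch (plain Java with invented class and shard names; only the crc32 mod and the split point comparison come from your mail):

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ShardPicker {

    private final long splitPoint;    // dids above this live in the hot shard
    private final int numColdShards;  // e.g. 6, numbered cold0 .. cold5

    public ShardPicker(long splitPoint, int numColdShards) {
        this.splitPoint = splitPoint;
        this.numColdShards = numColdShards;
    }

    public String shardFor(long did) {
        if (did > splitPoint) {
            return "hot";
        }
        // MySQL's crc32() hashes the decimal string form of the number,
        // so hash the same string here to stay consistent with the SQL side.
        CRC32 crc = new CRC32();
        crc.update(Long.toString(did).getBytes(StandardCharsets.UTF_8));
        return "cold" + (crc.getValue() % numColdShards);
    }
}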
There is another way of looking at sharding: time-based sharding. If your documents (like in our case) are attributed to some point in time, they belong to their "time" shards. In this case the hot shard is always the youngest shard in the cluster and slides forward in time. Much like you pick a new split point every night and reindex, in time-based sharding the documents that have become too old for the hot shard (sort of not hot anymore, that is, not searched for as often) can join the cold shards. The cold shards in this scheme are therefore always growing, as the hot shard keeps pushing more and more documents into them, while the hot shard itself stays slim enough to sustain a heavy update and search rate (hot documents are searched for more often than cold ones).

My original question was, basically, what technique to use in order to physically move hot shard documents to cold shards. Because we want to decouple the document processing backend (our docs are not stored in any DB, they are real textual documents), and because our cold shards are very big (~60GB easily), we wanted to find a solution that wouldn't require reindexing or couple the document processing backend with Solr too tightly. This seems to be possible with low-level Lucene index partitioning (see some links in my first post); I put a rough sketch of the SolrJ alternative in the P.S. below to show why we would rather avoid it.

Dmitry
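P.S. The "Solr / SolrJ way" of moving documents between shards, as I understand it, would be roughly the sketch below (core URLs, field names and the did range are invented for illustration). It only works when every field is stored, and it is effectively a reindex of the moved range, which is exactly the cost we are trying to avoid on 60GB shards:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class HotToColdMover {

    public static void main(String[] args) throws Exception {
        HttpSolrServer hot  = new HttpSolrServer("http://localhost:8983/solr/hot");
        HttpSolrServer cold = new HttpSolrServer("http://localhost:8983/solr/cold0");

        long oldSplit = 1000000L;  // previous split point
        long newSplit = 1200000L;  // new split point picked tonight
        String didRange = "did:[" + (oldSplit + 1) + " TO " + newSplit + "]";

        // 1. Read the documents that are no longer "hot".
        //    Relies on all fields being stored; paging is omitted in this sketch.
        SolrQuery query = new SolrQuery(didRange);
        query.setRows(100000);
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (SolrDocument doc : hot.query(query).getResults()) {
            SolrInputDocument in = new SolrInputDocument();
            for (String field : doc.getFieldNames()) {
                if ("_version_".equals(field)) {
                    continue;  // let the target shard assign its own version
                }
                in.addField(field, doc.getFieldValue(field));
            }
            batch.add(in);
        }

        // 2. Index them into the cold shard first ...
        cold.add(batch);
        cold.commit();

        // 3. ... and only then delete the same did range from the hot shard.
        hot.deleteByQuery(didRange);
        hot.commit();
    }
}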
On Wed, Oct 31, 2012 at 7:50 AM, Shawn Heisey <s...@elyograg.org> wrote:

> On 10/30/2012 5:05 AM, Dmitry Kan wrote:
>
>> Hi Shawn,
>>
>> Thanks for sharing your story. Let me get it right:
>>
>> How do you keep the incremental shard slim enough over time, do you
>> periodically redistribute the documents from it onto cold shards? If yes,
>> how technically you do it: the Lucene low-level way or Solr / SolrJ way?
>
> Warning: This email fits nicely into the tl;dr category. I'm including
> entirely too much information because I'm not sure which bits you're really
> interested in.
>
> My database and Solr index have two fields that contain unique values.
> Solr's unique key is what we call the tag_id (alphanumeric), but each
> document also has a MySQL autoincrement field called did, for document id,
> or possibly delete id, which is a tlong in the Solr schema. The MySQL
> primary key is did. I divvy up documents among the six cold shards by a
> mod on the crc32 hash (MySQL function) of the did field, my cold shards are
> numbered 0 through 5. That crc32 hash is not indexed or stored in Solr,
> but now that I think about it, perhaps I should add it to the Solr-specific
> database view.
>
> The did field is also where I look for my "split point" which marks the
> line between hot and cold. Values less than or equal to the split point
> are in cold shards, values greater than the split point go in the hot shard.
>
> Once an hour, my SolrJ build system gets MAX(did) from the database and
> stores it in a JRobin RRD. Every night, I consult those values and do
> document counts against the database to pick a new split point. Then I
> index documents between the old split point and the new split point into
> the cold shards, and if that succeeds, I delete the same DID range from the
> hot shard. I wrote all the code that does this using the SolrJ API,
> storing persistent values in a MySQL database table. I'm not aware of any
> shortcuts I could use.
>
> Additional note: Full reindexes are accomplished with the dataimport
> handler, using the following SQL query. For the hot shard, I pass in a
> modVal of 0,1,2,3,4,5 so that it gets all of the documents in the did range:
>
> SELECT * FROM ${dataimporter.request.dataView}
> WHERE (
>   (
>     did > ${dataimporter.request.minDid}
>     AND did <= ${dataimporter.request.maxDid}
>   )
>   ${dataimporter.request.extraWhere}
> ) AND (crc32(did) % ${dataimporter.request.numShards})
> IN (${dataimporter.request.modVal})
>
> Back when we first started with Solr 1.4.0, the build system was written
> in Perl (LWP::Simple) and did everything but deletes with the dataimport
> handler. Deletes were done by query using xml and the /update handler.
>
> Thanks,
> Shawn

--
Regards,
Dmitry Kan