On Mon, Mar 21, 2011 at 10:57 AM, Shawn Heisey <s...@elyograg.org> wrote:
> On 3/15/2011 12:54 PM, onlinespend...@gmail.com wrote:
>
>> That's pretty interesting to use the autoincrementing document ID as a way
>> to keep track of what has not been indexed in Solr. And you overwrite this
>> document ID even when you modify an existing document. Very cool. I
>> suppose the number can even rotate back to 0, as long as you handle that.
>
> We use a bigint for the value, and the highest value is currently less than
> 300 million, so we don't expect it to ever rotate around to 0. My build
> system would not be able to handle wraparound without manual intervention.
> If we have that problem, I think we'd have to renumber the entire database
> and reindex.

One solution to reduce the rate at which this number grows would be to store
a "batch ID" rather than a "document ID". If you've just added batch #1428 to
the Solr index, then any new or updated documents in your SQL database would
be assigned #1429. Since you already have a unique tag ID, you may be OK with
a non-unique ID for the sake of keeping track of index updates.

>> I am thinking of using a timestamp to achieve a similar thing. All documents
>> that have been accessed after the last Solr index need to be added to the
>> Solr index. In fact, each name-value pair in Cassandra has a timestamp
>> associated with it, so I'm curious if I could simply use this.
>
> As long as you can guarantee that it's all deterministic and idempotent,
> you can use anything you like. I hope you know what those words mean. :)
> It's important when using timestamps that the system that runs the build
> script is the same one that stores the last-used timestamp. That way you
> are guaranteed that you will never have things getting missed because of
> clock skew.

Yes, that is a concern of mine. If I go with a timestamp I'll certainly need
to pay close attention to things.

>> I'm curious how you handle the delta-imports. Do you have some routine that
>> periodically checks for updates to your MySQL database via the document ID?
>> Which language do you use for that?
>
> The entire build system is written in Perl, where I am comfortable. I even
> wrote an object-oriented module that the scripts share. The update script
> runs every two minutes, from cron, indexing anything with a higher document
> ID than the one recorded during the last successful run. There are some
> other scripts that run on longer intervals and handle things like deletes
> and data redistribution into shards. These scripts kick off the build, then
> use the bare /dataimport URL to track when the import completes and whether
> it's successful.
>
> Thanks,
> Shawn

Thanks for the info. That's very helpful!

Ben