On 8/7/2015 8:56 AM, Davis, Daniel (NIH/NLM) [C] wrote:
> I have an application that knows enough to tell me that a document has
> been updated, but not which document has been updated. There aren't
> that many documents in this core/collection - just a couple of
> thousand. So far I've just been pumping them all to the update handler
> every week, but the business folk really want the database and the
> index to be synchronized when the back-end staff make an update. As is
> typical in indexing, updates are more frequent than searches (or at
> least are expected to be once things pick up - we may even reach a
> whopping 10k documents at some point :))
>
> Each document has an id I wish to use as the unique ID, but I also
> want to compute a signature. Is there some way I can use an
> updateRequestProcessorChain to throw away a document if its signature
> and document id match based on real-time get?
>
> My apologies if this is a duplicate of a prior question - solr-user is
> fairly high traffic.
My main Solr indexes are each generated from a MySQL database. One contains over 100 million rows, another over 200 million, and a third about 18 million. Here's how we handle the requirement you asked about:

The main table has a delete id column as its primary key, which is an autoincrement column. Another column in that table carries a unique index; that column is the canonical unique identifier and is used as Solr's uniqueKey. The main table also has triggers for DELETE and UPDATE which add records to the idx_delete table (containing delete id values) and the idx_reinsert table (containing uniqueKey values). Each of these extra tables has its own primary key on an autoincrement column.

The build program (written in Java using SolrJ) tracks three values after every update -- the last delete id value in the main table, and the last id values in idx_delete and idx_reinsert. An update cycle (which we run at least once a minute) consists of reading the new records in the idx tables, doing the deletes and reinserts using the main-table identifiers found there, and then indexing new records from the main table. In each of those three tables, new records are identified by looking for rows whose primary key value is higher than the last-recorded number.

The build program also has a "full rebuild" capability that leverages the dataimport handler on a set of build cores, which are swapped with the live cores when the rebuild completes. Core swapping won't work if the destination is SolrCloud, but SolrCloud's collection alias feature can serve much the same purpose.

This works very well. There are many additional details to the implementation, but that's a high-level description of one way to keep a Solr index in sync with a database. I don't think I'd bother with the signature requirement you mentioned.
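The high-water-mark bookkeeping described above can be sketched in the build program's own language, Java. This is only a hedged illustration, not the actual implementation: Row and ChangeTracker are hypothetical names, and an in-memory List stands in for a MySQL table with an autoincrement primary key.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal stand-in for one row of an autoincrement-keyed table. */
class Row {
    final long pk;        // autoincrement primary key
    final String value;   // payload: a delete id or a uniqueKey, depending on the table
    Row(long pk, String value) { this.pk = pk; this.value = value; }
}

/** Remembers the last-seen primary key and returns only rows added since. */
class ChangeTracker {
    private long lastSeen = 0;

    List<String> newValues(List<Row> table) {
        List<String> fresh = new ArrayList<>();
        for (Row r : table) {
            if (r.pk > lastSeen) {
                fresh.add(r.value);
                lastSeen = r.pk;   // advance the high-water mark
            }
        }
        return fresh;
    }
}
```

One tracker per table (main, idx_delete, idx_reinsert) would cover the three values mentioned above; each cycle reads the fresh rows and issues the corresponding deletes, reinserts, and adds through SolrJ.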
As long as your uniqueKey is properly set up, indexing the same document again will simply replace it in the index, so you won't need to worry about whether it is exactly the same as the previous version. If you do want the signature check anyway, it looks like Upayavira has already given you a method.

Thanks,
Shawn
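For illustration, one client-side variant of the signature idea (not necessarily the method Upayavira described) is to compute a signature per document and skip unchanged ones before they reach Solr at all. A minimal sketch in Java: SignatureFilter is a hypothetical name, and the Map stands in for a real-time get against the index.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

/** Skips documents whose content signature matches the one already indexed. */
class SignatureFilter {
    // Stand-in for a real-time get against Solr: uniqueKey -> stored signature.
    private final Map<String, String> indexed = new HashMap<>();

    /** MD5 hex digest of the document content. */
    static String signature(String content) {
        try {
            byte[] hash = MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) sb.append(String.format("%02x", b & 0xff));
            return sb.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);   // MD5 is always available
        }
    }

    /** Returns true if the document should be sent to Solr (new or changed). */
    boolean shouldIndex(String id, String content) {
        String sig = signature(content);
        if (sig.equals(indexed.get(id))) return false;   // unchanged: throw away
        indexed.put(id, sig);
        return true;
    }
}
```

In a real build program, the stored signature would come from a real-time get (e.g. SolrJ's SolrClient#getById) rather than a local map, and the signature would be indexed alongside the document.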