On 8/7/2015 8:56 AM, Davis, Daniel (NIH/NLM) [C] wrote:
> I have an application that knows enough to tell me that a document has been 
> updated, but not which document has been updated.    There aren't that many 
> documents in this core/collection - just a couple of thousand.  So far I've just
> been pumping them all to the update handler every week, but the business folk 
> really want the database and the index to be synchronized when the back-end 
> staff make an update.    As is typical in indexing, updates are more frequent 
> than searches (or at least are expected to be once things pick up - we may
> even reach a whopping 10k documents at some point :))
> 
> Each document has an id I wish to use as the unique ID, but I also want to 
> compute a signature.   Is there some way I can use an 
> updateRequestProcessorChain to throw away a document if its signature and 
> document id match based on real-time get?
> 
> My apologies if this is a duplicate of a prior question - solr-user is fairly
> high traffic.

My main Solr indexes are each generated from a MySQL database.  One
contains over 100 million rows, another over 200 million.  A third
contains about 18 million.  Here's how we handle the requirement you
asked about:

The main table has a delete id (did) column that is its primary key.
This is an autoincrement column.  There is another unique index on a
second column in that table; that column is the canonical unique
identifier, used as Solr's uniqueKey.
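
In schema terms, that arrangement might look like the fragment below.
The field names ("did", "cuid") are illustrative stand-ins, not the
actual schema:

  <!-- did: mirrors the MySQL autoincrement primary key -->
  <field name="did" type="long" indexed="true" stored="true"/>
  <!-- cuid: the canonical identifier, a second unique column in MySQL -->
  <field name="cuid" type="string" indexed="true" stored="true" required="true"/>
  <uniqueKey>cuid</uniqueKey>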

The main table also has triggers for DELETE and UPDATE that add records
to the idx_delete table (containing delete id values) and the
idx_reinsert table (containing unique key values).  These extra tables
each have a primary key on an autoincrement column.

The build program (written in Java using SolrJ) tracks three values for
every update -- the last did value in the main table, and the last id
value in idx_delete and idx_reinsert.

An update cycle (which we run at least once a minute) consists of
reading new records in the idx tables, doing the deletes and reinserts
using the main table identifiers found there, and then indexing new
records from the main table.  In each of those three tables, new records
are identified by looking for rows with a primary key value that's
higher than the last-recorded number.
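
An update cycle along those lines can be sketched in Java.  This is a
minimal in-memory illustration, not the actual build program: the Lists
and the Map stand in for the three MySQL tables and the Solr index, and
every name in it is made up for the sketch.  A real program would read
the tables over JDBC and apply the changes with SolrJ.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch of the delta-update cycle described above.
public class DeltaUpdateSketch {
    public record MainRow(long did, String key) {}       // main table row
    public record DeleteRow(long pk, long did) {}        // idx_delete row
    public record ReinsertRow(long pk, String key) {}    // idx_reinsert row

    // Simulated Solr index: uniqueKey -> did.
    public final Map<String, Long> index = new HashMap<>();

    // Last-recorded primary key value in each of the three tables.
    long lastDid, lastDeletePk, lastReinsertPk;

    public void runCycle(List<MainRow> main, List<DeleteRow> deletes,
                         List<ReinsertRow> reinserts) {
        // 1. New idx_delete rows: remove those documents from the index
        //    (the analogue of a Solr delete-by-query on the did field).
        for (DeleteRow d : deletes) {
            if (d.pk() <= lastDeletePk) continue;   // already processed
            index.values().removeIf(did -> did == d.did());
            lastDeletePk = d.pk();
        }
        // 2. New idx_reinsert rows: re-index those documents from main.
        for (ReinsertRow r : reinserts) {
            if (r.pk() <= lastReinsertPk) continue;
            for (MainRow m : main) {
                if (m.key().equals(r.key())) index.put(m.key(), m.did());
            }
            lastReinsertPk = r.pk();
        }
        // 3. Brand-new main-table rows: index everything with a did
        //    higher than the last one seen.
        for (MainRow m : main) {
            if (m.did() <= lastDid) continue;
            index.put(m.key(), m.did());
            lastDid = m.did();
        }
    }
}
```

Because each pass only looks at primary keys above the last-recorded
values, running the cycle repeatedly is cheap even when nothing changed.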

The build program has a "full rebuild" capability that leverages the
dataimport handler on a set of build cores, which are swapped with the
live cores when the rebuild completes.  If the destination is SolrCloud,
then core swapping won't work, but SolrCloud has the collection alias
feature which can work much the same as core swapping.

This works very well.  There are many additional details to the
implementation, but that's a high-level description of one way to keep a
Solr index in sync with a database.

I don't think I'd bother with the signature requirement you mentioned.
As long as your uniqueKey is properly set up, indexing the same document
again will just replace it in the index, and you won't need to worry
about whether it is exactly the same as the previous version.  If you
actually want to do this, it looks like you were given a method by
Upayavira.
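
For reference, Solr does ship a processor that computes signatures in an
update chain: SignatureUpdateProcessorFactory, part of the deduplication
feature.  The solrconfig.xml sketch below follows the documented example
(the signatureField and fields values are illustrative and would need to
match your schema).  Note that on its own it computes and stores the
signature; it does not skip an unchanged document, which is the part
Upayavira's suggestion would cover:

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signatureField</str>
      <bool name="overwriteDupes">false</bool>
      <str name="fields">name,features,cat</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>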

Thanks,
Shawn
