On 9/22/2010 1:39 AM, Shashikant Kore wrote:
Hi,

I'm using DIH to index records from a database. After every update on the
(MySQL) DB, Solr's DIH is invoked for a delta import. In my tests, I have
observed that if DB updates and a DIH import happen concurrently, the import
misses a few records.

Here is how it happens.

The table has a column 'lastUpdated' whose default value is the current
timestamp. Many records are added to the database in a single transaction that
takes several seconds. For example, if 10,000 rows are being inserted, the
rows may get timestamp values from '2010-09-20 18:21:20' to '2010-09-20
18:21:26'. These rows become visible only after the transaction is committed.
That happens at, say, '2010-09-20 18:21:30'.

If a Solr import gets triggered at '18:21:29', it will use the timestamp of
the last import for the delta query. This import will not see the records
added in the aforementioned transaction, as the transaction had not been
committed at that instant. After this import, dataimport.properties will
record the last index time as '18:21:29'. The next import will not be able to
get all the rows of the previously mentioned transaction, as some of the rows
have timestamps earlier than '18:21:29'.
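The race can be sketched in a few lines of Python (times reduced to seconds past 18:21:00; the row layout is illustrative only, not actual Solr or MySQL behavior):

```python
# Sketch of the race: rows are stamped at insert time but only become
# visible at commit time, so a delta keyed on "lastUpdated > last import
# time" can permanently miss rows stamped before an import that ran
# mid-transaction.
rows = [{"id": i, "lastUpdated": 20 + i, "visible_at": 30} for i in range(7)]

def delta(rows, last_import_ts, now):
    """Mimic: SELECT ... WHERE lastUpdated > '<last import time>',
    run against only the rows already committed (visible) at `now`."""
    return [r for r in rows
            if r["visible_at"] <= now and r["lastUpdated"] > last_import_ts]

first = delta(rows, last_import_ts=0, now=29)    # before the commit: sees nothing
second = delta(rows, last_import_ts=29, now=35)  # after the commit, but 20..26 < 29: still nothing
```

Both imports come back empty, so the 10,000 rows are never picked up by any later delta.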

While I am testing extreme conditions here, there is a real possibility of
missing some data.

I could not find any solution in the Solr framework to handle this. The table
has an auto-increment key, and all updates are deletes followed by inserts.
So having a last_indexed_id would have helped, where last_indexed_id is the
max value of id fetched in that import. The query would then become "SELECT
id WHERE id > last_indexed_id". I suppose Solr does not have any provision
like this.
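(Editor's note: DIH can read extra request parameters in a deltaQuery via ${dataimporter.request.<name>}, so the id-based approach can be driven from outside Solr. A minimal sketch follows; the Solr URL, parameter name, and column name are assumptions, not tested config.)

```python
# Sketch: keep the high-water mark (max indexed id) outside Solr and pass
# it to DIH as a request parameter. The deltaQuery in data-config.xml
# could then reference it as:
#   WHERE id > ${dataimporter.request.last_indexed_id}
# (core URL, parameter and column names below are illustrative assumptions)
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_DIH = "http://localhost:8983/solr/dataimport"  # assumed DIH handler URL

def build_delta_url(last_indexed_id: int) -> str:
    """Build a delta-import URL carrying our own id high-water mark."""
    params = urlencode({
        "command": "delta-import",
        "commit": "true",
        "last_indexed_id": last_indexed_id,  # read via ${dataimporter.request.last_indexed_id}
    })
    return f"{SOLR_DIH}?{params}"

def run_delta(last_indexed_id: int) -> None:
    """Trigger the import; the caller records the new max id only on success."""
    with urlopen(build_delta_url(last_indexed_id)) as resp:
        print(resp.status)
```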

Two options I could think of are:
(a) Ensure at the application level that DB updates and DIH import requests
never run concurrently.
(b) Use exclusive locking during DB updates.

What is the best way to address this problem?

Shashi,

I was not solving the same problem, but perhaps you can adapt my solution to yours. My main problem was that I don't have a modified date in my database, and due to the size of the table, it is impractical to add one. Instead, I chose to track the database primary key (a simple autoincrement) outside of Solr and pass min/max values into DIH for it to use in the SELECT statement. You can see a simplified version of my entity here, with a URL showing how to send the parameters in via the dataimport GET:

http://www.mail-archive.com/solr-user@lucene.apache.org/msg40466.html

The update script that runs every two minutes gets MAX(did) from the database, retrieves the minDid from a file on an NFS share, and runs a delta-import with those two values. When the import is reported successful, it writes the maxDid value to the minDid file on the network share for the next run. If the import fails, it sends an alarm and doesn't update the minDid.

Shawn
