On 9/22/2010 1:39 AM, Shashikant Kore wrote:
Hi,
I'm using DIH to index records from a database. After every update on
(MySQL) DB, Solr DIH is invoked for delta import. In my tests, I have
observed that if db updates and DIH import is happening concurrently, import
misses few records.
Here is how it happens.
The table has a column 'lastUpdated' which has default value of current
timestamp. Many records are added to database in a single transaction that
takes several seconds. For example, if 10,000 rows are being inserted, the
rows may get timestamp values from '2010-09-20 18:21:20' to '2010-09-20
18:21:26'. These rows become visible only after transaction is committed.
That happens at, say, '2010-09-20 18:21:30'.
If Solr is import gets triggered at '18:20:29', it will use a timestamp of
last import for delta query. This import will not see the records added in
the aforementioned transaction as transaction was not committed at that
instant. After this import, the dataimport.properties will have last index
time as '18:20:29'. The next import will not able to get all the rows of
previously referred trasaction as some of the rows have timestamp earlier
than '18:20:29'.
While I am testing extreme conditions, there is a possibility of missing out
on some data.
I could not find any solution in Solr framework to handle this. The table
has an auto increment key, all updates are deletes followed by inserts. So,
having last_indexed_id would have helped, where last_indexed_id is the max
value of id fetched in that import. The query would then become "Select id
where id>last_indexed_id.' I suppose, Solr does not have any provision like
this.
Two options I could think of are:
(a) Ensure at application level that there are no concurrent DB updates and
DIH import requests going concurrently.
(b) Use exclusive locking during DB update
What is the best way to address this problem?
Shashi,
I was not solving the same problem, but perhaps you can adapt my
solution to yours. My main problem was that I don't have a modified
date in my database, and due to the size of the table, it is impractical
to add one. Instead, I chose to track the database primary key (a
simple autoincrement) outside of Solr and pass min/max values into DIH
for it to use in the SELECT statement. You can see a simplified version
of my entity here, with a URL showing how to send the parameters in via
the dataimport GET:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg40466.html
The update script that runs every two minutes gets MAX(did) from the
database, retrieves the minDid from a file on an NFS share, and runs a
delta-import with those two values. When the import is reported
successful, it writes the maxDid value to the minDid file on the network
share for the next run. If the import fails, it sends an alarm and
doesn't update the minDid.
Shawn