Re: DataImportHandler Robustness For Imports That Take A Long Time

Chris Harris Fri, 13 Mar 2009 12:02:03 -0700

Wouldn't this approach get confused if there was an error that caused
DIH to do a rollback? For example, suppose this happened:


* 1000 successful document adds
* The custom transformer saves some marker in the DB to signal that
the above docs have been successfully indexed
* The next document add throws an exception
* DIH, rather than doing a commit, rolls back the 1000 document adds

At this point my database marker says that the 1000 docs have been
successfully indexed, but the documents themselves are not actually in
the Solr index. Because by hypothesis my import query is defined in
terms of my DB marker, I'll never end up getting these docs into the
Solr index, even if I resolve the issue that causes the exception and
re-run the data import.
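
The sequence above can be sketched as a toy Python model. This is not DIH or Solr code: ToyIndex, run_import, batch_size and bad_ids are all hypothetical names, and the "marker" is simply the highest row id recorded as done. The model assumes rows arrive in ascending id order and that the marker is written before any commit happens.

```python
# Toy model (not DIH code): a pending buffer, a committed index, a DB marker.
class ToyIndex:
    def __init__(self):
        self.committed = set()  # documents that survived a commit
        self.pending = set()    # documents added but not yet committed

    def add(self, doc_id):
        self.pending.add(doc_id)

    def commit(self):
        self.committed |= self.pending
        self.pending.clear()

    def rollback(self):
        self.pending.clear()


def run_import(index, rows, marker, batch_size, bad_ids):
    """Import rows with id > marker and return the new marker.

    The marker advances after every batch_size adds, BEFORE any commit,
    which is the ordering the message worries about.
    """
    added = 0
    try:
        for doc_id in rows:
            if doc_id <= marker:
                continue  # the import query skips rows at or below the marker
            if doc_id in bad_ids:
                raise ValueError(f"cannot index doc {doc_id}")
            index.add(doc_id)
            added += 1
            if added % batch_size == 0:
                marker = doc_id  # marker written ahead of the commit
        index.commit()
    except ValueError:
        index.rollback()  # the adds are discarded, the marker is not
    return marker


rows = list(range(1, 2001))
index = ToyIndex()

# First run: doc 1001 fails, so the 1000 pending adds are rolled back.
marker = run_import(index, rows, marker=0, batch_size=1000, bad_ids={1001})
# Second run: the failure is fixed, but the query starts after the marker.
marker = run_import(index, rows, marker=marker, batch_size=1000, bad_ids=set())

missing = [d for d in rows if d not in index.committed]
print(len(missing), missing[0], missing[-1])  # 1000 1 1000
```

In this model, docs 1 through 1000 never reach the committed set, even after the failing row is fixed and the import is re-run.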

It seems like, to do a safe equivalent of your suggestion, I'd have to
somehow A) prevent DIH from doing any rollbacks, B) get DIH to do
auto-commits, and C) make my custom transformer update the DB marker
only immediately after an auto-commit.
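
A minimal sketch of that safer ordering, again as a toy Python model rather than real DIH code. ToyIndex and run_import_safely are hypothetical names, rows are assumed to arrive in ascending id order, and point A is modelled loosely: a rollback can still happen, but it only discards adds made since the last commit.

```python
# Toy model (not DIH code): commit per batch, then advance the marker.
class ToyIndex:
    def __init__(self):
        self.committed = set()  # documents that survived a commit
        self.pending = set()    # documents added but not yet committed

    def add(self, doc_id):
        self.pending.add(doc_id)

    def commit(self):
        self.committed |= self.pending
        self.pending.clear()

    def rollback(self):
        self.pending.clear()


def run_import_safely(index, rows, marker, batch_size, bad_ids):
    """Import rows with id > marker and return the new marker.

    The marker moves only after the commit that covers it has succeeded.
    """
    last_added = marker
    in_batch = 0
    try:
        for doc_id in rows:
            if doc_id <= marker:
                continue  # already committed in an earlier run
            if doc_id in bad_ids:
                raise ValueError(f"cannot index doc {doc_id}")
            index.add(doc_id)
            last_added = doc_id
            in_batch += 1
            if in_batch == batch_size:
                index.commit()       # stands in for an auto-commit
                marker = last_added  # marker updated only after the commit
                in_batch = 0
        index.commit()
        marker = last_added
    except ValueError:
        index.rollback()  # discards only the uncommitted tail
    return marker


rows = list(range(1, 2501))
index = ToyIndex()

# First run: doc 1501 fails after one full batch has been committed.
marker = run_import_safely(index, rows, 0, batch_size=1000, bad_ids={1501})
after_first_run = (marker, len(index.committed))

# Second run: the failure is fixed; the import resumes after the marker.
marker = run_import_safely(index, rows, marker, batch_size=1000, bad_ids=set())
print(after_first_run, marker, len(index.committed))
```

Because the marker never runs ahead of a commit, the worst case in this model is re-adding at most one batch of rows, not losing them.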

On Mon, Mar 9, 2009 at 9:27 PM, Noble Paul നോബിള്‍  नोब्ळ्
<noble.p...@gmail.com> wrote:
> I recommend writing a simple transformer which can write an entry
> into the DB after every n documents (say 1000), and modifying your query
> to take that entry into account so that subsequent imports start from there.
>
> DIH does not write the last_index_time unless the import completes 
> successfully.
>
> On Tue, Mar 10, 2009 at 1:54 AM, Chris Harris <rygu...@gmail.com> wrote:
>> I have a dataset (7M-ish docs each of which is maybe 1-100K) that,
>> with my current indexing process, takes a few days or maybe a week to
>> put into Solr.  I'm considering maybe switching to indexing with the
>> DataImportHandler, but I'm concerned about the impact of this on
>> indexing robustness:
>>
>> If I understand DIH properly, then if Solr goes down for whatever
>> reason during an import, then DIH loses track of what it has and
>> hasn't yet indexed that round, and will thus probably do a lot of
>> redundant reimporting the next time you run an import command. (For
>> example, if DIH successfully imports row id 100, and then Solr dies
>> before the DIH import finishes, and then I restart Solr and start a
>> new delta-import, then I think DIH will import row id 100 again.) One
>> implication for my dataset seems to be that, unless Solr can actually
>> stay up for several days on end, then DIH will never finish importing
>> my data, even if I manage to keep Solr at, say, 99% uptime. This would
>> be fine if a full import took only a few hours. If a full import could
>> take a week, though, this is slightly unnerving. (Sometimes you just
>> need to restart Solr. Or the machine itself, for that matter.)
>>
>> Are there any good ways around this with DIH? One potential option is
>> to give each row in the database table not only a
>> ModificationTimestamp column but also a DataImportHandlerTimestamp
>> column, and try to get DIH to update that column whenever it finishes
>> indexing a row. Then you'd modify the WHERE clause in the DIH config
>> so that instead of determining which rows to index with something like
>>
>>  WHERE ModificationTimestamp > dataimporter.last_index_time
>>
>> you'd use something like
>>
>>  WHERE ModificationTimestamp > DataImportHandlerTimestamp
>>
>> In this way, hopefully, DIH can always pick up where it left off last time,
>> rather than trying to redo any work it might have actually managed
>> to do last round.
>>
>> (I'm using something along these lines with my current, non-DIH-based
>> indexing scheme.)
>>
>> Am I making sense here?
>>
>> Chris
>>
>
>
>
> --
> --Noble Paul
>
