I recommend writing a simple transformer that writes an entry
into the db after every n documents (say, 1000), and modifying your query
to take that entry into account so that subsequent imports start from there.

DIH does not write the last_index_time unless the import completes successfully.

On Tue, Mar 10, 2009 at 1:54 AM, Chris Harris <rygu...@gmail.com> wrote:
> I have a dataset (7M-ish docs each of which is maybe 1-100K) that,
> with my current indexing process, takes a few days or maybe a week to
> put into Solr.  I'm considering maybe switching to indexing with the
> DataImportHandler, but I'm concerned about the impact of this on
> indexing robustness:
>
> If I understand DIH properly, then if Solr goes down for whatever
> reason during an import, then DIH loses track of what it has and
> hasn't yet indexed that round, and will thus probably do a lot of
> redundant reimporting the next time you run an import command. (For
> example, if DIH successfully imports row id 100, and then Solr dies
> before the DIH import finishes, and then I restart Solr and start a
> new delta-import, then I think DIH will import row id 100 again.) One
> implication for my dataset seems to be that, unless Solr can actually
> stay up for several days on end, then DIH will never finish importing
> my data, even if I manage to keep Solr at, say, 99% uptime. This would
> be fine if a full import took only a few hours. If full import could
> take a week, though, this is slightly unnerving. (Sometimes you just
> need to restart Solr. Or the machine itself, for that matter.)
>
> Are there any good ways around this with DIH? One potential option is
> to give each row in the database table not only a
> ModificationTimestamp column but also a DataImportHandlerTimestamp
> column, and try to get DIH to update that column whenever it finishes
> indexing a row. Then you'd modify the WHERE clause in the DIH config
> so that instead of determining which rows to index with something like
>
>  WHERE ModificationTimestamp > dataimporter.last_index_time
>
> you'd use something like
>
>  WHERE ModificationTimestamp > DataImportHandlerTimestamp
>
> In this way, hopefully, DIH can always pick up where it left off last time,
> rather than trying to redo any work it might have actually managed
> to do last round.
>
> (I'm using something along these lines with my current, non-DIH-based
> indexing scheme.)
>
> Am I making sense here?
>
> Chris
>



-- 
--Noble Paul
