I have a dataset (7M-ish docs each of which is maybe 1-100K) that,
with my current indexing process, takes a few days or maybe a week to
put into Solr.  I'm considering switching to indexing with the
DataImportHandler, but I'm concerned about the impact of this on
indexing robustness:

If I understand DIH properly, then if Solr goes down for whatever
reason during an import, then DIH loses track of what it has and
hasn't yet indexed that round, and will thus probably do a lot of
redundant reimporting the next time you run an import command. (For
example, if DIH successfully imports row id 100, and then Solr dies
before the DIH import finishes, and then I restart Solr and start a
new delta-import, then I think DIH will import row id 100 again.) One
implication for my dataset seems to be that, unless Solr can actually
stay up for several days on end, then DIH will never finish importing
my data, even if I manage to keep Solr at, say, 99% uptime. This would
be fine if a full import took only a few hours. If a full import could
take a week, though, this is slightly unnerving. (Sometimes you just
need to restart Solr. Or the machine itself, for that matter.)

Are there any good ways around this with DIH? One potential option is
to give each row in the database table not only a
ModificationTimestamp column but also a DataImportHandlerTimestamp
column, and try to get DIH to update that column whenever it finishes
indexing a row. Then you'd modify the WHERE clause in the DIH config
so that instead of determining which rows to index with something like

  WHERE ModificationTimestamp > dataimporter.last_index_time

you'd use something like

  WHERE ModificationTimestamp > DataImportHandlerTimestamp

In this way, hopefully, DIH can always pick up where it left off last time,
rather than trying to redo any work it might have actually managed
to do last round.
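
To make the idea concrete, here is a rough sketch of the per-row
timestamp pattern outside of DIH, in Python, with sqlite3 standing in
for the real database. The table name (docs), the body column, the
index_fn callback (a placeholder for whatever actually sends documents
to Solr) and the batch size are all made up for illustration. Whether
DIH itself can be made to write the timestamp back is exactly the open
question, so this only shows the bookkeeping, not a DIH feature.

```python
import sqlite3

# Hypothetical schema: one row per document, plus the two timestamp columns.
SCHEMA = """
CREATE TABLE docs (
    id INTEGER PRIMARY KEY,
    body TEXT,
    ModificationTimestamp INTEGER NOT NULL,
    DataImportHandlerTimestamp INTEGER  -- NULL until the row is first indexed
)
"""


def index_pending(conn, index_fn, batch_size=100):
    """Index rows that were never indexed or changed since they were indexed.

    index_fn receives a list of (id, body) tuples and is assumed to raise
    if the documents could not be indexed. Returns the number of rows indexed.
    """
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id, body, ModificationTimestamp FROM docs "
            "WHERE DataImportHandlerTimestamp IS NULL "
            "   OR ModificationTimestamp > DataImportHandlerTimestamp "
            "ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return total

        # If this raises (Solr down, machine restarted), no row in the batch
        # is stamped, so the next run selects the same batch again.
        index_fn([(row_id, body) for row_id, body, _ in rows])

        # Stamp each row with the modification time that was actually indexed.
        # A row modified again in the meantime still satisfies the WHERE clause.
        conn.executemany(
            "UPDATE docs SET DataImportHandlerTimestamp = ? WHERE id = ?",
            [(mod_ts, row_id) for row_id, _, mod_ts in rows],
        )
        conn.commit()
        total += len(rows)
```

With this shape, an interruption costs at most one batch of redundant
work rather than the whole import. One caveat: index_fn should only
return once the documents are durably in Solr (for example, after a
commit), otherwise a crash could lose rows that were already stamped.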

(I'm using something along these lines with my current, non-DIH-based
indexing scheme.)

Am I making sense here?

Chris
