I have a dataset (7M-ish docs, each maybe 1-100K) that, with my current indexing process, takes a few days or maybe a week to put into Solr. I'm considering switching to indexing with the DataImportHandler, but I'm concerned about the impact of this on indexing robustness:
If I understand DIH properly, if Solr goes down for whatever reason during an import, DIH loses track of what it has and hasn't yet indexed that round, and will thus probably do a lot of redundant reimporting the next time you run an import command. (For example, if DIH successfully imports row id 100, and then Solr dies before the DIH import finishes, and then I restart Solr and start a new delta-import, I think DIH will import row id 100 again.)

One implication for my dataset seems to be that, unless Solr can actually stay up for several days on end, DIH will never finish importing my data, even if I manage to keep Solr at, say, 99% uptime. This would be fine if a full import took only a few hours. If a full import could take a week, though, this is slightly unnerving. (Sometimes you just need to restart Solr. Or the machine itself, for that matter.)

Are there any good ways around this with DIH?

One potential option is to give each row in the database table not only a ModificationTimestamp column but also a SolrImportTimestamp column, and try to get DIH to update that column whenever it finishes indexing a row. Then you'd modify the WHERE clause in the DIH config so that instead of determining which rows to index with something like

    WHERE ModificationTimestamp > dataimporter.last_index_time

you'd use something like

    WHERE ModificationTimestamp > SolrImportTimestamp

In this way, hopefully, DIH can always pick up where it left off last time, rather than redoing work it already managed to do last round. (I'm using something along these lines with my current, non-DIH-based indexing scheme.)

Am I making sense here?

Chris
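
P.S. To make the per-row timestamp idea concrete, here is a minimal sketch of the logic outside of DIH. It is not DIH configuration and makes several assumptions of my own: an in-memory SQLite table named docs stands in for the real database, index_doc is a hypothetical stand-in for whatever actually sends a document to Solr, and timestamps are plain integers. I've also added an explicit NULL check so that rows which have never been imported are picked up, since a bare ModificationTimestamp > SolrImportTimestamp comparison would skip them.

```python
import sqlite3


def pending_rows(conn):
    # Rows never imported, or modified since their last successful import.
    return conn.execute(
        "SELECT id, body FROM docs "
        "WHERE SolrImportTimestamp IS NULL "
        "   OR ModificationTimestamp > SolrImportTimestamp "
        "ORDER BY id"
    ).fetchall()


def run_import(conn, index_doc, now):
    """Index every pending row, stamping each row as soon as it succeeds."""
    done = 0
    for row_id, body in pending_rows(conn):
        index_doc(row_id, body)  # hypothetical indexer; may raise on an outage
        conn.execute(
            "UPDATE docs SET SolrImportTimestamp = ? WHERE id = ?",
            (now, row_id),
        )
        conn.commit()  # the checkpoint survives a crash on a later row
        done += 1
    return done


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs ("
        " id INTEGER PRIMARY KEY,"
        " body TEXT,"
        " ModificationTimestamp INTEGER NOT NULL,"
        " SolrImportTimestamp INTEGER)"
    )
    conn.executemany(
        "INSERT INTO docs (id, body, ModificationTimestamp) VALUES (?, ?, ?)",
        [(i, f"doc {i}", 10) for i in range(1, 6)],
    )
    conn.commit()

    indexed = []

    def flaky_index(row_id, body):
        # Simulate the indexer dying partway through the first run.
        if row_id == 4:
            raise ConnectionError("simulated Solr outage")
        indexed.append(row_id)

    def good_index(row_id, body):
        indexed.append(row_id)

    # Run 1: dies at row 4, but rows 1-3 are already stamped.
    try:
        run_import(conn, flaky_index, now=20)
    except ConnectionError:
        pass
    first = list(indexed)

    # Run 2: resumes with only the rows that were not stamped.
    indexed.clear()
    run_import(conn, good_index, now=30)
    second = list(indexed)

    # Row 2 is edited later, so only it becomes pending again.
    conn.execute("UPDATE docs SET ModificationTimestamp = 40 WHERE id = 2")
    conn.commit()
    indexed.clear()
    run_import(conn, good_index, now=50)
    third = list(indexed)

    return first, second, third


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is only that the checkpoint lives with each row, so an interrupted run costs at most the row that was in flight. One caveat I'm aware of: if a row is modified between being read and being stamped, stamping it with a later time could hide that change, so in a real setup it is probably safer to stamp with the time the row was read.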