I recommend writing a simple transformer that writes an entry into the db after every n documents (say 1000), and modifying your query to take that entry into account so that subsequent imports start from there.
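A rough sketch of that checkpoint idea, in Python with sqlite3 purely for illustration. This is not DIH code: in DIH the counting would sit in a custom transformer, and the query would live in the DIH config. The table names (docs, import_checkpoint), the function names and the index_doc callback are all hypothetical. It also assumes rows are read in ascending id order, so that "last id written" is a valid resume point.

```python
import sqlite3

BATCH = 1000  # the "n" above: rows indexed between checkpoint writes


def init_schema(conn):
    # Hypothetical tables: the source rows plus a one-row-per-importer checkpoint.
    conn.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS import_checkpoint "
        "(name TEXT PRIMARY KEY, last_id INTEGER NOT NULL)"
    )


def read_checkpoint(conn, name="solr"):
    # Return the last id recorded as indexed, or 0 if nothing was recorded yet.
    row = conn.execute(
        "SELECT last_id FROM import_checkpoint WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else 0


def write_checkpoint(conn, last_id, name="solr"):
    # Persist progress immediately so it survives a crash of the importer.
    conn.execute(
        "INSERT OR REPLACE INTO import_checkpoint (name, last_id) VALUES (?, ?)",
        (name, last_id),
    )
    conn.commit()


def run_import(conn, index_doc, batch=BATCH):
    """Index every row after the stored checkpoint, recording progress per batch."""
    last_id = read_checkpoint(conn)
    while True:
        # The query takes the checkpoint into account, so a restart resumes here.
        rows = conn.execute(
            "SELECT id, body FROM docs WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch),
        ).fetchall()
        if not rows:
            return last_id
        for doc_id, body in rows:
            index_doc(doc_id, body)  # stand-in for sending the document to Solr
        last_id = rows[-1][0]
        write_checkpoint(conn, last_id)
```

If the importer dies partway through a batch, the next run starts from the last recorded id, so at most n-1 rows are sent again. That should be harmless as long as Solr replaces documents that share the same unique key.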
DIH does not write the last_index_time unless the import completes successfully.

On Tue, Mar 10, 2009 at 1:54 AM, Chris Harris <rygu...@gmail.com> wrote:
> I have a dataset (7M-ish docs each of which is maybe 1-100K) that,
> with my current indexing process, takes a few days or maybe a week to
> put into Solr. I'm considering maybe switching to indexing with the
> DataImportHandler, but I'm concerned about the impact of this on
> indexing robustness:
>
> If I understand DIH properly, then if Solr goes down for whatever
> reason during an import, then DIH loses track of what it has and
> hasn't yet indexed that round, and will thus probably do a lot of
> redundant reimporting the next time you run an import command. (For
> example, if DIH successfully imports row id 100, and then Solr dies
> before the DIH import finishes, and then I restart Solr and start a
> new delta-import, then I think DIH will import row id 100 again.) One
> implication for my dataset seems to be that, unless Solr can actually
> stay up for several days on end, then DIH will never finish importing
> my data, even if I manage to keep Solr at, say, 99% uptime. This would
> be fine if a full import took only a few hours. If full import could
> take a week, though, this is slightly unnerving. (Sometimes you just
> need to restart Solr. Or the machine itself, for that matter.)
>
> Are there any good ways around this with DIH? One potential option is
> to give each row in the database table not only a
> ModificationTimestamp column but also a DataImportHandlerTimestamp
> column, and try to get DIH to update that column whenever it finishes
> indexing a row. Then you'd modify the WHERE clause in the DIH config
> so that instead of determining which rows to index with something like
>
>     WHERE ModificationTimestamp > dataimporter.last_index_time
>
> you'd use something like
>
>     WHERE ModificationTimestamp > SolrImportTimestamp
>
> In this way, hopefully, DIH can always pick up where it left off last time,
> rather than trying to redo any work it might have actually managed
> to do last round.
>
> (I'm using something along these lines with my current, non-DIH-based
> indexing scheme.)
>
> Am I making sense here?
>
> Chris
>

--
--Noble Paul