Alternatively, you can do the commit yourself after marking in the DB: Context#getSolrCore().getUpdateHandler().commit(). Or, as you mentioned, you can set up an autocommit.
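A minimal sketch of such a transformer, assuming the Solr 1.4-era DIH Transformer/Context API under discussion in this thread; the JDBC URL, the import_marker table, and the id column are hypothetical placeholders. Note that it issues the commit before writing the marker, so the marker can never get ahead of the index (the rollback concern raised in the quoted discussion below):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;
import org.apache.solr.update.CommitUpdateCommand;

public class CheckpointTransformer extends Transformer {

  private static final int BATCH = 1000; // checkpoint interval
  private int seen = 0;
  private Object prevId = null; // last row actually handed to Solr

  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    if (++seen % BATCH == 0 && prevId != null) {
      try {
        // Commit first, so the adds made so far are durable...
        context.getSolrCore().getUpdateHandler()
               .commit(new CommitUpdateCommand(false));
        // ...then record the checkpoint. The current row has not been
        // added yet, so the marker records the previous row's id.
        try (Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost/mydb")) {          // hypothetical URL
          PreparedStatement ps = conn.prepareStatement(
              "UPDATE import_marker SET last_id = ?"); // hypothetical table
          ps.setObject(1, prevId);
          ps.executeUpdate();
        }
      } catch (Exception e) {
        throw new RuntimeException("checkpoint failed", e);
      }
    }
    prevId = row.get("id");
    return row;
  }
}

Opening a fresh JDBC connection per checkpoint just keeps the sketch self-contained; in practice you would reuse a pooled connection. Since the marker only advances after a successful commit, the worst case after a crash is one re-imported batch.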
On Sat, Mar 14, 2009 at 12:31 AM, Chris Harris <rygu...@gmail.com> wrote:
> Wouldn't this approach get confused if there was an error that caused
> DIH to do a rollback? For example, suppose this happened:
>
> * 1000 successful document adds
> * The custom transformer saves some marker in the DB to signal that
>   the above docs have been successfully indexed
> * The next document add throws an exception
> * DIH, rather than doing a commit, rolls back the 1000 document adds
>
> At this point my database marker says that the 1000 docs have been
> successfully indexed, but the documents themselves are not actually
> in the Solr index. Because, by hypothesis, my import query is defined
> in terms of my DB marker, I'll never end up getting these docs into
> the Solr index, even if I resolve the issue that caused the exception
> and re-run the data import.
>
> It seems like, to do a safe equivalent of your suggestion, I'd have
> to somehow A) prevent DIH from doing any rollbacks, B) get DIH to do
> auto-commits, and C) make my custom transformer update the DB marker
> only immediately after an auto-commit.
>
> On Mon, Mar 9, 2009 at 9:27 PM, Noble Paul നോബിള് नोब्ळ्
> <noble.p...@gmail.com> wrote:
>> I recommend writing a simple transformer which writes an entry into
>> the DB after every n documents (say 1000), and modifying your query
>> to take that entry into account, so that subsequent imports will
>> start from there.
>>
>> DIH does not write the last_index_time unless the import completes
>> successfully.
>>
>> On Tue, Mar 10, 2009 at 1:54 AM, Chris Harris <rygu...@gmail.com> wrote:
>>> I have a dataset (7M-ish docs, each maybe 1-100K) that, with my
>>> current indexing process, takes a few days or maybe a week to put
>>> into Solr. I'm considering switching to indexing with the
>>> DataImportHandler, but I'm concerned about the impact of this on
>>> indexing robustness:
>>>
>>> If I understand DIH properly, if Solr goes down for whatever reason
>>> during an import, then DIH loses track of what it has and hasn't
>>> yet indexed that round, and will thus probably do a lot of
>>> redundant reimporting the next time you run an import command. (For
>>> example, if DIH successfully imports row id 100, and then Solr dies
>>> before the DIH import finishes, and then I restart Solr and start a
>>> new delta-import, I think DIH will import row id 100 again.) One
>>> implication for my dataset seems to be that, unless Solr can
>>> actually stay up for several days on end, DIH will never finish
>>> importing my data, even if I manage to keep Solr at, say, 99%
>>> uptime. This would be fine if a full import took only a few hours.
>>> If a full import could take a week, though, this is slightly
>>> unnerving. (Sometimes you just need to restart Solr. Or the machine
>>> itself, for that matter.)
>>>
>>> Are there any good ways around this with DIH? One potential option
>>> is to give each row in the database table not only a
>>> ModificationTimestamp column but also a SolrImportTimestamp column,
>>> and try to get DIH to update that column whenever it finishes
>>> indexing a row.
>>> Then you'd modify the WHERE clause in the DIH config so that,
>>> instead of determining which rows to index with something like
>>>
>>> WHERE ModificationTimestamp > dataimporter.last_index_time
>>>
>>> you'd use something like
>>>
>>> WHERE ModificationTimestamp > SolrImportTimestamp
>>>
>>> In this way, hopefully, DIH can always pick up where it left off
>>> last time, rather than trying to redo any work it might have
>>> actually managed to do last round.
>>>
>>> (I'm using something along these lines with my current,
>>> non-DIH-based indexing scheme; see the sketch after the thread.)
>>>
>>> Am I making sense here?
>>>
>>> Chris
>>
>> --
>> --Noble Paul
>

--
--Noble Paul
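For comparison, a minimal sketch of the per-row checkpoint scheme Chris describes, written against the modern SolrJ client for concreteness (the thread predates it). The ModificationTimestamp and SolrImportTimestamp columns follow the post; the docs table, the id and body columns, both URLs, and the IS NULL guard for never-imported rows are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CheckpointedIndexer {
  public static void main(String[] args) throws Exception {
    try (Connection db = DriverManager.getConnection(
             "jdbc:mysql://localhost/mydb");                // hypothetical
         SolrClient solr = new HttpSolrClient.Builder(
             "http://localhost:8983/solr/core1").build()) { // hypothetical
      Statement st = db.createStatement();
      // Select only rows not yet (re)indexed -- the WHERE clause from
      // the post, plus a guard for rows that have never been imported.
      ResultSet rs = st.executeQuery(
          "SELECT id, body FROM docs " +
          "WHERE SolrImportTimestamp IS NULL " +
          "   OR ModificationTimestamp > SolrImportTimestamp " +
          "LIMIT 1000"); // one batch; repeat until no rows remain
      PreparedStatement stamp = db.prepareStatement(
          "UPDATE docs SET SolrImportTimestamp = NOW() WHERE id = ?");
      while (rs.next()) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", rs.getLong("id"));
        doc.addField("body", rs.getString("body"));
        solr.add(doc);
        stamp.setLong(1, rs.getLong("id"));
        stamp.addBatch();
      }
      solr.commit();        // make the adds durable first...
      stamp.executeBatch(); // ...then record that they are indexed
    }
  }
}

Because the commit happens before the rows are stamped, a crash in between only causes the batch to be re-indexed on the next run (the adds overwrite by unique id); it can never leave rows marked as indexed that are missing from the index, which is exactly the failure mode discussed above.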