On 8/21/2012 6:41 AM, Alexandre Rafalovitch wrote:
I am doing an import of large records (with large full-text fields),
and somewhere around 300,000 records the DataImportHandler runs out of
heap memory during a TIKA import (triggered from a custom Processor)
and rolls back. I am using stored="false" and trying some tricks and
tracking down possible memory leaks, but I also have a question about
DIH itself.
What actually happens when I run DIH on a large job (XML source)? Does
it accumulate some sort of state in memory that it commits at the end?
If so, can I do intermediate commits to reduce the memory
requirements? Or would it help to do several passes over the same
dataset, importing only particular entries each time? I am using the
Solr 4 (alpha) UI, so I can see some of the options there.
I use Solr 3.5 and a MySQL database for import, so my setup may not be
completely relevant, but here is my experience.
Unless you turn on autoCommit in solrconfig.xml, documents will not be
searchable during the import. If you have "commit=true" for DIH (which
I believe is the default), there will be a commit at the end of the import.
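As an illustration of the autoCommit route, something like the snippet
below goes in the updateHandler section of solrconfig.xml. The maxDocs
and maxTime thresholds here are made-up example values, not tuned
recommendations; adjust them for your own document sizes and hardware:

    <!-- In solrconfig.xml: commit automatically during a long import.
         The thresholds below are illustrative guesses only. -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>25000</maxDocs>   <!-- commit after this many documents -->
        <maxTime>300000</maxTime>  <!-- or after this many milliseconds -->
      </autoCommit>
    </updateHandler>

The end-of-import commit, by contrast, is controlled by the request
parameter when you kick off DIH, along the lines of:

    http://localhost:8983/solr/dataimport?command=full-import&commit=true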
It looks like there's an out-of-memory issue filed against Solr 4 DIH
with Tika that is suspected to be a bug in Tika rather than in Solr.
The issue details mention some workarounds for those who are familiar
with Tika -- I'm not. The issue URL:
https://issues.apache.org/jira/browse/SOLR-2886
Thanks,
Shawn