On 8/21/2012 6:41 AM, Alexandre Rafalovitch wrote:
I am doing an import of large records (with large full-text fields),
and somewhere around 300,000 records the DataImportHandler runs out of
heap memory during a TIKA import (triggered from a custom Processor)
and rolls back. I am using stored="false" and trying some tricks and
tracking down possible memory leaks, but I also have a question about
DIH itself.
What actually happens when I run DIH on a large job (XML source)? Does
it accumulate some sort of state in memory that it commits at the end?
If so, can I do intermediate commits to reduce the memory
requirements? Or would it help to do several passes over the same
dataset, importing only particular entries each time? I am using the
Solr 4 (alpha) UI, so I can see some of the options there.
I use Solr 3.5 and a MySQL database for import, so my setup may not be
completely relevant, but here is my experience.
Unless you turn on autoCommit in solrconfig.xml, documents will not be
searchable during the import. If you have "commit=true" for DIH (which
I believe is the default), there will be a commit at the end of the import.
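As an illustration of the autoCommit route, something like the snippet
below goes in the updateHandler section of solrconfig.xml. The maxDocs
and maxTime thresholds here are made-up example values, not tuned
recommendations; adjust them for your own document sizes and hardware:

    <!-- In solrconfig.xml: commit automatically during a long import.
         The thresholds below are illustrative guesses only. -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>25000</maxDocs>   <!-- commit after this many documents -->
        <maxTime>300000</maxTime>  <!-- or after this many milliseconds -->
      </autoCommit>
    </updateHandler>

The end-of-import commit, by contrast, is controlled by the request
parameter when you kick off DIH, along the lines of:

    http://localhost:8983/solr/dataimport?command=full-import&commit=true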
It looks like there's an out-of-memory issue filed against Solr 4 DIH
with Tika that is suspected to be a bug in Tika rather than in Solr.
The issue details mention some workarounds for those who are familiar
with Tika -- I'm not. The issue URL:
https://issues.apache.org/jira/browse/SOLR-2886
Thanks,
Shawn