solrconfig.xml has a setting, ramBufferSizeMB, that can be used to limit the memory consumed during indexing. When this limit is reached, the buffered documents are flushed to the current segment. NOTE: the segment is NOT closed; there is no implied commit here, and the data will not be searchable until a commit happens.
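If I remember right, in Solr 4.x it goes in the <indexConfig> section (in 3.x it was under <indexDefaults>). A sketch; the 100 MB value is purely illustrative, not a recommendation:

    <indexConfig>
      <!-- Flush in-memory indexing buffers to disk once they reach
           roughly 100 MB. This bounds indexing RAM but does NOT make
           the flushed documents searchable; only a commit does that. -->
      <ramBufferSizeMB>100</ramBufferSizeMB>
    </indexConfig>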
Best,
Erick

On Wed, Aug 22, 2012 at 7:10 AM, Alexandre Rafalovitch
<arafa...@gmail.com> wrote:
> Thanks, I will look into autoCommit.
>
> I assume there are memory implications of not committing? Or is it
> just writing to a separate file, so it can theoretically go on
> indefinitely?
>
> Regards,
>    Alex.
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working. (Anonymous - via GTD
> book)
>
>
> On Wed, Aug 22, 2012 at 2:42 AM, Lance Norskog <goks...@gmail.com> wrote:
>> Solr has a separate feature called 'autoCommit'. This is configured in
>> solrconfig.xml. You can set Solr to commit all documents every N
>> milliseconds or every N documents, whichever comes first. If you want
>> intermediate commits during a long DIH session, you have to use this
>> or write your own script that issues commits.
>>
>> On Tue, Aug 21, 2012 at 8:48 AM, Shawn Heisey <s...@elyograg.org> wrote:
>>> On 8/21/2012 6:41 AM, Alexandre Rafalovitch wrote:
>>>>
>>>> I am doing an import of large records (with large full-text fields),
>>>> and somewhere around 300,000 records DataImportHandler runs out of
>>>> heap memory on a Tika import (triggered from a custom processor) and
>>>> rolls back. I am using stored=false and trying some tricks to track
>>>> down possible memory leaks, but I also have a question about DIH
>>>> itself.
>>>>
>>>> What actually happens when I run DIH on a large (XML source) job? Does
>>>> it accumulate some sort of state in memory that it commits at the
>>>> end? If so, can I do intermediate commits to reduce the memory
>>>> requirements? Or would it help to do several passes over the same
>>>> dataset, importing only particular entries each time? I am using the
>>>> Solr 4 (alpha) UI, so I can see some of the options there.
>>>
>>> I use Solr 3.5 and a MySQL database for import, so my setup may not be
>>> completely relevant, but here is my experience.
>>>
>>> Unless you turn on autoCommit in solrconfig.xml, documents will not be
>>> searchable during the import. If you have commit=true for DIH (which I
>>> believe is the default), there will be a commit at the end of the import.
>>>
>>> It looks like there's an out-of-memory issue filed against Solr 4 DIH
>>> with Tika that is suspected to be a bug in Tika rather than Solr. The
>>> issue details describe some workarounds for those who are familiar with
>>> Tika -- I'm not. The issue URL:
>>>
>>> https://issues.apache.org/jira/browse/SOLR-2886
>>>
>>> Thanks,
>>> Shawn
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
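P.S. Since autoCommit came up in the thread below: it is configured in
the <updateHandler> section of solrconfig.xml. A rough sketch, with
purely illustrative thresholds; tune both to your indexing load:

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <!-- Commit after 10,000 uncommitted documents or after 60
             seconds (maxTime is in milliseconds), whichever comes
             first. -->
        <maxDocs>10000</maxDocs>
        <maxTime>60000</maxTime>
      </autoCommit>
    </updateHandler>

The end-of-import commit Shawn mentions corresponds to DIH's commit
request parameter, e.g. a request along the lines of
/dataimport?command=full-import&commit=true (the exact handler path
depends on how DIH is registered in your solrconfig.xml).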