Hi Angel,

a while ago I had issues with a VMware VM - somehow snapshots were created
regularly, which dragged down the machine. So I think it is a good idea to
baseline the performance on a physical box before moving to VMs, production
boxes or whatever is thrown at you.
Cheers,

Siegfried Goeschl

> On 22 May 2015, at 11:15, Angel Todorov <attodo...@gmail.com> wrote:
>
> Thanks for the feedback, guys. What I am going to try now is deploying my
> SOLR server on a physical machine with more RAM and checking out this
> scenario there. I have some suspicion it could well be a hypervisor issue,
> but let's see. Just for the record - I've noticed those issues on a Win
> 2008R2 VM with 8 GB of RAM and 2 cores.
>
> I don't see anything strange in the logs. One thing that I need to change,
> though, is the verbosity of the logs in the console - it looks like by
> default SOLR writes a log entry for every single document that is indexed,
> as well as for every query that is executed.
>
> Angel
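A note on the console verbosity Angel mentions above: those per-document and
per-query lines are Solr's INFO-level request logging. One way to quiet the
console on Solr 5.x is to raise the threshold in the log4j configuration the
stock scripts read at startup. A minimal sketch, assuming the default
server/resources/log4j.properties layout; the logger names in the second
option are assumptions based on typical Solr log output, so verify them
against the class names in your own log lines:

    # server/resources/log4j.properties
    # Option 1: drop routine INFO traffic everywhere
    log4j.rootLogger=WARN, file, CONSOLE

    # Option 2: keep INFO globally and silence only the chattiest request
    # loggers (assumed names - copy the exact class names from your own logs)
    log4j.logger.org.apache.solr.update.processor.LogUpdateProcessor=WARN
    log4j.logger.org.apache.solr.core.SolrCore=WARN

Changes to this file take effect on the next restart of the Solr instance.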
> On Fri, May 22, 2015 at 1:03 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> bq: Which is logical as index growth and time needed to put something
>> to it is log(n)
>>
>> Not really. Solr indexes to segments, and each segment is a fully
>> consistent "mini index". When a segment gets flushed to disk, a new one
>> is started. Of course there'll be a _little bit_ of added overhead, but
>> it shouldn't be all that noticeable.
>>
>> Furthermore, segments are "append only". In the past, when I've indexed
>> the Wiki example, my indexing actually got faster.
>>
>> So on the surface this sounds very strange to me. Are you seeing
>> anything at all in the Solr logs that's suspicious?
>>
>> Best,
>> Erick
>>
>> On Thu, May 21, 2015 at 12:22 PM, Sergey Shvets <ser...@bintime.com> wrote:
>>>
>>> Hi Angel,
>>>
>>> We also noticed that kind of performance degradation in our workloads,
>>> which is logical, as the index grows and the time needed to put
>>> something into it is log(n).
>>>
>>> On Thursday, 21 May 2015, Angel Todorov wrote:
>>>
>>>> hi Shawn,
>>>>
>>>> Thanks a bunch for your feedback. I've played with the heap size, but
>>>> I don't see any improvement. Even if I index, say, a million docs and
>>>> the throughput is about 300 docs per sec, and then I shut down Solr
>>>> completely - after I start indexing again, the throughput drops below
>>>> 300.
>>>>
>>>> I should probably experiment with sharding those documents across
>>>> multiple SOLR cores - that should help, I guess. I am talking about
>>>> something like this:
>>>>
>>>> https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
>>>>
>>>> Thanks,
>>>> Angel
>>>>
>>>> On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>>>>
>>>>> On 5/21/2015 2:07 AM, Angel Todorov wrote:
>>>>>> I'm crawling a file system folder and indexing 10 million docs, and
>>>>>> I am adding them in batches of 5000, committing every 50,000 docs.
>>>>>> The problem I am facing is that after each commit, the number of
>>>>>> documents indexed per second gets lower and lower.
>>>>>>
>>>>>> If I do not commit at all, I can index those docs very quickly, and
>>>>>> then I commit once at the end; but once I start indexing docs
>>>>>> _after_ that (for example when new files get added to the folder),
>>>>>> indexing also slows down a lot.
>>>>>>
>>>>>> Is it normal that the SOLR indexing speed depends on the number of
>>>>>> documents that are _already_ indexed? I think it shouldn't matter
>>>>>> whether I start from scratch or index a document into a core that
>>>>>> already has a couple of million docs. Looks like SOLR is either
>>>>>> doing something in a linear fashion, or there is some magic config
>>>>>> parameter that I am not aware of.
>>>>>>
>>>>>> I've read all the perf docs, and I've tried changing mergeFactor,
>>>>>> autowarmCount, and the buffer sizes - to no avail.
>>>>>>
>>>>>> I am using SOLR 5.1.
>>>>>
>>>>> Have you changed the heap size? If you use the bin/solr script to
>>>>> start it and don't change the heap size with the -m option or
>>>>> another method, Solr 5.1 runs with a default size of 512MB, which is
>>>>> *very* small.
>>>>>
>>>>> I bet you are running into problems with frequent and then
>>>>> ultimately constant garbage collection, as Java attempts to free up
>>>>> enough memory to allow the program to continue running. If that is
>>>>> what is happening, then eventually you will see an OutOfMemoryError
>>>>> exception. The solution is to increase the heap size. I would
>>>>> probably start with at least 4G for 10 million docs.
>>>>>
>>>>> Thanks,
>>>>> Shawn
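Picking up Shawn's suggestion: with the stock 5.x scripts the heap is fixed
at startup, so trying a bigger value is just a restart away. A minimal
example, where 4g is only Shawn's starting point for 10 million docs rather
than a tuned value:

    bin/solr stop -all
    bin/solr start -m 4g

The -m option sets both -Xms and -Xmx to the given size; on the Windows box
mentioned earlier the equivalent is bin\solr.cmd start -m 4g. The resulting
JVM args are visible on the Admin UI dashboard, which makes it easy to
confirm the new heap took effect.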