For Solr / Lucene:
- Use -XX:+AggressiveOpts
- If available, huge pages can help. See
http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html
  I haven't yet followed up with my Lucene performance numbers using
huge pages: the improvement is 10-15% for large indexing jobs.
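For reference, both flags go on the JVM command line. A typical invocation for the Solr 1.x example setup might look like the following (start.jar is the Jetty launcher shipped with Solr's example directory; the heap size is illustrative, and -XX:+UseLargePages is the HotSpot flag that enables huge pages once the OS has them configured):

```shell
java -Xmx2g -XX:+AggressiveOpts -XX:+UseLargePages -jar start.jar
```

Note that the JVM will silently fall back to regular pages if the OS hasn't reserved huge pages, so check with `cat /proc/meminfo | grep Huge` that they are actually in use.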

For Lucene:
- multi-thread using java.util.concurrent.ThreadPoolExecutor
(http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html
  6.4 million full-text articles + metadata indexed, resulting in an 83GB
index; those are old numbers: things are down to ~10 hours now)
- while multithreading helps most on multicore, it also improves
performance on a single core for small numbers of threads (<6, YMMV)
with good I/O (test for your particular configuration)
- Use multiple indexes & merge at the end
- As per http://developers.sun.com/learning/javaoneonline/2008/pdf/TS-5515.pdf
use a separate ThreadPoolExecutor per index in the previous tip, reducing
queue contention. This is giving me an additional ~10%. I will blog about
this in the near future...
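The pattern in the last two tips can be sketched with plain java.util.concurrent. This is only a skeleton of the concurrency structure, not real Lucene code: each "sub-index" below is a List<String> standing in for an IndexWriter, and the class and method names are illustrative. In real code you would replace the list add with IndexWriter.addDocument() and the final merge with IndexWriter.addIndexes():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of "multiple indexes, one ThreadPoolExecutor per index,
// merge at the end". Names here are illustrative placeholders.
public class ParallelIndexSketch {

    static final int NUM_INDEXES = 4;

    public static List<String> indexAll(List<String> docs) throws Exception {
        // One single-threaded executor per sub-index: each queue has a
        // single consumer, which is what reduces queue contention.
        ExecutorService[] pools = new ExecutorService[NUM_INDEXES];
        List<List<String>> subIndexes = new ArrayList<>();
        for (int i = 0; i < NUM_INDEXES; i++) {
            pools[i] = new ThreadPoolExecutor(1, 1, 0L,
                TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
            subIndexes.add(new ArrayList<>());
        }

        // Round-robin the documents across the sub-indexes.
        for (int d = 0; d < docs.size(); d++) {
            final int idx = d % NUM_INDEXES;
            final String doc = docs.get(d);
            // In real code: writers[idx].addDocument(doc)
            pools[idx].submit(() -> subIndexes.get(idx).add(doc));
        }

        // Drain every queue, then merge the sub-indexes into one.
        // In real code: merged.addIndexes(writers...)
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < NUM_INDEXES; i++) {
            pools[i].shutdown();
            pools[i].awaitTermination(1, TimeUnit.MINUTES);
            merged.addAll(subIndexes.get(i));
        }
        return merged;
    }
}
```

Each sub-index list is only ever touched by its own single worker thread, so no locking is needed until the merge, which runs after every pool has drained.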

-glen

2009/4/9 sunnyfr <johanna...@gmail.com>:
>
> Hi Otis,
> How did you manage that? I have an 8-core machine with 8GB of RAM and an
> 11GB index for 14M docs, with 50,000 updates every 30 minutes, but my
> replication kills everything. My segments are merged so often that the
> full index is replicated and the caches are lost, and I've no idea what
> I can do now. Some help would be brilliant.
> BTW, I'm using Solr 1.4.
>
> Thanks,
>
>
> Otis Gospodnetic wrote:
>>
>> Mike is right about the occasional slow-down, which appears as a pause and
>> is due to large Lucene index segment merging.  This should go away with
>> newer versions of Lucene where this is happening in the background.
>>
>> That said, we just indexed about 20MM documents on a single 8-core machine
>> with 8 GB of RAM, resulting in nearly 20 GB index.  The whole process took
>> a little less than 10 hours - that's over 550 docs/second.  The vanilla
>> approach before some of our changes apparently required several days to
>> index the same amount of data.
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> ----- Original Message ----
>> From: Mike Klaas <mike.kl...@gmail.com>
>> To: solr-user@lucene.apache.org
>> Sent: Monday, November 19, 2007 5:50:19 PM
>> Subject: Re: Any tips for indexing large amounts of data?
>>
>> There should be some slowdown in larger indices as occasionally large
>> segment merge operations must occur.  However, this shouldn't really
>> affect overall speed too much.
>>
>> You haven't really given us enough data to tell you anything useful.
>> I would recommend trying to do the indexing via a webapp to eliminate
>> all your code as a possible factor.  Then, look for signs to what is
>> happening when indexing slows.  For instance, is Solr high in cpu, is
>> the computer thrashing, etc?
>>
>> -Mike
>>
>> On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:
>>
>>> Hi,
>>>
>>> Thanks for answering this question a while back. I have made some
>>> of the suggestions you mentioned. ie not committing until I've
>>> finished indexing. What I am seeing though, is as the index get
>>> larger (around 1Gb), indexing is taking a lot longer. In fact it
>>> slows down to a crawl. Have you got any pointers as to what I might
>>> be doing wrong?
>>>
>>> Also, I was looking at using MultiCore solr. Could this help in
>>> some way?
>>>
>>> Thank you
>>> Brendan
>>>
>>> On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:
>>>
>>>>
>>>> : I would think you would see better performance by allowing
>>>> : autocommit to handle the commit size instead of reopening
>>>> : the connection all the time.
>>>>
>>>> if your goal is "fast" indexing, don't use autoCommit at all ... just
>>>> index everything, and don't commit until you are completely done.
>>>>
>>>> autoCommitting will slow your indexing down (the benefit being
>>>> that more
>>>> results will be visible to searchers as you proceed)
>>>>
>>>>
>>>> -Hoss
>>>>
>>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/Any-tips-for-indexing-large-amounts-of-data--tp13510670p22973205.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


