Brendan - yes, it's 64-bit Linux, and the JVM got a 5.5 GB heap, though it could have worked with less.
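In case it helps anyone searching the archives later: a heap that size is just the standard JVM flags. A minimal sketch, assuming Solr is running under the bundled Jetty launcher (the launcher and sizes here are illustrative, not our exact setup):

    # give the JVM a fixed 5.5 GB heap (start.jar is the example Jetty launcher)
    java -Xms5500m -Xmx5500m -jar start.jar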
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Brendan Grainger <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, November 21, 2007 1:24:05 PM
Subject: Re: Any tips for indexing large amounts of data?

Hi Otis,

Thanks for this. Are you using a flavor of Linux, and is it 64-bit? How
much heap are you giving your JVM?

Thanks again
Brendan

On Nov 21, 2007, at 2:03 AM, Otis Gospodnetic wrote:

> Mike is right about the occasional slow-down, which appears as a
> pause and is due to large Lucene index segment merging. This
> should go away with newer versions of Lucene, where merging
> happens in the background.
>
> That said, we just indexed about 20MM documents on a single 8-core
> machine with 8 GB of RAM, resulting in a nearly 20 GB index. The
> whole process took a little less than 10 hours - that's over 550
> docs/second. The vanilla approach, before some of our changes,
> apparently required several days to index the same amount of data.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Mike Klaas <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Monday, November 19, 2007 5:50:19 PM
> Subject: Re: Any tips for indexing large amounts of data?
>
> There should be some slowdown in larger indices, as occasionally large
> segment merge operations must occur. However, this shouldn't really
> affect overall speed too much.
>
> You haven't really given us enough data to tell you anything useful.
> I would recommend trying to do the indexing via a webapp to eliminate
> all your code as a possible factor. Then, look for signs of what is
> happening when indexing slows. For instance, is Solr high in CPU? Is
> the machine thrashing?
>
> -Mike
>
> On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:
>
>> Hi,
>>
>> Thanks for answering this question a while back. I have made some
>> of the changes you suggested, i.e. not committing until I've
>> finished indexing. What I am seeing, though, is that as the index gets
>> larger (around 1 GB), indexing takes a lot longer. In fact, it
>> slows to a crawl. Do you have any pointers as to what I might
>> be doing wrong?
>>
>> Also, I was looking at using MultiCore Solr. Could this help in
>> some way?
>>
>> Thank you
>> Brendan
>>
>> On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:
>>
>>> : I would think you would see better performance by allowing auto
>>> : commit to handle the commit size instead of reopening the
>>> : connection all the time.
>>>
>>> If your goal is "fast" indexing, don't use autoCommit at all ... just
>>> index everything, and don't commit until you are completely done.
>>>
>>> autoCommitting will slow your indexing down (the benefit being
>>> that more results will be visible to searchers as you proceed).
>>>
>>> -Hoss
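To make Hoss's "commit once at the end" advice above concrete, here is a minimal sketch against Solr's XML update handler; the URL and field name are illustrative, and it assumes the <autoCommit> block in solrconfig.xml stays commented out:

    # post each batch without committing; added docs are not searchable yet
    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
      --data-binary '<add><doc><field name="id">doc1</field></doc></add>'

    # ... post the remaining batches the same way ...

    # issue a single commit only after all documents are in
    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
      --data-binary '<commit/>'

Deferring the commit this way skips the repeated searcher reopens and flush-driven segment churn that autoCommit causes mid-load, at the cost of nothing being visible to searchers until the final commit.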