4 weeks is a depressingly long time to re-index!

Do you use multiple threads for indexing?  A large RAM buffer size
also helps, but I think performance peaks out maybe around 512 MB (at
least based on past tests)?
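
For example, something along these lines (rough sketch only -- the
exact setters moved around between Lucene releases; dir/analyzer are
whatever you already use, and nextDoc()/numIndexThreads are just
placeholders for your own code):

  // One IndexWriter shared by all indexing threads; addDocument is
  // thread-safe, so the threads just hammer the same writer.
  // (assume the enclosing method declares "throws IOException")
  final IndexWriter writer =
      new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
  writer.setRAMBufferSizeMB(512);   // flush segments by RAM used, not doc count

  for (int i = 0; i < numIndexThreads; i++) {
    new Thread() {
      public void run() {
        try {
          Document doc;
          while ((doc = nextDoc()) != null) {   // nextDoc(): your own doc source
            writer.addDocument(doc);
          }
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    }.start();
  }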

Believe it or not, merging is typically compute bound.  It's costly to
decode & re-encode all the vInts.

A larger merge factor is good because it means the postings are
copied fewer times, but it's also risky: you could run out of file
descriptors, and if the OS doesn't have enough RAM you'll start to
thin out the readahead the OS can do (which makes the merge less
efficient since the disk heads are seeking more).
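
If you do raise it, it's a one-line change (sketch; this assumes your
Lucene version still has the setMergeFactor convenience setter on
IndexWriter -- otherwise set it on the LogMergePolicy directly):

  // e.g. try 20-30 instead of the default 10.  Each level can then hold
  // that many segments before a merge fires, and a merge keeps all of
  // its input segments' files open at once -- hence the descriptor risk.
  writer.setMergeFactor(30);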

Cutting over to SSDs would also be a good idea, but they're still
kinda pricey ;)

Do you do any deleting?

Do you use stored fields and/or term vectors?  If so, try to make
your docs "uniform" if possible, i.e. add the same fields in the same
order.  This lets Lucene use bulk byte-copy merging under the hood.
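
I.e., something like this for every doc (sketch -- the field names
and values here are made up; the point is that the add order never
changes from one doc to the next):

  // Same fields, in the same order, for every document:
  Document doc = new Document();
  doc.add(new Field("id",    id,    Field.Store.YES, Field.Index.NOT_ANALYZED));
  doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
  doc.add(new Field("body",  body,  Field.Store.NO,  Field.Index.ANALYZED));
  writer.addDocument(doc);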

I wouldn't set such a huge merge factor that you effectively disable
all merging until the end... you want to take advantage of concurrency
while you're indexing docs to get any/all merging done that you can.
Waiting to do all merging at the end means you (unnecessarily)
serialize indexing & merging...
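
The default ConcurrentMergeScheduler already runs merges in background
threads while addDocument keeps going; if the box has spare CPU and
I/O you can let a few merges run at once (rough sketch; the setter
lives on IndexWriter in the versions I've used):

  ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
  cms.setMaxThreadCount(3);          // allow a few merges to run concurrently
  writer.setMergeScheduler(cms);     // it's already the default scheduler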

Mike

On Tue, Oct 5, 2010 at 2:40 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Hi all,
>
> At some point we will need to re-build an index that totals about 3 terabytes 
> in size (split over 12 shards).  At our current indexing speed we estimate 
> that this will take about 4 weeks.  We would like to reduce that time.  It 
> appears that our main bottleneck is disk I/O during index merging.
>
> Each index is somewhere between 250 and 350GB.  We are currently using a 
> mergeFactor of 10 and a ramBufferSizeMB of 32MB.  This means we get merges at 
> approximately every 320 MB, 3.2 GB, and 32 GB.  We are doing 
> this offline and will run an optimize at the end.  What we would like to do 
> is reduce the number of intermediate merges.   We thought about just using a 
> nomerge merge policy and then optimizing at the end, but suspect we would run 
> out of filehandles and that merging 10,000 segments during an optimize might 
> not be efficient.
>
> We would like to find some optimum mergeFactor somewhere between 0 (noMerge 
> merge policy) and 1,000.  (We are also planning to raise the ramBufferSizeMB 
> significantly).
>
> What experience do others have using a large mergeFactor?
>
> Tom
