4 weeks is a depressingly long time to re-index!  Do you use multiple threads for indexing?  A large RAM buffer size is also good, but I think performance peaks out maybe around 512 MB (at least based on past tests).
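For example, something along these lines -- just a rough sketch against the 3.0.x APIs; the index path, analyzer choice, buffer size and thread count below are placeholders, not your actual setup:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ConcurrentIndexer {
  public static void main(String[] args) throws Exception {
    final IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("/path/to/shard")),
        new StandardAnalyzer(Version.LUCENE_30),
        true, IndexWriter.MaxFieldLength.UNLIMITED);

    // Bigger in-RAM buffer -> fewer, larger flushed segments.
    writer.setRAMBufferSizeMB(256);

    // IndexWriter is thread safe: several threads can feed the same writer,
    // so analysis/inversion keeps the CPUs busy while merges run.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (int i = 0; i < 4; i++) {
      pool.execute(new Runnable() {
        public void run() {
          // pull docs from your source and call writer.addDocument(doc)
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    writer.close();
  }
}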
Believe it or not, merging is typically compute bound.  It's costly to decode & re-encode all the vInts.

A larger merge factor is good because it means the postings are copied fewer times, but it's bad because you risk running out of file descriptors, and, if the OS doesn't have enough RAM, you'll start to thin out the readahead the OS can do (which makes the merge less efficient since the disk heads are seeking more).  Cutting over to SSDs would also be a good idea, but, kinda pricey still ;)

Do you do any deleting?

Do you use stored fields and/or term vectors?  If so, try to make your docs "uniform" if possible, ie add the same fields in the same order.  This enables Lucene to use bulk byte-copy merging under the hood (see the sketch at the bottom of this mail).

I wouldn't set such a huge merge factor that you effectively disable all merging until the end... because you want to take advantage of the concurrency while you're indexing docs to get any/all merging done that you can.  Waiting to do all merging at the end means you serialize (unnecessarily) indexing & merging...

Mike

On Tue, Oct 5, 2010 at 2:40 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Hi all,
>
> At some point we will need to re-build an index that totals about 3 terabytes
> in size (split over 12 shards).  At our current indexing speed we estimate
> that this will take about 4 weeks.  We would like to reduce that time.  It
> appears that our main bottleneck is disk I/O during index merging.
>
> Each index is somewhere between 250 and 350 GB.  We are currently using a
> mergeFactor of 10 and a ramBufferSizeMB of 32 MB.  What this means is that we
> get merges at approximately every 320 MB, 3.2 GB, and 32 GB.  We are doing
> this offline and will run an optimize at the end.  What we would like to do
> is reduce the number of intermediate merges.  We thought about just using a
> no-merge merge policy and then optimizing at the end, but suspect we would run
> out of filehandles, and that merging 10,000 segments during an optimize might
> not be efficient.
>
> We would like to find some optimum mergeFactor somewhere between 0 (noMerge
> merge policy) and 1,000.  (We are also planning to raise the ramBufferSizeMB
> significantly.)
>
> What experience do others have using a large mergeFactor?
>
> Tom
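Here's the sketch I mentioned above, to make the "uniform docs" and merge-factor points concrete -- again 3.0.x-era API, and the field names and the value 30 are made up for illustration, not a recommendation tuned for your data:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class UniformDocs {

  // Add the same fields, in the same order, for every document; that lets
  // merging bulk byte-copy stored fields / term vectors instead of decoding
  // and re-encoding them.
  static Document makeDoc(String id, String title, String text) {
    Document doc = new Document();
    doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("text", text, Field.Store.NO, Field.Index.ANALYZED));
    return doc;
  }

  // Raise the merge factor above the default 10, but keep merging running
  // concurrently with indexing rather than deferring it all to the final
  // optimize.
  static void configureMerging(IndexWriter writer) {
    writer.setMergeFactor(30);
  }
}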