You could do periodic small optimizes. The optimize command now takes a 'maxSegments' parameter, which caps the number of segments left after the optimize.
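For example, something like this (a rough, untested SolrJ sketch, assuming a 1.4-era client; the URL and segment count are placeholders):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class PartialOptimize {
      public static void main(String[] args) throws Exception {
        // Placeholder shard URL -- point this at one of your shards.
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr/shard1");
        // waitFlush=true, waitSearcher=true, maxSegments=16:
        // merge down to at most 16 segments instead of a full optimize to 1.
        solr.optimize(true, true, 16);
      }
    }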
It is possible to write a Lucene program that collects a bunch of segments and anoints them as an index. This gives you a way to gather up segments after you write them with NoMergePolicy. As long as you are strict about not writing duplicate records, you can shovel segments here and there and collect them into the real index as you please. Ugly? Yes. (A rough sketch of this is appended after the quoted thread below.)

On Tue, Oct 5, 2010 at 4:12 PM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> 4 weeks is a depressingly long time to re-index!
>
> Do you use multiple threads for indexing? A large RAM buffer size is
> also good, but I think performance peaks out maybe around 512 MB (at
> least based on past tests).
>
> Believe it or not, merging is typically compute bound. It's costly to
> decode & re-encode all the vInts.
>
> A larger merge factor is good because it means the postings are copied
> fewer times, but it's bad because you risk running out of file
> descriptors, and, if the OS doesn't have enough RAM, you'll start to
> thin out the readahead that the OS can do (which makes the merge less
> efficient since the disk heads are seeking more).
>
> Cutting over to SSDs would also be a good idea, but, kinda pricey
> still ;)
>
> Do you do any deleting?
>
> Do you use stored fields and/or term vectors? If so, try to make
> your docs "uniform" if possible, i.e. add the same fields in the same
> order. This enables Lucene to use bulk byte-copy merging under the
> hood.
>
> I wouldn't set such a huge merge factor that you effectively disable
> all merging until the end... because you want to take advantage of
> the concurrency while you're indexing docs to get any/all merging done
> that you can. To wait and do all merging at the end means you
> serialize (unnecessarily) indexing & merging...
>
> Mike
>
> On Tue, Oct 5, 2010 at 2:40 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
>> Hi all,
>>
>> At some point we will need to re-build an index that totals about 3
>> terabytes in size (split over 12 shards). At our current indexing speed
>> we estimate that this will take about 4 weeks. We would like to reduce
>> that time. It appears that our main bottleneck is disk I/O during index
>> merging.
>>
>> Each index is somewhere between 250 and 350 GB. We are currently using a
>> mergeFactor of 10 and a ramBufferSizeMB of 32. This means that we get
>> merges at roughly every 320 MB, 3.2 GB, and 32 GB of indexed data. We are
>> doing this offline and will run an optimize at the end. What we would
>> like to do is reduce the number of intermediate merges. We thought about
>> just using a NoMerge merge policy and then optimizing at the end, but
>> suspect we would run out of file handles, and that merging 10,000
>> segments during an optimize might not be efficient.
>>
>> We would like to find some optimum mergeFactor somewhere between 0
>> (NoMerge merge policy) and 1,000. (We are also planning to raise the
>> ramBufferSizeMB significantly.)
>>
>> What experience do others have using a large mergeFactor?
>>
>> Tom

--
Lance Norskog
goks...@gmail.com
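Here is a rough, untested sketch of that segment-collecting approach. It is written against a newer Lucene API than the releases current on this thread (IndexWriterConfig, NoMergePolicy.INSTANCE, addIndexes and forceMerge have all shifted names across 3.x/4.x/5.x), and the paths, field name, buffer size and segment count are placeholders, not recommendations:

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.NoMergePolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class CollectSegments {

      /** Phase 1: write a side index with merging disabled and a large RAM buffer. */
      static void writeSideIndex(String path) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer())
            .setOpenMode(IndexWriterConfig.OpenMode.CREATE)
            .setRAMBufferSizeMB(512)                 // per Mike's note, gains seem to flatten out around here
            .setMergePolicy(NoMergePolicy.INSTANCE); // no intermediate merges at all
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(path)), iwc)) {
          // ... add your real documents here ...
          Document doc = new Document();
          doc.add(new TextField("text", "some document text", Field.Store.NO));
          writer.addDocument(doc);
        }
      }

      /** Phase 2: fold the side indexes into the real index, then merge down. */
      static void collect(String mainPath, String... sidePaths) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(mainPath)), iwc)) {
          Directory[] sides = new Directory[sidePaths.length];
          for (int i = 0; i < sidePaths.length; i++) {
            sides[i] = FSDirectory.open(Paths.get(sidePaths[i]));
          }
          // addIndexes copies the foreign segments into the target index as-is.
          writer.addIndexes(sides);
          // Optional: merge down to a manageable segment count rather than all the way to 1.
          writer.forceMerge(16);
          for (Directory side : sides) {
            side.close();
          }
        }
      }

      public static void main(String[] args) throws Exception {
        writeSideIndex("/indexes/batch-001");
        writeSideIndex("/indexes/batch-002");
        collect("/indexes/main", "/indexes/batch-001", "/indexes/batch-002");
      }
    }

The dedup caveat above still applies: addIndexes blindly copies whatever segments you hand it, so keeping duplicate records out is entirely up to you.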