One thing the Codec API makes possible (in theory, anyway) is a variable-gap terms index.
I.e., Lucene today makes an indexed term at regular intervals (every N terms -- 128 in 3.x, 32 in 4.0). But this is rather silly. Imagine the terms you are going through are all singletons (they occur in only one doc, e.g. if they are OCR noise or whatever). Maybe you have 500 such terms in sequence and then you hit a "real" term with a high freq. In that case you don't really need to add any indexed terms from those 500; instead, make the real term the indexed term. A TermQuery against one of those singleton terms is going to be wicked fast, so you can afford the extra term-seek time, whereas a TermQuery against a high-frequency term will be costly, so you want to minimize its term-seek time.

Such an approach could tremendously reduce the RAM required by the terms index, with no appreciable hit to the worst-case queries (and possibly a slight improvement).

Mike

On Sat, Sep 11, 2010 at 7:51 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom <tburt...@umich.edu> wrote:
>> Is there an example of how to set up the divisor parameter in
>> solrconfig.xml somewhere?
>
> Alas, I don't know how to configure the terms index divisor from Solr...
>
>>>> In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use
>>>> large parallel arrays instead of separate objects, and we hold much
>>>> less in RAM. Simply upgrading to 4.0 and re-indexing will show this
>>>> gain...
>>
>> I'm looking forward to a number of the developments in 4.0, but am a bit
>> wary of using it in production. I've wanted to work in some tests with
>> 4.0, but other more pressing issues have so far prevented this.
>
> Understood.
>
>> What about Lucene 2205? Would that be a way to get some of the benefit
>> similar to the changes in flex without the rest of the changes in flex
>> and 4.0?
>
> 2205 was a similar idea (don't create tons of small objects), but it
> was never committed...
>>>> I'd be really curious to test the RAM reduction in 4.0 on your terms
>>>> dict/index -- is there any way I could get a copy of just the tii/tis
>>>> files in your index? Your index is a great test for Lucene!
>>
>> We haven't been able to make much data available due to copyright and
>> other legal issues. However, since there is absolutely no way anyone
>> could reconstruct copyrighted works from the tii/tis index alone, that
>> should be ok on that front. On Monday I'll try to get
>> legal/administrative clearance to provide the data, and also ask around
>> to see if I can get the ok to either find a spare hard drive to ship or
>> make some kind of sftp arrangement. Hopefully we will find a way to do
>> this.
>
> That would be awesome, thanks!
>
>> BTW, most of the terms are probably the result of dirty OCR, and the
>> impact is probably increased by our present "punctuation filter". When
>> we re-index we plan to use a more intelligent filter that will truncate
>> extremely long tokens on punctuation, and we also plan to do some
>> minimal prefiltering prior to sending documents to Solr for indexing.
>> However, since we now have over 400 languages, we will have to be
>> conservative in our filtering, since we would rather index dirty OCR
>> than risk not indexing legitimate content.
>
> Got it... it's a great test case for Lucene :)
>
> Mike
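The variable-gap selection policy described at the top of this message can be sketched roughly as follows. This is a hypothetical, standalone illustration -- the class and method names are made up, not Lucene's actual Codec API: a term becomes an indexed term immediately when its docFreq is high (seek time matters there), and otherwise only after some maximum number of low-frequency terms have been skipped (bounding worst-case seek cost).

```java
/**
 * Hypothetical sketch (NOT Lucene's actual API) of a variable-gap
 * terms-index policy: instead of indexing every Nth term, skip
 * low-docFreq terms (a TermQuery against them is cheap, so extra
 * seek time is affordable) and force an index entry at high-docFreq
 * terms, capping the gap so worst-case seeks stay bounded.
 */
class VariableGapPolicy {
    private final int docFreqThreshold; // terms at/above this freq are always indexed
    private final int maxGap;           // never skip more than this many terms in a row
    private int sinceLastIndexed = 0;

    VariableGapPolicy(int docFreqThreshold, int maxGap) {
        this.docFreqThreshold = docFreqThreshold;
        this.maxGap = maxGap;
    }

    /** Called per term, in term order; returns true if this term should be indexed. */
    boolean isIndexTerm(int docFreq) {
        if (docFreq >= docFreqThreshold || sinceLastIndexed >= maxGap) {
            sinceLastIndexed = 0;
            return true;
        }
        sinceLastIndexed++;
        return false;
    }
}
```

On the 500-singletons-then-one-hot-term scenario above, with a (hypothetical) threshold of 100 and a max gap of 128, this emits only 4 index entries instead of the ~16 a fixed interval of 32 would produce -- which is where the terms-index RAM saving comes from.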