On Sun, Sep 12, 2010 at 1:51 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:

> On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom <tburt...@umich.edu> wrote:
>> Is there an example of how to set up the divisor parameter in
>> solrconfig.xml somewhere?
>
> Alas, I don't know how to configure the terms index divisor from Solr...
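At the Lucene level the divisor is a read-time setting: it is passed when
the IndexReader is opened, and it makes the reader load only every Nth
entry of the on-disk .tii file, so the effective interval becomes
termIndexInterval * divisor with no reindexing. A minimal sketch against
the Lucene 2.9/3.x IndexReader API (the index path and the divisor value
of 4 are just examples):

  import java.io.File;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.store.FSDirectory;

  public class OpenWithDivisor {
    public static void main(String[] args) throws Exception {
      // Load only every 4th indexed term into RAM, cutting tii memory
      // roughly 4x at the cost of somewhat slower term lookups.
      IndexReader reader = IndexReader.open(
          FSDirectory.open(new File("/path/to/index")),  // example path
          null,  // IndexDeletionPolicy: use the default
          true,  // readOnly
          4);    // termInfosIndexDivisor (example value)
      System.out.println("maxDoc=" + reader.maxDoc());
      reader.close();
    }
  }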
You can set the termIndexInterval via

  <indexDefaults>
    ...
    <termIndexInterval>128</termIndexInterval>
    ...
  </indexDefaults>

in solrconfig.xml, which has the same effect but requires reindexing.
I don't see that the index divisor is exposed, but maybe we should
expose it!

simon

>>> In 4.0, w/ flex indexing, the RAM efficiency is much better -- we
>>> use large parallel arrays instead of separate objects, and we hold
>>> much less in RAM. Simply upgrading to 4.0 and re-indexing will show
>>> this gain...
>>
>> I'm looking forward to a number of the developments in 4.0, but am a
>> bit wary of using it in production. I've wanted to work in some tests
>> with 4.0, but other more pressing issues have so far prevented this.
>
> Understood.
>
>> What about LUCENE-2205? Would that be a way to get some of the
>> benefit similar to the changes in flex, without the rest of the
>> changes in flex and 4.0?
>
> 2205 was a similar idea (don't create tons of small objects), but it
> was never committed...
>
>>> I'd be really curious to test the RAM reduction in 4.0 on your terms
>>> dict/index -- is there any way I could get a copy of just the
>>> tii/tis files in your index? Your index is a great test for Lucene!
>>
>> We haven't been able to make much data available due to copyright and
>> other legal issues. However, since there is absolutely no way anyone
>> could reconstruct copyrighted works from the tii/tis index alone, that
>> should be ok on that front. On Monday I'll try to get
>> legal/administrative clearance to provide the data, and I'll also ask
>> around to see if I can get the ok to either find a spare hard drive to
>> ship or make some kind of sftp arrangement. Hopefully we will find a
>> way to do this.
>
> That would be awesome, thanks!
>
>> BTW, most of the terms are probably the result of dirty OCR, and the
>> impact is probably increased by our present "punctuation filter".
>> When we re-index we plan to use a more intelligent filter that will
>> truncate extremely long tokens on punctuation, and we also plan to do
>> some minimal prefiltering prior to sending documents to Solr for
>> indexing. However, since we now have over 400 languages, we will have
>> to be conservative in our filtering, since we would rather index
>> dirty OCR than risk not indexing legitimate content.
>
> Got it... it's a great test case for Lucene :)
>
> Mike
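The truncate-on-punctuation filter Tom describes above could be
prototyped as a custom TokenFilter. A rough sketch only, against the
Lucene 3.1+/4.0 CharTermAttribute API; the class name and the length
threshold are made up for illustration:

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  // Hypothetical filter: tokens longer than maxLen are cut back to the
  // last punctuation character before the limit, so runs of dirty OCR
  // are shortened without dropping over-long legitimate tokens outright.
  public final class TruncateAtPunctuationFilter extends TokenFilter {
    private final CharTermAttribute termAtt =
        addAttribute(CharTermAttribute.class);
    private final int maxLen;  // e.g. 64; the real threshold is a tuning choice

    public TruncateAtPunctuationFilter(TokenStream in, int maxLen) {
      super(in);
      this.maxLen = maxLen;
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {
        return false;
      }
      if (termAtt.length() > maxLen) {
        char[] buf = termAtt.buffer();
        int cut = maxLen;
        for (int i = maxLen; i > 0; i--) {  // scan backwards for punctuation
          if (!Character.isLetterOrDigit(buf[i - 1])) {
            cut = i - 1;
            break;
          }
        }
        termAtt.setLength(cut > 0 ? cut : maxLen);  // else a hard cut at maxLen
      }
      return true;
    }
  }

It would slot into an analyzer chain right after the tokenizer, so only
individual over-long tokens are shortened and everything else passes
through unchanged.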