On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom <tburt...@umich.edu> wrote:

> Is there an example of how to set up the divisor parameter in
> solrconfig.xml somewhere?

Alas I don't know how to configure terms index divisor from Solr...
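At the Lucene level, though, the divisor is just an extra argument when
you open the IndexReader.  Something like this (a rough, untested sketch
against the 3.x API; the index path is made up):

  import java.io.File;

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class TermsIndexDivisorExample {
    public static void main(String[] args) throws Exception {
      Directory dir = FSDirectory.open(new File("/path/to/index"));

      // Load only every 4th entry of the terms index (.tii) into RAM:
      // ~4X less RAM for the terms index, at some cost in term-lookup
      // speed.
      final int termInfosIndexDivisor = 4;

      // null deletion policy, read-only reader
      IndexReader reader = IndexReader.open(dir, null, true,
                                            termInfosIndexDivisor);
      try {
        System.out.println("maxDoc=" + reader.maxDoc());
      } finally {
        reader.close();
      }
    }
  }

Solr would have to pass that last argument through when it opens its
searcher's reader; I just don't know whether there's a config hook for
it today.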
>>> In 4.0, w/ flex indexing, the RAM efficiency is much better -- we
>>> use large parallel arrays instead of separate objects, and we hold
>>> much less in RAM.  Simply upgrading to 4.0 and re-indexing will show
>>> this gain...
>
> I'm looking forward to a number of the developments in 4.0, but am a
> bit wary of using it in production.  I've wanted to work in some tests
> with 4.0, but other more pressing issues have so far prevented this.

Understood.

> What about Lucene 2205?  Would that be a way to get some of the
> benefit similar to the changes in flex without the rest of the changes
> in flex and 4.0?

2205 was a similar idea (don't create tons of small objects), but it was
never committed...

>>> I'd be really curious to test the RAM reduction in 4.0 on your terms
>>> dict/index -- is there any way I could get a copy of just the
>>> tii/tis files in your index?  Your index is a great test for Lucene!
>
> We haven't been able to make much data available due to copyright and
> other legal issues.  However, since there is absolutely no way anyone
> could reconstruct copyrighted works from the tii/tis index alone, that
> should be OK on that front.  On Monday I'll try to get
> legal/administrative clearance to provide the data, and also ask
> around and see if I can get the OK to either find a spare hard drive
> to ship or make some kind of sftp arrangement.  Hopefully we will find
> a way to do this.

That would be awesome, thanks!

> BTW, most of the terms are probably the result of dirty OCR, and the
> impact is probably increased by our present "punctuation filter".
> When we re-index we plan to use a more intelligent filter that
> truncates extremely long tokens on punctuation, and we also plan to do
> some minimal prefiltering prior to sending documents to Solr for
> indexing.  However, since we now have over 400 languages, we will have
> to be conservative in our filtering, since we would rather index dirty
> OCR than risk not indexing legitimate content.

Got it... it's a great test case for Lucene :)

Mike
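PS: FWIW, the truncate-on-punctuation filter you describe can be pretty
small.  Here's a rough, untested sketch against the 3.x analysis API --
the class name and the cutoff policy are my own invention, and a real
filter might prefer to split at the punctuation instead of dropping the
tail:

  import java.io.IOException;

  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;

  /**
   * Truncates any token longer than maxLen at the last
   * non-letter/digit character before the cutoff; falls back to a
   * hard cut at maxLen if no such character is found.  Offsets are
   * left untouched for simplicity.
   */
  public final class TruncateAtPunctuationFilter extends TokenFilter {

    private final int maxLen;
    private final TermAttribute termAtt;

    public TruncateAtPunctuationFilter(TokenStream in, int maxLen) {
      super(in);
      this.maxLen = maxLen;
      this.termAtt = addAttribute(TermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {
        return false;
      }
      if (termAtt.termLength() > maxLen) {
        final char[] buf = termAtt.termBuffer();
        int cut = maxLen;                    // hard-cut fallback
        for (int i = maxLen; i > 1; i--) {
          if (!Character.isLetterOrDigit(buf[i - 1])) {
            cut = i - 1;                     // cut at the punctuation
            break;
          }
        }
        termAtt.setTermLength(cut);
      }
      return true;
    }
  }

Character.isLetterOrDigit is Unicode-aware, which should help given the
400+ languages.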