Thanks Mike,

>>Do you use a terms index divisor?  Setting that to 2 would halve the
>>amount of RAM required but double (on average) the seek time to locate
>>a given term (but, depending on your queries, that seek time may still
>>be a negligible part of overall query time, ie the tradeoff could be
>>very worth it).

On Monday I plan to switch to Solr 1.4.1 on our test machine and experiment with the index divisor. Is there an example of how to set up the divisor parameter in solrconfig.xml somewhere?
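From poking around the example solrconfig.xml that ships with Solr, my best guess is something along the lines of the sketch below, but I haven't tried it yet, so please correct me if the element or parameter names are wrong:

   <!-- Guess based on the commented-out example in the stock solrconfig.xml:
        plug in the standard IndexReaderFactory and hand it the divisor. -->
   <indexReaderFactory name="IndexReaderFactory"
                       class="org.apache.solr.core.StandardIndexReaderFactory">
     <!-- 2 = load every 2nd indexed term into RAM, roughly halving
          the terms index memory at the cost of slower term seeks -->
     <int name="setTermIndexDivisor">2</int>
   </indexReaderFactory>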
>>In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use
>>large parallel arrays instead of separate objects, and, we hold much
>>less in RAM.  Simply upgrading to 4.0 and re-indexing will show this
>>gain...

I'm looking forward to a number of the developments in 4.0, but am a bit wary of using it in production. I've wanted to work in some tests with 4.0, but other more pressing issues have so far prevented this. What about LUCENE-2205? Would that be a way to get some of the benefit of the flex changes without taking on the rest of flex and 4.0?

>>I'd be really curious to test the RAM reduction in 4.0 on your terms
>>dict/index -- is there any way I could get a copy of just the tii/tis
>>files in your index?  Your index is a great test for Lucene!

We haven't been able to make much data available due to copyright and other legal issues. However, since there is absolutely no way anyone could reconstruct copyrighted works from the tii/tis index alone, that should be ok on that front. On Monday I'll try to get legal/administrative clearance to provide the data, and also ask around to see if I can get the ok either to find a spare hard drive to ship or to make some kind of sftp arrangement. Hopefully we will find a way to do this.

BTW, most of the terms are probably the result of dirty OCR, and the impact is probably increased by our present "punctuation filter". When we re-index we plan to use a more intelligent filter that truncates extremely long tokens on punctuation, and we also plan to do some minimal prefiltering before sending documents to Solr for indexing. However, since we now have over 400 languages, we will have to be conservative in our filtering: we would rather index dirty OCR than risk not indexing legitimate content.

Tom
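P.S. Regarding the long-token filtering mentioned above: the crudest approach would be something like Solr's stock LengthFilterFactory in the field's analyzer chain, roughly as sketched below (the field name and the 100-character limit are just illustrative). That simply drops over-length tokens outright rather than truncating them on punctuation, which is exactly why we want a more intelligent filter when we re-index.

   <fieldType name="ocrText" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <!-- drop tokens longer than 100 characters (mostly OCR garbage);
            note: this drops the token entirely, it does not truncate -->
       <filter class="solr.LengthFilterFactory" min="1" max="100"/>
     </analyzer>
   </fieldType>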