Thanks Mike,

>>Do you use a terms index divisor?  Setting that to 2 would halve the
>>amount of RAM required but double (on average) the seek time to locate
>>a given term (but, depending on your queries, that seek time may still
>>be a negligible part of overall query time, ie the tradeoff could be very 
>>worth it).

On Monday I plan to switch to Solr 1.4.1 on our test machine and experiment 
with the index divisor.  Is there an example of how to set up the divisor 
parameter in solrconfig.xml somewhere?
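From poking around the example solrconfig.xml that ships with 1.4, my guess is 
that it goes through the indexReaderFactory hook, something along these lines 
(I haven't verified the exact parameter name, so treat this as a sketch):

  <indexReaderFactory name="IndexReaderFactory"
                      class="org.apache.solr.core.StandardIndexReaderFactory">
    <!-- read every 2nd terms-index entry: roughly half the tii RAM,
         roughly double the average term seek time -->
    <int name="setTermIndexDivisor">2</int>
  </indexReaderFactory>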

>>In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use large 
>>parallel arrays instead of separate objects, and, 
>>we hold much less in RAM.  Simply upgrading to 4.0 and re-indexing will show 
>>this gain...; 

I'm looking forward to a number of the developments in 4.0, but am a bit wary 
of using it in production.  I've been wanting to run some tests with 4.0, but 
other more pressing issues have so far prevented this.

What about LUCENE-2205?  Would that be a way to get some of the memory benefit 
of the flex changes without taking on the rest of flex and 4.0?

>>I'd be really curious to test the RAM reduction in 4.0 on your terms  
>>dict/index -- 
>>is there any way I could get a copy of just the tii/tis  files in your index? 
>> Your index is a great test for Lucene!

We haven't been able to make much data available due to copyright and other 
legal issues.  However, since there is absolutely no way anyone could 
reconstruct copyrighted works from the tii/tis index alone, that should be ok 
on that front.  On Monday I'll try to get legal/administrative clearance to 
provide the data, and also ask around to see if I can get the ok either to find 
a spare hard drive to ship or to set up some kind of sftp arrangement.  
Hopefully we'll find a way to do this.

BTW, most of the terms are probably the result of dirty OCR, and the impact is 
probably increased by our present "punctuation filter".  When we re-index we 
plan to use a more intelligent filter that truncates extremely long tokens on 
punctuation (a rough sketch of the idea is below), and we also plan to do some 
minimal prefiltering before sending documents to Solr for indexing.  However, 
since we now have over 400 languages, we will have to be conservative in our 
filtering: we would rather index dirty OCR than risk not indexing legitimate 
content.
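To give a rough idea of the kind of filter we have in mind, here is a sketch 
(not our actual code; the class and parameter names are made up, and it assumes 
the trunk-style TokenFilter/CharTermAttribute API):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Sketch: instead of dropping very long tokens outright, cut them at the
 *  first punctuation character so the leading "real" word still gets indexed. */
public final class TruncateOnPunctuationFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final int maxLen;

  public TruncateOnPunctuationFilter(TokenStream input, int maxLen) {
    super(input);
    this.maxLen = maxLen;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (termAtt.length() > maxLen) {
      char[] buf = termAtt.buffer();
      int cut = maxLen;                  // fall back to a hard cut at maxLen
      for (int i = 1; i < maxLen; i++) {
        if (!Character.isLetterOrDigit(buf[i])) {
          cut = i;                       // cut at the first punctuation char
          break;
        }
      }
      termAtt.setLength(cut);
    }
    return true;
  }
}

The real filter would need to be smarter about non-Latin punctuation (and 
supplementary characters), which is where the 400-language caveat comes in.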

Tom
