On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom <tburt...@umich.edu> wrote:
>  Is there an example of how to set up the divisor parameter in solrconfig.xml 
> somewhere?

Alas, I don't know how to configure the terms index divisor from Solr...
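
If you're willing to go through Lucene's API directly, you can pass the
divisor when opening the reader.  A quick sketch against the 3.x API
(untested; the path and divisor value are just placeholders, and I'm
going from memory on the exact open(...) overload):

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class TermsIndexDivisorExample {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index"));

        // divisor = 4: load every 4th indexed term into RAM, cutting
        // the terms index memory to roughly 1/4 at some cost in seek
        // time.
        IndexReader reader = IndexReader.open(dir, null, true, 4);
        try {
          System.out.println("maxDoc=" + reader.maxDoc());
        } finally {
          reader.close();
        }
      }
    }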

>>>In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use large 
>>>parallel arrays instead of separate objects, and we hold much less in RAM.  
>>>Simply upgrading to 4.0 and re-indexing will show this gain.
>
> I'm looking forward to a number of the developments in 4.0, but am a bit wary 
> of using it in production.  I've been wanting to run some tests with 4.0, but 
> other, more pressing issues have so far prevented this.

Understood.

> What about LUCENE-2205?  Would that be a way to get some of the benefits of 
> flex without the rest of the changes in flex and 4.0?

LUCENE-2205 was a similar idea (don't create tons of small objects), but
it was never committed...
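
To give a feel for the trick, a toy sketch (not the actual flex code;
the names here are made up): instead of allocating one tiny object per
indexed term, you pack the same fields into a few big parallel arrays,
so N terms cost a handful of array headers instead of N object headers
plus N pointers:

    // One small object per indexed term: N object headers + pointers.
    class PerTermInfo {
      long freqPointer;
      long proxPointer;
      int docFreq;
    }

    // Parallel arrays: a few array headers total, contiguous in RAM.
    class PackedTermInfos {
      final long[] freqPointers;
      final long[] proxPointers;
      final int[] docFreqs;

      PackedTermInfos(int numTerms) {
        freqPointers = new long[numTerms];
        proxPointers = new long[numTerms];
        docFreqs = new int[numTerms];
      }

      // Everything is looked up by the term's ordinal instead of
      // chasing a per-term object reference.
      int docFreq(int termOrd) {
        return docFreqs[termOrd];
      }
    }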

>>>I'd be really curious to test the RAM reduction in 4.0 on your terms 
>>>dict/index -- is there any way I could get a copy of just the tii/tis files 
>>>in your index?  Your index is a great test for Lucene!
>
> We haven't been able to make much data available due to copyright and other 
> legal issues.  However, since there is absolutely no way anyone could 
> reconstruct copyrighted works from the tii/tis index alone, that should be ok 
> on that front.  On Monday I'll try to get legal/administrative clearance to 
> provide the data and also ask around and see if I can get the ok to either 
> find a spare hard drive to ship, or make some kind of sftp arrangement.  
> Hopefully we will find a way to be able to do this.

That would be awesome, thanks!

> BTW, most of the terms are probably the result of dirty OCR, and the impact 
> is probably increased by our present "punctuation filter".  When we re-index 
> we plan to use a more intelligent filter that will truncate extremely long 
> tokens on punctuation, and we also plan to do some minimal prefiltering prior 
> to sending documents to Solr for indexing.  However, since we now have over 
> 400 languages, we will have to be conservative in our filtering, since we 
> would rather index dirty OCR than risk not indexing legitimate content.

Got it... it's a great test case for Lucene :)
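
For what it's worth, such a truncating filter might look roughly like
this as a Lucene TokenFilter (a purely hypothetical sketch -- the class
name, the cutoff, and the isLetterOrDigit test standing in for
"punctuation" are all mine, not your actual filter):

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    // Tokens longer than maxLen are cut at the first punctuation-ish
    // character; if there is none, they are hard-cut at maxLen.
    public final class TruncateLongTokensFilter extends TokenFilter {
      private final int maxLen;
      private final TermAttribute termAtt;

      public TruncateLongTokensFilter(TokenStream in, int maxLen) {
        super(in);
        this.maxLen = maxLen;
        this.termAtt = addAttribute(TermAttribute.class);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        if (termAtt.termLength() > maxLen) {
          char[] buf = termAtt.termBuffer();
          // Start at 1 so we never emit an empty token.
          for (int i = 1; i < maxLen; i++) {
            if (!Character.isLetterOrDigit(buf[i])) {
              termAtt.setTermLength(i);   // cut at first punctuation
              return true;
            }
          }
          termAtt.setTermLength(maxLen);  // no punctuation: hard cut
        }
        return true;
      }
    }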

Mike
