On Sun, Sep 12, 2010 at 1:51 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom <tburt...@umich.edu> wrote:
>>  Is there an example of how to set up the divisor parameter in 
>> solrconfig.xml somewhere?
>
> Alas, I don't know how to configure the terms index divisor from Solr...

You can set the termIndexInterval via

<indexDefaults>
...
    <termIndexInterval>128</termIndexInterval>
...
</indexDefaults>

which has a similar effect (fewer index terms held in RAM) but requires
reindexing. I don't see that the index divisor is exposed, but maybe we
should expose it!
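At the raw Lucene level the divisor is a read-time setting, applied when a
reader is opened. Here's a minimal sketch against the Lucene 3.x API (the
index path is made up, and this assumes you open the reader yourself rather
than letting Solr do it):

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TermsIndexDivisorSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical index location.
    Directory dir = FSDirectory.open(new File("/path/to/index"));

    // null = default deletion policy; true = read-only;
    // 4 = terms index divisor.  Unlike termIndexInterval, this needs no
    // reindexing: the reader simply loads every 4th indexed term into RAM,
    // trading some seek cost for a ~4x smaller in-memory terms index.
    IndexReader reader = IndexReader.open(dir, null, true, 4);
    try {
      System.out.println("maxDoc=" + reader.maxDoc());
    } finally {
      reader.close();
    }
  }
}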

simon
>>>> In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use
>>>> large parallel arrays instead of separate objects, and we hold much
>>>> less in RAM.  Simply upgrading to 4.0 and re-indexing will show this
>>>> gain...
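The parallel-arrays idea is easy to picture with a toy sketch (class and
field names are invented here, not the actual 4.0 code): with millions of
terms, one object per term pays a JVM object header plus a reference for
every entry, while parallel primitive arrays pay a single allocation per
field.

// One object per term: ~16 bytes of header/padding per entry, plus a
// reference array pointing at all of them.
class TermInfoObject {
  int docFreq;
  long freqPointer;
  long proxPointer;
}

// Parallel arrays: the same fields, packed into three primitive arrays
// indexed by term ordinal -- no per-term headers, far fewer objects for
// the garbage collector to track.
class TermInfoArrays {
  final int[] docFreqs;
  final long[] freqPointers;
  final long[] proxPointers;

  TermInfoArrays(int numTerms) {
    docFreqs = new int[numTerms];
    freqPointers = new long[numTerms];
    proxPointers = new long[numTerms];
  }

  int docFreq(int ord) {
    return docFreqs[ord];
  }
}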
>>
>> I'm looking forward to a number of the developments in 4.0, but am a bit
>> wary of using it in production.  I've wanted to fit in some tests with
>> 4.0, but other more pressing issues have so far prevented this.
>
> Understood.
>
>> What about LUCENE-2205?  Would that be a way to get some of the benefits
>> of the flex changes without the rest of flex and 4.0?
>
> LUCENE-2205 was a similar idea (don't create tons of small objects), but
> it was never committed...
>
>>>> I'd be really curious to test the RAM reduction in 4.0 on your terms
>>>> dict/index -- is there any way I could get a copy of just the tii/tis
>>>> files in your index?  Your index is a great test for Lucene!
>>
>> We haven't been able to make much data available due to copyright and
>> other legal issues.  However, since there is absolutely no way anyone
>> could reconstruct copyrighted works from the tii/tis index alone, that
>> should be ok on that front.  On Monday I'll try to get
>> legal/administrative clearance to provide the data, and I'll also ask
>> around to see if I can get the ok to either find a spare hard drive to
>> ship or make some kind of sftp arrangement.  Hopefully we will find a
>> way to do this.
>
> That would be awesome, thanks!
>
>> BTW, most of the terms are probably the result of dirty OCR, and the
>> impact is probably increased by our present "punctuation filter".  When
>> we re-index we plan to use a more intelligent filter that will truncate
>> extremely long tokens on punctuation, and we also plan to do some
>> minimal prefiltering prior to sending documents to Solr for indexing.
>> However, since we now have over 400 languages, we will have to be
>> conservative in our filtering: we would rather index dirty OCR than risk
>> not indexing legitimate content.
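A rough sketch of that kind of filter against the Lucene 3.x analysis API
(the class name, max-length parameter, and the letter-or-digit test are my
own; a real filter covering 400 languages would need a much more careful
notion of "punctuation"):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Tokens longer than maxLen are truncated at the last non-letter/digit
// character before the limit, or hard-truncated at maxLen if none exists.
public final class TruncateAtPunctuationFilter extends TokenFilter {
  private final int maxLen;
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);

  public TruncateAtPunctuationFilter(TokenStream in, int maxLen) {
    super(in);
    this.maxLen = maxLen;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (termAtt.termLength() > maxLen) {
      char[] buf = termAtt.termBuffer();
      int cut = maxLen;
      // Walk back from the limit looking for a punctuation boundary,
      // keeping at least one character of the token.
      for (int i = maxLen; i > 1; i--) {
        if (!Character.isLetterOrDigit(buf[i - 1])) {
          cut = i - 1;
          break;
        }
      }
      termAtt.setTermLength(cut);
    }
    return true;
  }
}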
>
> Got it... it's a great test case for Lucene :)
>
> Mike
>
