On Sun, Sep 12, 2010 at 1:51 AM, Michael McCandless
<[email protected]> wrote:
> On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom <[email protected]> wrote:
>> Is there an example of how to set up the divisor parameter in
>> solrconfig.xml somewhere?
>
> Alas I don't know how to configure terms index divisor from Solr...
You can set the termIndexInterval via
<indexDefaults>
...
<termIndexInterval>128</termIndexInterval>
...
</indexDefaults>
which has the same effect but requires reindexing. I don't see that
the index divisor is exposed but maybe we should do so!
simon
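As a rough illustration of the trade-off Simon describes (this is a sketch, not Lucene code, and the 2.4-billion-term figure is a hypothetical corpus size): Lucene loads roughly every `interval`-th term from the .tis file into RAM, and the index divisor subsamples that again at load time, so it shrinks RAM without reindexing while `termIndexInterval` requires a rebuild.

```python
import math

def indexed_terms(total_terms: int, interval: int) -> int:
    """Approximate count of terms held in RAM for the terms index:
    roughly every `interval`-th term from the .tis file is loaded."""
    return math.ceil(total_terms / interval)

def effective_interval(interval: int, divisor: int) -> int:
    """The index divisor subsamples the already-written terms index
    at load time, so the effective interval is interval * divisor."""
    return interval * divisor

# Hypothetical corpus of 2.4 billion unique terms, default interval 128:
in_ram = indexed_terms(2_400_000_000, 128)  # 18,750,000 indexed terms
# A divisor of 4 cuts that to a quarter, with no reindexing needed:
with_divisor = indexed_terms(2_400_000_000, effective_interval(128, 4))
```

The cost of a larger effective interval is a longer linear scan between indexed terms on each term lookup.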
>>>>In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use large
>>>>parallel arrays instead of separate objects, and,
>>>>we hold much less in RAM. Simply upgrading to 4.0 and re-indexing will
>>>>show this gain...
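The parallel-arrays idea Mike mentions can be sketched like this (field names are illustrative, not Lucene's actual internals): instead of allocating one small object per term, each per-term field lives in its own flat array indexed by term ordinal, which removes per-object overhead and keeps the data contiguous.

```python
from array import array

# Old style: one small heap object per term (high per-entry overhead).
class TermInfo:
    __slots__ = ("doc_freq", "freq_pointer", "prox_pointer")
    def __init__(self, doc_freq, freq_pointer, prox_pointer):
        self.doc_freq = doc_freq
        self.freq_pointer = freq_pointer
        self.prox_pointer = prox_pointer

# Flex style: one compact primitive array per field, shared by all terms.
class TermInfoArrays:
    def __init__(self):
        self.doc_freq = array("l")      # 32/64-bit ints, no object headers
        self.freq_pointer = array("q")  # 64-bit file pointers
        self.prox_pointer = array("q")

    def add(self, doc_freq, freq_pointer, prox_pointer):
        self.doc_freq.append(doc_freq)
        self.freq_pointer.append(freq_pointer)
        self.prox_pointer.append(prox_pointer)

    def get(self, ordinal):
        """Look up one term's info by its ordinal position."""
        return (self.doc_freq[ordinal],
                self.freq_pointer[ordinal],
                self.prox_pointer[ordinal])
```

With millions of terms, dropping the per-object header and pointer indirection is where the RAM win comes from.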
>>
>> I'm looking forward to a number of the developments in 4.0, but am a bit
>> wary of using it in production. I've wanted to work in some tests with
>> 4.0, but other more pressing issues have so far prevented this.
>
> Understood.
>
>> What about Lucene 2205? Would that be a way to get some of the benefit
>> similar to the changes in flex without the rest of the changes in flex and
>> 4.0?
>
> 2205 was a similar idea (don't create tons of small objects), but it
> was never committed...
>
>>>>I'd be really curious to test the RAM reduction in 4.0 on your terms
>>>>dict/index --
>>>>is there any way I could get a copy of just the tii/tis files in your
>>>>index? Your index is a great test for Lucene!
>>
>> We haven't been able to make much data available due to copyright and other
>> legal issues. However, since there is absolutely no way anyone could
>> reconstruct copyrighted works from the tii/tis index alone, that should be
>> ok on that front. On Monday I'll try to get legal/administrative clearance
>> to provide the data and also ask around and see if I can get the ok to
>> either find a spare hard drive to ship, or make some kind of sftp
>> arrangement. Hopefully we will find a way to be able to do this.
>
> That would be awesome, thanks!
>
>> BTW Most of the terms are probably the result of dirty OCR and the impact
>> is probably increased by our present "punctuation filter". When we re-index
>> we plan to use a more intelligent filter that will truncate extremely long
>> tokens on punctuation and we also plan to do some minimal prefiltering prior
>> to sending documents to Solr for indexing. However, since we now have
>> over 400 languages, we will have to be conservative in our filtering since
>> we would rather index dirty OCR than risk not indexing legitimate content.
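The "truncate extremely long tokens on punctuation" filter Tom describes might look something like this sketch (the function name and the 64-character threshold are hypothetical choices, not HathiTrust's actual filter): tokens over the limit are cut at the last punctuation mark before the limit, with a hard cut as the fallback.

```python
MAX_TOKEN_LEN = 64  # hypothetical threshold; tune for the corpus

def truncate_on_punct(token: str, max_len: int = MAX_TOKEN_LEN) -> str:
    """Leave short tokens alone; cut overlong ones at the last
    punctuation character before max_len, else hard-cut at max_len."""
    if len(token) <= max_len:
        return token
    head = token[:max_len]
    # scan backwards for the last non-alphanumeric character in the head
    cut = next((i for i in range(len(head) - 1, 0, -1)
                if not head[i].isalnum()), None)
    return head[:cut] if cut else head
```

A filter like this keeps legitimate long compounds mostly intact while bounding the damage from run-together dirty-OCR strings.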
>
> Got it... it's a great test case for Lucene :)
>
> Mike
>