There is a trick: facets with only one occurrence tend to be misspellings or dirt. You write a program to fetch the terms (Lucene's CheckIndex is a great starting point) and create a stopwords file.
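
Something like the rough sketch below, say (instead of starting from CheckIndex it just walks the term dictionary directly with the Lucene 3.x TermEnum API -- the class name, output file, and the docFreq == 1 cutoff are all placeholders, and on 4.0/flex you would use TermsEnum instead):

import java.io.File;
import java.io.PrintWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

// Dump every term that appears in exactly one document to a stopwords file,
// on the theory that singleton terms are mostly OCR dirt and misspellings.
public class SingletonTermDumper {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])), true);
    PrintWriter out = new PrintWriter(new File("stopwords.txt"), "UTF-8");
    try {
      TermEnum terms = reader.terms();
      while (terms.next()) {
        if (terms.docFreq() == 1) {
          out.println(terms.term().text());
        }
      }
      terms.close();
    } finally {
      out.close();
      reader.close();
    }
  }
}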

Here's a data mining project: which languages are more vulnerable to dirty OCR?

Burton-West, Tom wrote:
Thanks Mike,

> Do you use a terms index divisor?  Setting that to 2 would halve the
> amount of RAM required but double (on average) the seek time to locate
> a given term (but, depending on your queries, that seek time may still
> be a negligible part of overall query time, ie the tradeoff could be
> very worth it).

On Monday I plan to switch to Solr 1.4.1 on our test machine and experiment 
with the index divisor.  Is there an example of how to set up the divisor 
parameter in solrconfig.xml somewhere?
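
I'm guessing it is something along these lines, via an explicit indexReaderFactory declaration (the exact parameter name below is from memory and needs to be checked against the example solrconfig.xml that ships with 1.4.1):

  <indexReaderFactory name="IndexReaderFactory"
                      class="org.apache.solr.core.StandardIndexReaderFactory">
    <int name="setTermIndexDivisor">2</int>
  </indexReaderFactory>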

> In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use
> large parallel arrays instead of separate objects, and we hold much less
> in RAM.  Simply upgrading to 4.0 and re-indexing will show this gain...

I'm looking forward to a number of the developments in 4.0, but am a bit wary 
of using it in production.   I've wanted to work in some tests with 4.0, but 
other more pressing issues have so far prevented this.

What about LUCENE-2205?  Would that be a way to get some of the benefit of the flex 
changes without the rest of the changes in flex and 4.0?

> I'd be really curious to test the RAM reduction in 4.0 on your terms
> dict/index -- is there any way I could get a copy of just the tii/tis
> files in your index?  Your index is a great test for Lucene!

We haven't been able to make much data available due to copyright and other 
legal issues.  However, since there is absolutely no way anyone could 
reconstruct copyrighted works from the tii/tis index alone, that should be ok 
on that front.  On Monday I'll try to get legal/administrative clearance to 
provide the data and also ask around and see if I can get the ok to either find 
a spare hard drive to ship, or make some kind of sftp arrangement.  Hopefully 
we will find a way to be able to do this.

BTW, most of the terms are probably the result of dirty OCR, and the impact is probably 
increased by our present "punctuation filter".  When we re-index we plan to use 
a more intelligent filter that will truncate extremely long tokens on punctuation, and we 
also plan to do some minimal prefiltering prior to sending documents to Solr for 
indexing.  However, since we now have over 400 languages, we will have to be 
conservative in our filtering, since we would rather index dirty OCR than risk not 
indexing legitimate content.
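
Roughly, the kind of thing I mean is a small custom TokenFilter along these lines (just a sketch against the Lucene 3.1+ analysis API, not our actual filter -- the class name and threshold are made up, and it would still need a Solr filter factory to wire it into the analyzer chain):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// If a token is longer than maxLength (almost always OCR garbage for us),
// cut it at the first non-letter/digit character, or hard-cut at maxLength.
public final class TruncateLongTokenFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final int maxLength;

  public TruncateLongTokenFilter(TokenStream input, int maxLength) {
    super(input);
    this.maxLength = maxLength;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (termAtt.length() > maxLength) {
      char[] buf = termAtt.buffer();
      int cut = maxLength;
      // Prefer to cut on punctuation so we keep a clean prefix of the token.
      for (int i = 1; i < maxLength; i++) {
        if (!Character.isLetterOrDigit(buf[i])) {
          cut = i;
          break;
        }
      }
      termAtt.setLength(cut);
    }
    return true;
  }
}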

Tom
