There is a trick: facets with only one occurrence tend to be misspellings
or dirt. You write a program to fetch the terms (Lucene's CheckIndex is
a great starting point) and use them to create a stopwords file.
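Something like the following would do it (just a sketch, assuming the Lucene 3.x
API; the output file name is a placeholder): walk the term dictionary and write
out every term whose docFreq is 1 as a stopword candidate.

    import java.io.*;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.FSDirectory;

    public class SingletonTerms {
      public static void main(String[] args) throws IOException {
        // args[0] = path to the index directory
        IndexReader reader =
            IndexReader.open(FSDirectory.open(new File(args[0])), true);
        PrintWriter out =
            new PrintWriter(new FileWriter("stopword-candidates.txt"));
        TermEnum terms = reader.terms();
        while (terms.next()) {
          if (terms.docFreq() == 1) {   // singletons are often OCR noise or typos
            out.println(terms.term().text());
          }
        }
        terms.close();
        out.close();
        reader.close();
      }
    }

Review the candidates before treating them as stopwords -- for some fields a
docFreq of 1 is perfectly legitimate.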
Here's a data mining project: which languages are more vulnerable to
dirty OCR?
Burton-West, Tom wrote:
Thanks Mike,
Do you use a terms index divisor? Setting that to 2 would halve the
amount of RAM required but double (on average) the seek time to locate
a given term (but, depending on your queries, that seek time may still
be a negligible part of overall query time, i.e. the tradeoff could be well worth
it).
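At the raw Lucene level the divisor is passed when the reader is opened --
something like this (a rough sketch against the 2.9/3.0 API; the index path is
a placeholder):

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    public class OpenWithDivisor {
      public static void main(String[] args) throws Exception {
        // termInfosIndexDivisor = 2: only every 2nd indexed term is loaded
        // into RAM, roughly halving terms-index memory at the cost of a
        // longer scan to locate a given term.
        IndexReader reader = IndexReader.open(
            FSDirectory.open(new File(args[0])),
            null,   // deletion policy; only needed if the reader deletes docs
            true,   // readOnly
            2);     // termInfosIndexDivisor
        System.out.println("maxDoc=" + reader.maxDoc());
        reader.close();
      }
    }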
On Monday I plan to switch to Solr 1.4.1 on our test machine and experiment
with the index divisor. Is there an example of how to set up the divisor
parameter in solrconfig.xml somewhere?
In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use large
parallel arrays instead of separate objects, and we hold much less in RAM.
Simply upgrading to 4.0 and re-indexing will show this gain...
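Purely as an illustration of the idea (not the actual 4.0 data structures, just
the general technique): instead of one small object per indexed term, the data
is packed into a few flat arrays indexed by term ordinal, which eliminates the
per-term object headers and String instances.

    // Illustrative only -- not the real flex code.
    public class TermsIndexLayouts {

      // Object-per-term layout: each indexed term costs an entry object, a
      // String, and the String's backing char[] (three object headers per term).
      static class TermIndexEntry {
        String termText;
        long termsFilePointer;
      }
      TermIndexEntry[] objectLayout;

      // Parallel-array layout: one shared byte[] holds all term bytes, and
      // flat arrays indexed by term ordinal hold the offsets and file pointers.
      byte[] termBytes;          // all term texts, concatenated
      int[] termByteStarts;      // termByteStarts[ord] = start of term ord in termBytes
      long[] termsFilePointers;  // termsFilePointers[ord] = file pointer for term ord
    }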
I'm looking forward to a number of the developments in 4.0, but am a bit wary
of using it in production. I've wanted to work in some tests with 4.0, but
other more pressing issues have so far prevented this.
What about LUCENE-2205? Would that be a way to get some of the benefit of the
flex changes without the rest of the changes in flex and 4.0?
I'd be really curious to test the RAM reduction in 4.0 on your terms
dict/index --
is there any way I could get a copy of just the tii/tis files in your index?
Your index is a great test for Lucene!
We haven't been able to make much data available due to copyright and other
legal issues. However, since there is absolutely no way anyone could
reconstruct copyrighted works from the tii/tis index alone, that should be ok
on that front. On Monday I'll try to get legal/administrative clearance to
provide the data, and I'll also ask around to see if I can get the ok to either
ship a spare hard drive or set up some kind of sftp arrangement. Hopefully we
will find a way to do this.
BTW, most of the terms are probably the result of dirty OCR, and the impact is
probably increased by our present "punctuation filter". When we re-index we plan
to use a more intelligent filter that truncates extremely long tokens at
punctuation, and we also plan to do some minimal prefiltering before sending
documents to Solr for indexing. However, since we now have over 400 languages,
we will have to be conservative in our filtering: we would rather index dirty
OCR than risk not indexing legitimate content.
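Something along these lines is what we have in mind (just a sketch against the
Lucene 3.x TokenFilter API; the class name, length threshold, and exact
truncation rule are placeholders, not our final implementation):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public final class TruncateLongTokenFilter extends TokenFilter {
      private final TermAttribute termAtt = addAttribute(TermAttribute.class);
      private final int maxLength;

      public TruncateLongTokenFilter(TokenStream in, int maxLength) {
        super(in);
        this.maxLength = maxLength;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        // Only touch tokens that are suspiciously long; cut them at the first
        // punctuation character instead of dropping them entirely.
        if (termAtt.termLength() > maxLength) {
          char[] buffer = termAtt.termBuffer();
          for (int i = 0; i < termAtt.termLength(); i++) {
            if (!Character.isLetterOrDigit(buffer[i])) {
              termAtt.setTermLength(i);
              break;
            }
          }
        }
        return true;
      }
    }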
Tom