Re: Too many unique terms

Erick Erickson Wed, 24 Apr 2013 04:11:26 -0700

Even if you could know ahead of time, 7M stop words is a
lot to maintain. But assuming that your index is really
pretty static, you could consider building it once, then
creating the stopword file from unique terms and re-indexing.
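
As a rough sketch of the "collect the unique terms" step: the snippet below is plain Java, not the Lucene or Solr API, and it assumes "unique" means a term that occurs in exactly one document. The class name SingletonTerms, the whitespace tokenization and the sample documents are all made up for illustration. Against a real index you would read the terms and their document frequencies from the index itself.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Solr or Lucene.
class SingletonTerms {

    /** Returns the terms that occur in exactly one document, in sorted order. */
    static TreeSet<String> singletons(List<String> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Count each term once per document, however often it repeats there.
            Set<String> seen = new HashSet<>();
            for (String term : doc.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!term.isEmpty() && seen.add(term)) {
                    docFreq.merge(term, 1, Integer::sum);
                }
            }
        }
        TreeSet<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            if (entry.getValue() == 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("solr index shard", "solr shard x9f3a", "index 77ab01");
        // Print one term per line, which is the layout a stopword file uses.
        singletons(docs).forEach(System.out::println);
    }
}
```

The output could be saved as the stopword file before re-indexing. With millions of entries you would still want to check the memory cost of loading it.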

You could consider cleaning them on the input side or
creating a custom filter that, say, checked against a dictionary
(that you'd have to find).
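
To make the dictionary idea concrete, here is a minimal sketch of cleaning on the input side, before the text ever reaches Solr. It is plain Java rather than a real Lucene TokenFilter, and the class name DictionaryCleaner, the whitespace tokenization and the tiny dictionary are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical input-side cleaner, not a Solr or Lucene class.
class DictionaryCleaner {

    /** Keeps only the whitespace-separated tokens found in the dictionary (case-insensitive). */
    static String keepKnownWords(String input, Set<String> dictionary) {
        List<String> kept = new ArrayList<>();
        for (String token : input.split("\\s+")) {
            // The dictionary is assumed to hold lowercase entries.
            if (dictionary.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("solr", "shard", "index");
        System.out.println(keepKnownWords("Solr shard x9f3a 77ab01 index", dictionary));
    }
}
```

Note that a strict dictionary check also throws away names, part numbers and anything else the dictionary has never seen, so it only fits if you really don't need those to be searchable.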

There's nothing that I know of that'll allow you to delete
unique terms from a static index.

About a regex, you could use PatternReplaceCharFilterFactory
to remove them from your input stream, but the trick is defining
"useless". Part numbers, for instance, are really useful in some
situations. There's no "standard" regex because there's no
standard definition of a meaningless term. You haven't, for
instance, provided any criteria for what "useless" is. Do you
care about e-mails? What about accents? Unicode? The list gets
pretty endless.

You should be able to write a regex that removes
everything non-alphanumeric or some such, for instance,
although even that is a problem if you're indexing anything but
plain-vanilla English. The Java pre-defined '\w', for instance,
matches only [a-zA-Z_0-9] by default. Nary an accented character
in sight.
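
A small sketch of the difference, using only java.util.regex. The class name NonWordStripper and the sample string are made up. The first pattern uses the default ASCII-only '\w'. The second uses the Unicode categories \p{L} (letters) and \p{N} (numbers) instead, which is one possible way to keep accented characters.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex is real API here.
class NonWordStripper {

    // Default \w is [a-zA-Z_0-9], so accented letters count as "non-word".
    static final Pattern ASCII_NON_WORD = Pattern.compile("[^\\w\\s]");

    // Unicode-aware alternative: keep any letter or number, plus whitespace.
    static final Pattern UNICODE_NON_WORD = Pattern.compile("[^\\p{L}\\p{N}\\s]");

    static String stripAscii(String input) {
        return ASCII_NON_WORD.matcher(input).replaceAll("");
    }

    static String stripUnicode(String input) {
        return UNICODE_NON_WORD.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "caf\u00e9 na\u00efve 42!";   // "café naïve 42!"
        System.out.println(stripAscii(input));       // loses the accented letters
        System.out.println(stripUnicode(input));     // keeps them, drops only "!"
    }
}
```

Whether the second pattern is "right" still depends on what you decide is useless. It keeps every letter in every script, which may or may not be what you want.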

Best
Erick

On Tue, Apr 23, 2013 at 3:53 PM, Manuel Le Normand
<manuel.lenorm...@gmail.com> wrote:
> Hi there,
> Looking at one of my shards (about 1M docs) I see a lot of unique terms, more
> than 8M, which is a significant part of my total term count. These are very
> likely useless terms, binaries or other meaningless numbers that come with
> a few of my docs.
> I am totally fine with deleting them so these terms would be unsearchable.
> Thinking about it, I get that
> 1. It is impossible to know a priori whether a term is unique or not, so I
> cannot add them to my stop words.
> 2. I have a performance decrease because my cached chunks contain useless
> data, and I'm short on memory.
>
> Assuming a constant index, is there a way of deleting all terms that are
> unique from at least the dictionary tim and tip files? Will I get a
> significant query time performance increase? Does anybody know a class of
> regex that identifies meaningless terms that I can add to my updateProcessor?
>
> Thanks
> Manu
