Even if you could know ahead of time, 7M stop words is a lot to maintain. But assuming that your index is really pretty static, you could consider building it once, then creating the stopword file from unique terms and re-indexing.
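
As a rough sketch of the "build once, then derive the stopword file" idea: assuming you can dump each term with its document frequency from the finished index, the unique terms are simply those with a frequency of 1. The class name, method name and in-memory map below are all hypothetical stand-ins for illustration; how you export the terms from your index is up to you.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class UniqueTermStopwords {

    // Returns, in sorted order, every term whose document frequency is exactly 1.
    // termToDocFreq is a hypothetical stand-in for a term dump taken from the
    // index after the first indexing pass.
    static List<String> singletonTerms(Map<String, Integer> termToDocFreq) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termToDocFreq.entrySet()) {
            if (e.getValue() == 1) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data, only to show the shape of the input.
        Map<String, Integer> docFreq = new LinkedHashMap<>();
        docFreq.put("solr", 3);
        docFreq.put("zzq81", 1);
        docFreq.put("index", 2);
        docFreq.put("0x7fa3", 1);

        // One term per line, which is the layout a plain stopword file uses.
        for (String term : singletonTerms(docFreq)) {
            System.out.println(term);
        }
    }
}
```

You would then redirect that output into your stopword file and re-index. Keep in mind that this only stays valid while the index really is static: a term that is unique today may show up in a second document tomorrow.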
You could consider cleaning them on the input side or creating a custom filter that, say, checked against a dictionary (that you'd have to find). There's nothing that I know of that'll allow you to delete unique terms from a static index.

About a regex, you could use PatternReplaceCharFilterFactory to remove them from your input stream, but the trick is defining "useless". Part numbers are really useful in some situations, for instance. There's nothing "standard" because there's no standard. You haven't provided any criteria for what "useless" is. Do you care about e-mails? What about accents? Unicode? The list gets pretty endless.

You should be able to write a regex that removes everything non-alphanumeric or some such, although even that is a problem if you're indexing anything but plain-vanilla English. The Java pre-defined '\w', for instance, refers to [a-zA-Z_0-9]. Nary an accented character in sight.

Best
Erick

On Tue, Apr 23, 2013 at 3:53 PM, Manuel Le Normand <manuel.lenorm...@gmail.com> wrote:
> Hi there,
> Looking at one of my shards (about 1M docs) I see a lot of unique terms, more
> than 8M, which is a significant part of my total term count. These are very
> likely useless terms, binaries or other meaningless numbers that come with
> a few of my docs.
> I am totally fine with deleting them so these terms would be unsearchable.
> Thinking about it I get that
> 1. It is impossible to know a priori whether a term is unique or not, so I
> cannot add them to my stop words.
> 2. I have a performance decrease because my cached chunks do contain useless
> data, and I'm short on memory.
>
> Assuming a constant index, is there a way of deleting all terms that are
> unique from at least the dictionary tim and tip files? Will I get a
> significant query time performance increase? Does anybody know a class of
> regex that identifies meaningless terms that I can add to my updateProcessor?
>
> Thanks
> Manu
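
To illustrate the '\w' remark in the reply above, here is a small sketch in plain java.util.regex comparing the default ASCII-only word class with a Unicode-aware alternative built from \p{L} (letters) and \p{N} (digits). The class and method names are made up for the example, and neither pattern is a definition of "useless"; they only show how differently the two character classes treat an accented letter.

```java
import java.util.regex.Pattern;

class WordCharDemo {

    // Default \w is ASCII only, i.e. [a-zA-Z_0-9]; everything that is neither
    // \w nor whitespace gets removed, including accented letters.
    static final Pattern ASCII_NON_WORD = Pattern.compile("[^\\w\\s]+");

    // Unicode-aware alternative: keep letters and digits of any script.
    // Note that, unlike \w, this does not keep the underscore.
    static final Pattern UNICODE_NON_WORD = Pattern.compile("[^\\p{L}\\p{N}\\s]+");

    static String stripAscii(String s) {
        return ASCII_NON_WORD.matcher(s).replaceAll("");
    }

    static String stripUnicode(String s) {
        return UNICODE_NON_WORD.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "caf\u00e9 #42";               // "café #42", precomposed é
        System.out.println(stripAscii(input));        // the é is dropped along with '#'
        System.out.println(stripUnicode(input));      // the é survives, '#' is dropped
    }
}
```

If the input contains decomposed accents (a base letter followed by a combining mark), the second pattern would still strip the mark, since combining marks fall under \p{M} rather than \p{L}; whether that matters depends on how the text was normalized before indexing.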