Hey Erick, thanks for the interesting reply. Indexing Unicode characters is not a problem I see, nor is indexing e-mails. I'm alright with defining as "useless" any word that is unique across my whole index.
I will try the reindexing strategy you proposed, though, as you said, a few million stop words will not be an easy list to maintain. On top of that, I will reduce the memory chunks that get saved to RAM, as most of them are junk. Since my problem seems to be very specific, I think I'll turn to the code to check how I can do it on my own. Hope this adventure goes well.

Cheers,
Manu

On Wed, Apr 24, 2013 at 2:10 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Even if you could know ahead of time, 7M stop words is a
> lot to maintain. But assuming that your index is really
> pretty static, you could consider building it once, then
> creating the stopword file from unique terms and re-indexing.
>
> You could consider cleaning them on the input side or
> creating a custom filter that, say, checked against a dictionary
> (that you'd have to find).
>
> There's nothing that I know of that'll allow you to delete
> unique terms from a static index.
>
> About a regex, you could use PatternReplaceCharFilterFactory
> to remove them from your input stream, but the trick is defining
> "useless". Part numbers are really useful in some situations,
> for instance. There's nothing "standard" because there's no
> standard. You haven't, for instance, provided any criteria for
> what "useless" is. Do you care about e-mails? What about
> accents? Unicode? The list gets pretty endless.
>
> You should be able to write a regex that removes
> everything non-alphanumeric or some such, for instance,
> although even that is a problem if you're indexing anything but
> plain-vanilla English. The Java pre-defined '\w', for instance,
> refers to [a-zA-Z_0-9]. Nary an accented character in sight.
>
> Best,
> Erick
>
> On Tue, Apr 23, 2013 at 3:53 PM, Manuel Le Normand
> <manuel.lenorm...@gmail.com> wrote:
> > Hi there,
> > Looking at one of my shards (about 1M docs) I see a lot of unique terms,
> > more than 8M, which is a significant part of my total term count. These
> > are very likely useless terms, binaries or other meaningless numbers
> > that come with a few of my docs.
> > I am totally fine with deleting them so that these terms would be
> > unsearchable. Thinking about it, I get that:
> > 1. It is impossible to know a priori whether a term is unique or not,
> > so I cannot add them to my stop words.
> > 2. I have a performance decrease because my cached chunks contain
> > useless data, and I'm short on memory.
> >
> > Assuming a constant index, is there a way of deleting all terms that
> > are unique from at least the dictionary .tim and .tip files? Will I get
> > a significant query-time performance increase? Does anybody know a
> > class of regexes that identifies meaningless terms that I can add to my
> > updateProcessor?
> >
> > Thanks,
> > Manu
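[Editor's note: the core of Erick's "build the stopword file once, then re-index" suggestion is collecting every term whose document frequency is 1. A minimal sketch of that selection logic, using a plain term-to-docFreq map as a stand-in (in a real index the counts would come from walking Lucene's terms dictionary; the class and term names here are made up for illustration):]

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class UniqueTermStopwords {
    // Terms with docFreq == 1 across the whole index are "useless" by Manu's
    // definition; collect them, sorted, as stopword-file candidates.
    static List<String> uniqueTerms(Map<String, Integer> docFreq) {
        return docFreq.entrySet().stream()
                .filter(e -> e.getValue() == 1)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy term -> docFreq table; in practice you would iterate the terms
        // dictionary of the static index and read each term's docFreq.
        Map<String, Integer> df = Map.of(
                "solr", 412, "index", 398, "x9f3ab77", 1, "0xdeadbeef", 1);
        System.out.println(uniqueTerms(df)); // prints [0xdeadbeef, x9f3ab77]
    }
}
```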
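[Editor's note: Erick's PatternReplaceCharFilterFactory idea would sit in the field type's analysis chain in `schema.xml`. A hedged sketch, where the fieldType name and the pattern (here, stripping long hex-looking runs such as binary junk) are invented for illustration, not a recommended rule:]

```xml
<!-- Hypothetical field type: strip long hex/binary-looking tokens from the
     character stream before tokenization. The pattern is illustrative only;
     defining "useless" for your data is the hard part. -->
<fieldType name="text_cleaned" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b[0-9a-fA-F]{16,}\b" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note this only cleans documents at index time, so per Erick's point it helps on the input side of a re-index; it does nothing for terms already in a static index.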
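[Editor's note: Erick's caveat that Java's predefined `\w` is only `[a-zA-Z_0-9]` is easy to verify; a small self-contained check, with a Unicode-aware alternative character class:]

```java
public class WordClassDemo {
    // True iff the whole string matches Java's ASCII-only \w class.
    static boolean asciiWord(String s)   { return s.matches("\\w+"); }
    // Unicode-aware alternative: any letter or decimal digit, plus underscore.
    static boolean unicodeWord(String s) { return s.matches("[\\p{L}\\p{Nd}_]+"); }

    public static void main(String[] args) {
        // \w is [a-zA-Z_0-9]: the accented 'é' falls outside it.
        System.out.println(asciiWord("café"));   // false
        System.out.println(unicodeWord("café")); // true
    }
}
```

So a "remove everything non-alphanumeric" regex built on `\w` would mangle accented text; `\p{L}` (or compiling with `Pattern.UNICODE_CHARACTER_CLASS`) avoids that.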