Hey Erick, thanks for the interesting reply.
Indexing Unicode characters is not a problem as I see it, nor is indexing
e-mails. I'm all right with defining as useless any word that is unique
across my whole index.

I will try the reindexing strategy you proposed, though, as you said, having
a few million stop words will not be an easy list to maintain. On top of
that, I will reduce the memory chunks that get saved to RAM, as most of them
are trash.
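As a starting point for the stopword file, here is a minimal Java sketch (the class and method names are mine, nothing official) of your idea of collecting the terms that occur in exactly one document, assuming the term/docFreq pairs have already been dumped out of the index, e.g. with Solr's TermsComponent or Luke:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class UniqueTermStopwords {

    // Given a term -> docFreq map (dumped beforehand), keep only the terms
    // that appear in a single document: the candidates for stopwords.txt.
    public static List<String> uniqueTerms(Map<String, Integer> docFreq) {
        List<String> stop = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() == 1) {
                stop.add(e.getKey());
            }
        }
        Collections.sort(stop); // stopwords.txt is one term per line
        return stop;
    }

    public static void main(String[] args) {
        Map<String, Integer> df =
                Map.of("solr", 42, "index", 7, "0xdeadbeef", 1);
        for (String t : uniqueTerms(df)) {
            System.out.println(t); // prints: 0xdeadbeef
        }
    }
}
```

With millions of candidate terms the dump itself is the hard part, of course; the filtering is trivial once you have it.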
As my problem seems to be very specific, I think I'll turn to the code and
see how I can do it on my own. Hope this adventure goes well.
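On the regex side, a quick sketch of the Unicode pitfall you pointed out: Java's \w is the ASCII class [a-zA-Z_0-9], while the Unicode categories \p{L} (letters) and \p{N} (numbers) also cover accented characters:

```java
import java.util.regex.Pattern;

public class UnicodeCleanup {
    // \w in Java is ASCII-only: [a-zA-Z_0-9]
    public static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    // \p{L} and \p{N} are full Unicode general categories
    public static final Pattern UNICODE_WORD =
            Pattern.compile("[\\p{L}\\p{N}]+");

    public static void main(String[] args) {
        System.out.println(ASCII_WORD.matcher("café").matches());   // false
        System.out.println(UNICODE_WORD.matcher("café").matches()); // true
    }
}
```

So a "remove everything non-alphanumeric" filter for non-English text would want [^\p{L}\p{N}] rather than \W as its pattern.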
Cheers,
Manu


On Wed, Apr 24, 2013 at 2:10 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Even if you could know ahead of time, 7M stop words is a
> lot to maintain. But assuming that your index is really
> pretty static, you could consider building it once, then
> creating the stopword file from unique terms and re-indexing.
>
> You could consider cleaning them on the input side or
> creating a custom filter that, say, checked against a dictionary
> (that you'd have to find).
>
> There's nothing that I know of that'll allow you to delete
> unique terms from a static index.
>
> About a regex, you could use PatternReplaceCharFilterFactory
> to remove them from your input stream, but the trick is defining
> "useless". Part numbers are really useful in some situations
> for instance. There's nothing "standard" because there's no
> standard. You haven't, for instance, provided any criteria for
> what "useless" is. Do you care about e-mails? What about
> accents? Unicode? The list gets pretty endless.
>
> You should be able to write a regex that removes
> everything non-alpha-numeric or some such for instance,
> although even that is a problem if you're indexing anything but
> plain-vanilla English. The Java pre-defined '\w', for instance,
> refers to [a-zA-Z_0-9]. Nary an accented character in sight.
>
>
> Best
> Erick
>
> On Tue, Apr 23, 2013 at 3:53 PM, Manuel Le Normand
> <manuel.lenorm...@gmail.com> wrote:
> > Hi there,
> > Looking at one of my shards (about 1M docs) i see lot of unique terms,
> more
> > than 8M which is a significant part of my total term count. These are
> very
> > likely useless terms, binaries or other meaningless numbers that come
> with
> > few of my docs.
> > I am totally fine with deleting them so these terms would be
> unsearchable.
> > Thinking about it i get that
> > 1. It is impossible apriori knowing if it is unique term or not, so i
> > cannot add them to my stop words.
> > 2. I have a performance decrease cause my cached chuncks do contain
> useless
> > data, and im short on memory.
> >
> > Assuming a constant index, is there a way of deleting all terms that are
> > unique from at least the dictionary tim and tip files? Will i get
> > significant query time performance increase? Does any body know a class
> of
> > regex that identify meaningless terms that i can add to my
> updateProcessor?
> >
> > Thanks
> > Manu
>