Hey Erick, thanks for the interesting reply.
Indexing Unicode characters is not a problem as far as I can see, nor is
indexing mails. I'm alright with treating as useless any word that is unique
across my whole index.
I will try the reindexing strategy you proposed, though, as you said, having a
few million stop words
Even if you could know ahead of time, 7M stop words is a
lot to maintain. But assuming that your index is really
pretty static, you could consider building it once, then
creating the stopword file from unique terms and re-indexing.
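
A rough sketch of that "build once, then create the stopword file" step, in
Python. It assumes you have already dumped (term, document frequency) pairs
from the finished index by whatever means you prefer; that export step is not
shown. The function names and the threshold parameter are made up for
illustration. The output is one term per line, the layout a stopwords.txt
file normally uses.

```python
from typing import Iterable, List, Tuple


def unique_terms(term_doc_freqs: Iterable[Tuple[str, int]],
                 max_doc_freq: int = 1) -> List[str]:
    """Return the terms whose document frequency is at most max_doc_freq.

    term_doc_freqs is assumed to come from an export of the built index;
    with the default threshold of 1 this keeps only terms found in a
    single document.
    """
    return sorted(term for term, df in term_doc_freqs if df <= max_doc_freq)


def format_stopword_file(terms: Iterable[str]) -> str:
    """Render the terms as stopword-file text, one term per line."""
    return "".join(term + "\n" for term in terms)


def write_stopword_file(terms: Iterable[str], path: str) -> None:
    """Write the stopword file as UTF-8 so non-ASCII terms survive."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_stopword_file(terms))


if __name__ == "__main__":
    # Hypothetical sample data standing in for a real term dump.
    sample = [("solr", 120000), ("x9f3ab77", 1), ("lucene", 98000), ("0004217", 1)]
    print(format_stopword_file(unique_terms(sample)), end="")
```

After writing the file you would point the stop filter at it and re-index.
Raising max_doc_freq would also catch terms that appear in only a handful of
documents, at the cost of an even larger file.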
You could consider cleaning them on the input side or
creating a c
Hi there,
Looking at one of my shards (about 1M docs), I see a lot of unique terms, more
than 8M, which is a significant part of my total term count. These are very
likely useless terms, binaries or other meaningless numbers that come with
a few of my docs.
I am totally fine with deleting them so these t