Hey Erick, thanks for the interesting reply.
Indexing Unicode characters is not a problem, I see, nor is indexing mail. I'm
all right with treating as useless any word that is unique across my whole index.
I will try the reindexing strategy you proposed, though, as you said, having a
few million stop words
Even if you could know ahead of time, 7M stop words is a
lot to maintain. But assuming that your index is really
pretty static, you could consider building it once, then
creating the stopword file from unique terms and re-indexing.
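
A rough sketch of that build-once, derive-the-stopwords step, in plain Python
rather than against any real Lucene/Solr API. It assumes "unique" means a term
that occurs in exactly one document, and that the documents are already
tokenized. The names unique_terms and write_stopwords and the sample data are
made up for illustration.

```python
from collections import Counter


def unique_terms(docs):
    """Return, sorted, the terms that occur in exactly one document.

    docs is an iterable of token lists (assumed to be tokenized already).
    """
    doc_freq = Counter()
    for tokens in docs:
        # Count each term once per document, not once per occurrence.
        doc_freq.update(set(tokens))
    return sorted(term for term, n in doc_freq.items() if n == 1)


def write_stopwords(terms, path):
    """Write one term per line, a common layout for stopword files."""
    with open(path, "w", encoding="utf-8") as fh:
        for term in terms:
            fh.write(term + "\n")


# Hypothetical sample corpus: three tiny "documents".
docs = [
    ["error", "in", "module", "x7f3a"],
    ["error", "in", "parser"],
    ["module", "parser", "q9zz"],
]
stop = unique_terms(docs)
```

Here stop holds only the two terms seen in a single document; passing it to
write_stopwords(stop, "stopwords.txt") would produce the file to use for the
second indexing pass. With millions of terms the Counter has to fit in memory,
so on a real index it may be more practical to read document frequencies from
the index itself.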
You could consider cleaning them on the input side or
creating a c