Re: Too many unique terms

2013-04-24 Thread Manuel LeNormand
Hey Erick, thanks for the interesting reply. Indexing Unicode characters is not a problem, I see, nor is indexing mails. I'm all right with treating as useless any word that is unique across my whole index. I will try the reindexing strategy you proposed, though, as you said, having a few million stop words

Re: Too many unique terms

2013-04-24 Thread Erick Erickson
Even if you could know ahead of time, 7M stop words is a lot to maintain. But assuming that your index is really pretty static, you could consider building it once, then creating the stopword file from unique terms and re-indexing. You could consider cleaning them on the input side or creating a c
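
As a rough illustration of the "build once, derive the stopword file from unique terms, then re-index" idea, here is a minimal Python sketch. It works on plain strings rather than on a real index, and it assumes that "unique" means a term that appears in exactly one document. The tokenizer regex, the function names, and the output file name are all made up for the example; in practice the term list would come from the index itself.

```python
import re
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for text in docs:
        # A set ensures a term is counted once per document,
        # no matter how often it repeats inside that document.
        df.update(set(re.findall(r"\w+", text.lower())))
    return df


def singleton_terms(docs):
    """Return the sorted terms that occur in exactly one document.

    Assumption: these are the 'unique' terms we want to drop.
    """
    df = document_frequencies(docs)
    return sorted(term for term, count in df.items() if count == 1)


def write_stopwords(terms, path):
    """Write one term per line (hypothetical output file)."""
    with open(path, "w", encoding="utf-8") as fh:
        for term in terms:
            fh.write(term + "\n")


if __name__ == "__main__":
    sample = ["the quick fox", "the lazy dog", "quick dog xj9z"]
    write_stopwords(singleton_terms(sample), "stopwords_generated.txt")
```

With several million such terms the resulting file will be large, which is the maintenance concern raised above, so this only makes sense if the index really is close to static.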