Re: Too many unique terms

2013-04-24 Thread Manuel LeNormand
Hey Erick, thanks for the interesting reply. Indexing Unicode characters is not a problem I see, nor is indexing mails. I'm alright with treating as useless any word that is unique across my whole index. I will try the reindexing strategy you proposed, though, as you said, having a few million of stop words

Re: Too many unique terms

2013-04-24 Thread Erick Erickson
Even if you could know ahead of time, 7M stop words is a lot to maintain. But assuming that your index is really pretty static, you could consider building it once, then creating the stopword file from unique terms and re-indexing. You could consider cleaning them on the input side or creating a c
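
A rough sketch of the build-once, extract, re-index idea, in case it helps. It assumes you have already exported (term, document frequency) pairs from the finished index by some means; the export step is not shown, and the names select_rare_terms, write_stopword_file and the max_doc_freq threshold are made up for illustration. It also assumes the usual stopword file layout of one term per line with "#" starting a comment line.

```python
from typing import Iterable, List, Tuple


def select_rare_terms(
    term_doc_freqs: Iterable[Tuple[str, int]],
    max_doc_freq: int = 1,
) -> List[str]:
    """Return the terms that occur in at most max_doc_freq documents.

    term_doc_freqs is a hypothetical export of (term, doc_freq) pairs
    taken from the index after the first full build.
    """
    rare = {term for term, doc_freq in term_doc_freqs if doc_freq <= max_doc_freq}
    # Sort so the generated file is stable between runs.
    return sorted(rare)


def write_stopword_file(path: str, terms: Iterable[str]) -> None:
    """Write one term per line, the layout assumed for the stopword file."""
    with open(path, "w", encoding="utf-8") as handle:
        for term in terms:
            # Assumption: a leading '#' marks a comment line, so such a
            # term could not be expressed as a stopword and is skipped.
            if term.startswith("#"):
                continue
            handle.write(term + "\n")


if __name__ == "__main__":
    # Toy data standing in for a real term export.
    sample = [("solr", 120), ("x9f3a7", 1), ("index", 45), ("00af13", 1)]
    write_stopword_file("stopwords_generated.txt", select_rare_terms(sample))
```

With several million terms the set and the sort still fit comfortably in memory on most machines, but the resulting file is exactly the large stopword list discussed above, so the maintenance concern still applies.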

Too many unique terms

2013-04-23 Thread Manuel Le Normand
Hi there, Looking at one of my shards (about 1M docs) I see a lot of unique terms, more than 8M, which is a significant part of my total term count. These are very likely useless terms, binaries or other meaningless numbers that come with a few of my docs. I am totally fine with deleting them so these t