Re: Dynamic Stopwords

2020-05-15 Thread Tim Casey
What I have done for this in the past is calculate the expected value of a symbol within a universe, then calculate the difference between the expected value and the actual value at the time you see a symbol. Take the difference and use the most surprising symbols, in rank order from most surpris
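
One possible reading of the approach above, as a minimal Python sketch. Everything here is an assumption for illustration: the "universe" is taken to be a tokenized corpus, the expected value is the corpus-wide rate of a term scaled to the document length, and the difference is divided by the square root of the expected count so that rare and common terms are comparable. The function name `rank_surprising_terms` and the add-one smoothing are illustrative choices, not something the message specifies.

```python
import math
from collections import Counter


def rank_surprising_terms(doc_tokens, corpus_docs, top_n=5):
    """Rank a document's terms by how far their observed count exceeds
    the count predicted from corpus-wide frequencies (most surprising first)."""
    corpus_counts = Counter(t for doc in corpus_docs for t in doc)
    corpus_total = sum(corpus_counts.values())
    vocab_size = len(corpus_counts)

    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)

    scores = {}
    for term, observed in doc_counts.items():
        # Add-one smoothing (assumption): terms unseen in the corpus
        # still get a small, non-zero expected rate.
        rate = (corpus_counts[term] + 1) / (corpus_total + vocab_size + 1)
        expected = rate * doc_len
        # Difference between actual and expected, normalized (assumption).
        scores[term] = (observed - expected) / math.sqrt(expected)

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
```

With a scoring like this, a term such as "the" that is frequent everywhere scores low even when it is frequent in the document, so no stop word list is needed to keep it out of the top ranks.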

Re: Dynamic Stopwords

2020-05-15 Thread A Adel
Yes, significant terms have been calculated, but they capture anomalies or relative shifts rather than high frequency, as the blog post also suggests. So it looks like adding a preprocessing step upstream, in an additional field, makes more sense in this case. The text is intrinsically n

Re: Dynamic Stopwords

2020-05-15 Thread Tim Casey
You do not need stop words to do what you need to do. For one thing, stop words require segmentation on a phrase-by-phrase basis in some cases. That is, especially in places like Europe, there is a lot of mixed language. (Your mileage may vary :). In order to do what you want, you really need t

Re: Dynamic Stopwords

2020-05-15 Thread Walter Underwood
Right. I might use NLP to pull out noun phrases and entities. Entities are essentially noun phrases with proper nouns. Put those in a separate field and build the word cloud from that. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On May 15, 2020, at 1

Re: Dynamic Stopwords

2020-05-15 Thread Doug Turnbull
You may want something more like "significant terms" - terms statistically significant in a document. Possibly not just based on doc freq https://saumitra.me/blog/solr-significant-terms/ On Fri, May 15, 2020 at 2:16 PM A Adel wrote: > Hi Walter, > > Thank you for your explanation, I understand

Re: Dynamic Stopwords

2020-05-15 Thread A Adel
Hi Walter, Thank you for your explanation. I understand the point and agree with you. However, the use case at hand is building a word cloud by faceting on the multilingual text field (very simple), which, without stop words, returns many generic terms, articles, etc. If stop words fi

Re: Dynamic Stopwords

2020-05-15 Thread Walter Underwood
Just don’t use stop words. That will give much better relevance and works for all languages. Stop words are an obsolete hack from the days of search engines running on 16-bit CPUs. They save space by throwing away important information. The classic example is “to be or not to be”, which is made
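
A small Python sketch of the information loss described above. The stop set is an assumption: a short list modeled on the default English stop words that ship with Lucene, used here only for illustration. The helper `remove_stop_words` is hypothetical and uses naive whitespace tokenization.

```python
# Illustrative stop set, modeled on Lucene's default English list (assumption).
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def remove_stop_words(text, stop_words=ENGLISH_STOP_WORDS):
    """Lowercase, split on whitespace, and drop every token in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


query = "to be or not to be"
remaining = remove_stop_words(query)
print(remaining)  # []
```

Every token of the phrase is on the list, so an index built with this filter has nothing left to match the query against.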

Dynamic Stopwords

2020-05-14 Thread A Adel
Hi - Is there a way to configure stop words to be dynamic for each document, based on the detected language of a multilingual text field? Combining all languages' stop words into one set is a possibility; however, it introduces false positives for some language combinations, such as German and English. T
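
A minimal Python sketch of the idea in the question, done as a preprocessing step outside the search engine. It assumes each document already carries a detected language code in a hypothetical `lang` key; the tiny stop lists and the function name `word_cloud_counts` are illustrative, not a Solr feature.

```python
from collections import Counter

# Tiny per-language stop lists, for illustration only (assumption).
STOP_WORDS_BY_LANG = {
    "en": {"the", "a", "an", "and", "of", "to", "is", "in"},
    "de": {"der", "die", "das", "und", "ist", "war", "in", "zu"},
}


def word_cloud_counts(docs):
    """Count terms across documents, filtering each document with the
    stop list of its own detected language only."""
    counts = Counter()
    for doc in docs:
        # Unknown languages get no filtering rather than a guessed list.
        stop_words = STOP_WORDS_BY_LANG.get(doc["lang"], set())
        tokens = doc["text"].lower().split()
        counts.update(tok for tok in tokens if tok not in stop_words)
    return counts
```

Given an English document containing "the war will die out" and a German one containing "die Suche war schnell", the English "war" and "die" are kept while the German "die" and "war" are dropped. A single merged stop set would remove them from both documents, which is the German/English false positive described above.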