What I have done for this in the past is to calculate the expected value of
a symbol within a universe, then calculate the difference between the
expected value and the actual value at the time you see the symbol. Take
the difference and use the most surprising symbols, in rank order from most
surprising to least surprising.
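A minimal sketch of that calculation in Python (the token lists and the simple actual-minus-expected score are my own assumptions here; a log-likelihood ratio would slot in the same way):

    from collections import Counter

    def surprise_scores(doc_tokens, corpus_tokens):
        """Rank terms in a document by how far their observed count
        exceeds what the corpus ("universe") would predict."""
        corpus_counts = Counter(corpus_tokens)
        corpus_total = sum(corpus_counts.values())
        doc_counts = Counter(doc_tokens)
        doc_total = sum(doc_counts.values())

        scores = {}
        for term, actual in doc_counts.items():
            expected = corpus_counts.get(term, 0) / corpus_total * doc_total
            scores[term] = actual - expected  # larger gap = more surprising
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)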
Yes, significant terms have been calculated, but they capture anomalies or
relative shifts rather than high frequency, as the blog post also suggests.
So it looks like adding a preprocessing step upstream, into an additional
field, makes more sense in this case. The text is intrinsically
n
You do not need stop words to do what you need to do. For one thing, stop
words require segmentation on a phrase-by-phrase basis in some cases.
That is, especially in places like Europe, there is a lot of mixed
language. (Your mileage may vary :).
In order to do what you want, you really need t
Right. I might use NLP to pull out noun phrases and entities. Entities are
essentially noun phrases with proper nouns.
Put those in a separate field and build the word cloud from that.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On May 15, 2020, at 1
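For the extraction Walter describes, a rough sketch using spaCy (the library choice, model, and field name are assumptions, not something from this thread):

    import spacy

    # Small English model; truly multilingual text would need per-language models.
    nlp = spacy.load("en_core_web_sm")

    def cloud_terms(text):
        """Collect noun phrases and named entities for a separate word-cloud field."""
        doc = nlp(text)
        phrases = [chunk.text.lower() for chunk in doc.noun_chunks]
        entities = [ent.text for ent in doc.ents]  # entities: noun phrases with proper nouns
        return phrases + entities

Index the returned terms into a dedicated field (say, cloud_terms) and build the word cloud by faceting on that field instead of the raw text.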
You may want something more like "significant terms" - terms statistically
significant in a document. Possibly not just based on doc freq
https://saumitra.me/blog/solr-significant-terms/
On Fri, May 15, 2020 at 2:16 PM A Adel wrote:
> Hi Walter,
>
> Thank you for your explanation, I understand
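A hedged sketch of calling that from Python via Solr's significantTerms streaming expression (the collection, query, field name, and thresholds are placeholders; see the blog post above and the Streaming Expressions docs for the parameters your Solr version supports):

    import requests

    # Foreground query vs. the whole collection as background (names are hypothetical).
    expr = (
        'significantTerms(articles, q="category:sports", field="cloud_terms", '
        'limit="50", minDocFreq="5", maxDocFreq="0.3")'
    )
    resp = requests.post("http://localhost:8983/solr/articles/stream",
                         data={"expr": expr})
    for doc in resp.json()["result-set"]["docs"]:
        print(doc)  # each doc carries a term and its score; the final doc is an EOF marker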
Hi Walter,
Thank you for your explanation; I understand the point and agree with you.
However, the use case at hand is building a word cloud by faceting on the
multilingual text field (very simple), which, when stop words are not used,
returns many generic terms, articles, etc. If stop words fi
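As a concrete picture of that faceting step (collection and field names are assumptions), the word-cloud counts come straight from a field facet, which is exactly where the generic terms show up when no stop words are applied:

    import requests

    params = {
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": "text_all",   # the multilingual text field (assumed name)
        "facet.limit": 100,
        "facet.mincount": 2,
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/articles/select", params=params)
    flat = resp.json()["facet_counts"]["facet_fields"]["text_all"]
    counts = dict(zip(flat[0::2], flat[1::2]))  # Solr returns [term1, count1, term2, count2, ...]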
Just don’t use stop words. That gives much better relevance and works
for all languages.
Stop words are an obsolete hack from the days of search engines running
on 16-bit CPUs. They save space by throwing away important information.
The classic example is “to be or not to be”, which is made up entirely of
stop words.
Hi - Is there a way to configure stop words dynamically for each document
based on the detected language of a multilingual text field? Combining all
languages' stop words into one set is a possibility; however, it introduces
false positives for some language combinations, such as German and English.
T
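For what the original question asks, one common pattern is to detect the language per document at index time and route the text into a language-specific field, so each field's analysis chain (including any stop word filter) matches the detected language. A rough sketch, assuming the langdetect package and hypothetical field names; Solr's own language identification update processors can do the same routing server-side:

    from langdetect import detect

    def language_routed_field(text):
        """Route text into a per-language field so its analyzer matches the language."""
        try:
            lang = detect(text)      # e.g. "en", "de"
        except Exception:
            lang = "general"
        return {f"text_{lang}": text}

    # e.g. language_routed_field("Der Hund ist im Garten") -> {"text_de": "..."}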