Right. I might use NLP to pull out noun phrases and entities. Entities are essentially noun phrases with proper nouns.
Put those in a separate field and build the word cloud from that.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 15, 2020, at 11:39 AM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
>
> You may want something more like "significant terms" - terms statistically
> significant in a document. Possibly not just based on doc freq
>
> https://saumitra.me/blog/solr-significant-terms/
>
> On Fri, May 15, 2020 at 2:16 PM A Adel <aa.0...@gmail.com> wrote:
>
>> Hi Walter,
>>
>> Thank you for your explanation, I understand the point and agree with you.
>> However, the use case at hand is building a word cloud based on faceting
>> the multilingual text field (very simple) which in case of not using stop
>> words returns many generic terms, articles, etc. If stop words filter is
>> not used, is there any other/better technique to be used instead to build a
>> meaningful word cloud?
>>
>> On Fri, May 15, 2020, 5:20 PM Walter Underwood <wun...@wunderwood.org> wrote:
>>
>>> Just don’t use stop words. That will give much better relevance and works
>>> for all languages.
>>>
>>> Stop words are an obsolete hack from the days of search engines running
>>> on 16 bit CPUs. They save space by throwing away important information.
>>>
>>> The classic example is “to be or not to be”, which is made up entirely of
>>> stop words. Remove them and it is impossible to search for that phrase.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>>> On May 14, 2020, at 10:47 PM, A Adel <aa.0...@gmail.com> wrote:
>>>>
>>>> Hi - Is there a way to configure stop words to be dynamic for each document
>>>> based on the language detected of a multilingual text field? Combining all
>>>> languages stop words in one set is a possibility however it introduces
>>>> false positives for some language combinations, such as German and English.
>>>> Thanks, A.
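
For anyone who wants to try the "significant terms" idea from the thread outside of Solr, here is a rough sketch in plain Python. It is not Solr's implementation. The function names, the smoothed log-ratio score, and the min_count cutoff are all assumptions made for illustration. The idea is only that a term belongs in the word cloud when it is more frequent in the documents of interest than in the collection as a whole, which pushes generic words down without a per-language stop word list.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on runs of word characters.
    # \w is Unicode-aware in Python 3, so this is not English-only.
    return re.findall(r"\w+", text.lower())


def significant_terms(foreground_docs, background_docs, top_n=10, min_count=2):
    """Rank terms that are over-represented in the foreground documents.

    Hypothetical helper: the score is count * log(p_foreground / p_background)
    with add-one smoothing. This is one possible choice, not Solr's formula.
    """
    fg = Counter(t for doc in foreground_docs for t in tokenize(doc))
    bg = Counter(t for doc in background_docs for t in tokenize(doc))
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    vocab_size = len(set(fg) | set(bg))

    scores = {}
    for term, count in fg.items():
        if count < min_count:
            continue  # skip terms too rare to say anything about
        p_fg = (count + 1) / (fg_total + vocab_size)
        p_bg = (bg[term] + 1) / (bg_total + vocab_size)
        scores[term] = count * math.log(p_fg / p_bg)

    # Highest score first: these are the word cloud candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    matching = [
        "the solr index stores the solr schema",
        "the solr query hits the index",
    ]
    # Background is the whole collection, including the matching documents.
    collection = matching + [
        "the cat sat on the mat",
        "the dog ate the bone",
        "the sun and the moon",
        "the river and the sea",
    ]
    for term, score in significant_terms(matching, collection, top_n=3):
        print(f"{term}\t{score:.3f}")
```

In this toy data "the" is the most frequent term in the matching documents, but it is just as common everywhere else, so it ends up with a negative score and falls below "solr" and "index".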