Right. I might use NLP to pull out noun phrases and entities. Entities are 
essentially noun phrases with proper nouns.

Put those in a separate field and build the word cloud from that.
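
A minimal sketch of that last step, assuming the extracted phrases were
indexed into a hypothetical field named phrases_ss and faceted with Solr's
flat facet layout (alternating term and count). The field name, the sample
response, and the font-size range are illustrative assumptions, not
something taken from this thread.

```python
import math

def facet_pairs(flat):
    """Turn a flat facet list [term, count, term, count, ...] into (term, count) pairs."""
    return list(zip(flat[0::2], flat[1::2]))

def cloud_weights(pairs, min_size=10, max_size=40):
    """Map facet counts to font sizes on a log scale so one huge count does not flatten the rest."""
    pairs = [(term, count) for term, count in pairs if count > 0]
    if not pairs:
        return {}
    logs = [math.log(count) for _, count in pairs]
    lo, hi = min(logs), max(logs)
    span = (hi - lo) or 1.0  # all counts equal: everything gets min_size
    return {
        term: round(min_size + (math.log(count) - lo) / span * (max_size - min_size))
        for term, count in pairs
    }

# Hypothetical fragment of a facet response for the assumed phrases_ss field.
response = {
    "facet_counts": {
        "facet_fields": {
            "phrases_ss": ["Berlin", 100, "search engine", 10, "Walter Underwood", 1]
        }
    }
}

pairs = facet_pairs(response["facet_counts"]["facet_fields"]["phrases_ss"])
sizes = cloud_weights(pairs)
```

Because the facet runs over the phrase field rather than the full text,
articles and other generic terms never reach the cloud, so no stop word
list is needed.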

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 15, 2020, at 11:39 AM, Doug Turnbull 
> <dturnb...@opensourceconnections.com> wrote:
> 
> You may want something more like "significant terms" - terms statistically
> significant in a document. Possibly not just based on doc freq
> 
> https://saumitra.me/blog/solr-significant-terms/
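
A rough sketch of the idea, not Solr's actual scoring: rank terms by how
much more often they occur in a foreground set of documents than in a
background set. The function name, the add-one smoothing, and the
min_fg_docs cutoff are assumptions made for illustration.

```python
from collections import Counter

def significant_terms(foreground, background, min_fg_docs=2):
    """Rank terms by foreground document rate divided by background document rate."""
    # Document frequency: count each term once per document.
    fg_df = Counter(term for doc in foreground for term in set(doc))
    bg_df = Counter(term for doc in background for term in set(doc))
    scores = {}
    for term, n_fg in fg_df.items():
        if n_fg < min_fg_docs:
            continue  # too rare in the foreground to trust
        fg_rate = n_fg / len(foreground)
        # Add-one smoothing so terms absent from the background do not divide by zero.
        bg_rate = (bg_df[term] + 1) / (len(background) + 1)
        scores[term] = fg_rate / bg_rate
    return sorted(scores, key=scores.get, reverse=True)

foreground = [["the", "solr", "facet"], ["the", "solr", "query"]]
background = [["the", "cat"], ["the", "dog"], ["the", "solr"], ["the", "bird"]]
ranked = significant_terms(foreground, background)
```

In this toy data "the" appears everywhere, so it scores no higher than the
background rate and ranks below "solr" without any stop word list.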
> 
> On Fri, May 15, 2020 at 2:16 PM A Adel <aa.0...@gmail.com> wrote:
> 
>> Hi Walter,
>> 
>> Thank you for your explanation, I understand the point and agree with you.
>> However, the use case at hand is building a word cloud based on faceting
>> the multilingual text field (very simple) which in case of not using stop
>> words returns many generic terms, articles, etc. If stop words filter is
>> not used, is there any other/better technique to be used instead to build a
>> meaningful word cloud?
>> 
>> 
>> On Fri, May 15, 2020, 5:20 PM Walter Underwood <wun...@wunderwood.org>
>> wrote:
>> 
>>> Just don’t use stop words. That will give much better relevance and works
>>> for all languages.
>>> 
>>> Stop words are an obsolete hack from the days of search engines running
>>> on 16 bit CPUs. They save space by throwing away important information.
>>> 
>>> The classic example is “to be or not to be”, which is made up entirely of
>>> stop words. Remove them and it is impossible to search for that phrase.
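
A small illustration of the point above, using a toy whitespace analyzer and
a short stop list. Both are assumptions for the sake of the example, not any
search engine's real analysis chain.

```python
# Illustrative stop list (assumption): a handful of common English stop words.
STOP_WORDS = frozenset({"a", "an", "and", "be", "is", "not", "of", "or", "the", "to"})

def analyze(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, and drop any token found in stop_words."""
    return [token for token in text.lower().split() if token not in stop_words]

query = "to be or not to be"
with_stops_removed = analyze(query, STOP_WORDS)  # every token is a stop word
without_filter = analyze(query)                  # all six tokens are kept
```

With the filter on, the analyzed query is empty, so there is nothing left to
match against the index. Without it, the phrase survives intact.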
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>>> On May 14, 2020, at 10:47 PM, A Adel <aa.0...@gmail.com> wrote:
>>>> 
>>>> Hi - Is there a way to configure stop words to be dynamic for each document
>>>> based on the language detected of a multilingual text field? Combining all
>>>> languages stop words in one set is a possibility however it introduces
>>>> false positives for some language combinations, such as German and English.
>>>> Thanks, A.
>>> 
>>> 
>> 
> 
> 
> -- 
> *Doug Turnbull **| CTO* | OpenSource Connections
> <http://opensourceconnections.com>, LLC | 240.476.9983
> Author: Relevant Search <http://manning.com/turnbull>; Contributor: *AI
> Powered Search <http://aipoweredsearch.com>*
