Yes, significant terms have been calculated but they have the anomaly or
relative shift nature rather than the high frequency, as suggested also by
the blog post. So, it looks that adding a preprocessing step upstream in an
additional field makes more sense in this case. The text is intrinsically
not straightforward to parse (short free text) using mainstream NLP though.

A.

On Fri, May 15, 2020, 8:43 PM Walter Underwood <wun...@wunderwood.org>
wrote:

> Right. I might use NLP to pull out noun phrases and entities. Entities are
> essential noun phrases with proper nouns.
>
> Put those in a separate field and build the word cloud from that.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On May 15, 2020, at 11:39 AM, Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
> >
> > You may want something more like "significant terms" - terms
> statistically
> > significant in a document. Possibly not just based on doc freq
> >
> > https://saumitra.me/blog/solr-significant-terms/
> >
> > On Fri, May 15, 2020 at 2:16 PM A Adel <aa.0...@gmail.com> wrote:
> >
> >> Hi Walter,
> >>
> >> Thank you for your explanation, I understand the point and agree with
> you.
> >> However, the use case at hand is building a word cloud based on faceting
> >> the multilingual text field (very simple) which in case of not using
> stop
> >> words returns many generic terms, articles, etc. If stop words filter is
> >> not used, is there any other/better technique to be used instead to
> build a
> >> meaningful word cloud?
> >>
> >>
> >> On Fri, May 15, 2020, 5:20 PM Walter Underwood <wun...@wunderwood.org>
> >> wrote:
> >>
> >>> Just don’t use stop words. That will give much better relevance and
> works
> >>> for all languages.
> >>>
> >>> Stop words are an obsolete hack from the days of search engines running
> >>> on 16 bit CPUs. They save space by throwing away important information.
> >>>
> >>> The classic example is “to be or not to be”, which is made up entirely
> of
> >>> stop words. Remove them and it is impossible to search for that phrase.
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
> >>>> On May 14, 2020, at 10:47 PM, A Adel <aa.0...@gmail.com> wrote:
> >>>>
> >>>> Hi - Is there a way to configure stop words to be dynamic for each
> >>> document
> >>>> based on the language detected of a multilingual text field? Combining
> >>> all
> >>>> languages stop words in one set is a possibility however it introduces
> >>>> false positives for some language combinations, such as German and
> >>> English.
> >>>> Thanks, A.
> >>>
> >>>
> >>
> >
> >
> > --
> > *Doug Turnbull **| CTO* | OpenSource Connections
> > <http://opensourceconnections.com>, LLC | 240.476.9983
> > Author: Relevant Search <http://manning.com/turnbull>; Contributor: *AI
> > Powered Search <http://aipoweredsearch.com>*
> > This e-mail and all contents, including attachments, is considered to be
> > Company Confidential unless explicitly stated otherwise, regardless
> > of whether attachments are marked as such.
>
>

Reply via email to