You do not need stop words to do what you need to do.  For one thing, stop
words require segmentation on a phrase-by-phrase basis in some cases.
That is, especially in places like Europe, there is a lot of mixed-language
text. (Your mileage may vary :).

In order to do what you want, you really need to look at the statistical
value of all of the symbols in the universe of consideration.  Find the
relevant terms, then throw out common terms and anything with a frequency
below 5.  This approach is language independent, slang independent, and
fairly medium independent.  If you need a more refined space, you can build
the symbol space from bigrams.
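
Roughly, in code (a minimal sketch; the whitespace tokenization and the
cutoffs are placeholders you would tune against your own corpus):

    from collections import Counter

    def significant_terms(docs, min_freq=5, max_doc_ratio=0.5, bigrams=False):
        """Pick terms by corpus statistics alone: drop anything too rare
        (total frequency below min_freq) or too common (appearing in more
        than max_doc_ratio of the documents).  No stop word lists needed."""
        tokenized = [d.lower().split() for d in docs]
        if bigrams:
            tokenized = [list(zip(t, t[1:])) for t in tokenized]
        term_freq = Counter(t for toks in tokenized for t in toks)
        doc_freq = Counter(t for toks in tokenized for t in set(toks))
        return {t: f for t, f in term_freq.items()
                if f >= min_freq and doc_freq[t] <= max_doc_ratio * len(docs)}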

If I ever write a book, the title is going to be "The The".  I hope it has
multilingual translations.  Although, at this point, it is a very short
book :/

tim

On Fri, May 15, 2020 at 11:43 AM Walter Underwood <wun...@wunderwood.org>
wrote:

> Right. I might use NLP to pull out noun phrases and entities. Entities are
> essentially noun phrases with proper nouns.
>
> Put those in a separate field and build the word cloud from that.
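>
> (A minimal sketch of that idea, assuming spaCy as the NLP library; the
> model name and the "phrases" field are illustrative, not prescriptive:)
>
>     import spacy
>
>     nlp = spacy.load("en_core_web_sm")  # one model per language
>
>     def extract_phrases(text):
>         doc = nlp(text)
>         # Noun chunks plus named entities; entities are essentially
>         # noun phrases built around proper nouns.
>         return ([c.text for c in doc.noun_chunks] +
>                 [e.text for e in doc.ents])
>
>     # Index the result into a separate field, e.g. {"phrases": [...]},
>     # and facet on that field to build the word cloud.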
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On May 15, 2020, at 11:39 AM, Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
> >
> > You may want something more like "significant terms" - terms that are
> > statistically significant in a document, possibly not just based on
> > document frequency.
> >
> > https://saumitra.me/blog/solr-significant-terms/
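> >
> > (For instance, a sketch against Solr's significantTerms streaming
> > expression; the collection name, field name, and thresholds are made
> > up and would need tuning:)
> >
> >     import requests
> >
> >     expr = ('significantTerms(mycollection, q="*:*", field="body_txt", '
> >             'minDocFreq="5", maxDocFreq="0.3", limit="100")')
> >     resp = requests.post(
> >         "http://localhost:8983/solr/mycollection/stream",
> >         data={"expr": expr})
> >     for term in resp.json()["result-set"]["docs"]:
> >         print(term)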
> >
> > On Fri, May 15, 2020 at 2:16 PM A Adel <aa.0...@gmail.com> wrote:
> >
> >> Hi Walter,
> >>
> >> Thank you for your explanation; I understand the point and agree with
> >> you. However, the use case at hand is building a word cloud by faceting
> >> on the multilingual text field (very simple), which without stop words
> >> returns many generic terms, articles, etc. If a stop word filter is not
> >> used, is there another, better technique for building a meaningful word
> >> cloud?
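> >>
> >> (The current approach is roughly the following, with made-up collection
> >> and field names; Solr returns facet counts as a flat [term, count, ...]
> >> list:)
> >>
> >>     import requests
> >>
> >>     resp = requests.get(
> >>         "http://localhost:8983/solr/mycollection/select",
> >>         params={"q": "*:*", "rows": 0, "facet": "true",
> >>                 "facet.field": "text_all", "facet.limit": 100})
> >>     counts = resp.json()["facet_counts"]["facet_fields"]["text_all"]
> >>     cloud = dict(zip(counts[::2], counts[1::2]))
> >>     # Without stop words, the top buckets here are articles and
> >>     # other generic terms, hence the question.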
> >>
> >>
> >> On Fri, May 15, 2020, 5:20 PM Walter Underwood <wun...@wunderwood.org>
> >> wrote:
> >>
> >>> Just don’t use stop words. That will give much better relevance and
> >>> works for all languages.
> >>>
> >>> Stop words are an obsolete hack from the days of search engines running
> >>> on 16-bit CPUs. They save space by throwing away important information.
> >>>
> >>> The classic example is “to be or not to be”, which is made up entirely
> >>> of stop words. Remove them and it is impossible to search for that
> >>> phrase.
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
> >>>> On May 14, 2020, at 10:47 PM, A Adel <aa.0...@gmail.com> wrote:
> >>>>
> >>>> Hi - Is there a way to configure stop words to be dynamic for each
> >>>> document, based on the detected language of a multilingual text field?
> >>>> Combining all languages' stop words into one set is a possibility;
> >>>> however, it introduces false positives for some language combinations,
> >>>> such as German and English.
> >>>> Thanks, A.
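> >>>>
> >>>> (One common approach: detect the language at index time and route the
> >>>> text to per-language fields, each with its own stop word list in its
> >>>> analysis chain. A sketch with the langdetect package and made-up
> >>>> field names:)
> >>>>
> >>>>     from langdetect import detect
> >>>>
> >>>>     def route_by_language(doc):
> >>>>         # Send text to text_en, text_de, ... so each field's analyzer
> >>>>         # can apply its own language-specific stop word list.
> >>>>         lang = detect(doc["text"])
> >>>>         field = "text_" + lang if lang in ("en", "de") else "text_general"
> >>>>         doc[field] = doc.pop("text")
> >>>>         return doc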
> >>>
> >>>
> >>
> >
> >
> > --
> > *Doug Turnbull **| CTO* | OpenSource Connections
> > <http://opensourceconnections.com>, LLC | 240.476.9983
> > Author: Relevant Search <http://manning.com/turnbull>; Contributor: *AI
> > Powered Search <http://aipoweredsearch.com>*
>
>
