What I have done for this in the past is to calculate the expected value
of a symbol within a universe, then calculate the difference between that
expected value and the actual value at the time you see the symbol.  Take
the difference and use the most surprising symbols, in rank order from
most surprising to least surprising, dropping lower-frequency/unique
values.  This was a fairly length-independent way to get to interesting
tokens.
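
A minimal sketch of that idea, assuming a simple unigram model; the
corpus counts, the smoothing constant, and the cutoffs below are all
illustrative rather than anything I have shipped:

    from collections import Counter

    def surprising_tokens(doc_tokens, corpus_counts, corpus_total,
                          top_k=10, min_count=2):
        """Rank a document's tokens by how far their observed rate
        exceeds the rate expected from the corpus ("universe")."""
        doc_counts = Counter(doc_tokens)
        doc_total = sum(doc_counts.values())
        scored = []
        for token, observed in doc_counts.items():
            if observed < min_count:
                continue  # drop lower-frequency/unique values
            # expected rate from the universe, lightly smoothed for
            # tokens the corpus has never seen
            expected = corpus_counts.get(token, 0.5) / corpus_total
            actual = observed / doc_total
            scored.append((actual - expected, token))
        scored.sort(reverse=True)
        return [token for _, token in scored[:top_k]]

Common function words tend to score near zero here because their actual
rate tracks their expected rate, which is why this approach never needed
a stop word list.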

Most schemes built around stop words are difficult to maintain.  You can
manage 7 English stop words easily.  Then you go to a larger set, say
30ish, then another larger set, say 150.  The problem is that as you
remove stop words, you remove some meaning.  You will see an example of
this when you want to know the difference between 'a noun' and 'the
noun'.  Then, once you have covered English and chosen the optimal set of
stop words for a particular set of text, a new language comes along.
Eventually the stop words become a contributing source of error.  The
other reason not to use stop words is that a corpus is usually a form of
golden egg.  You might be able to reindex it, but the cost is usually not
free.  It is generally better to have an honest index and let the post
analysis change.  That way you can change it 10 times a day and no one
will care.
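
As a hedged sketch of what "honest index, changeable post analysis"
could look like: facet the raw field and do any filtering on the facet
response in the application, so the rule can change without a reindex.
The URL, core name, field name, and drop rule below are assumptions for
illustration, not anything from this thread:

    import requests

    SOLR_URL = "http://localhost:8983/solr/mycore/select"  # assumed core

    # Whatever we currently consider noise lives here, in code we can
    # change 10 times a day, not in the index's analysis chain.
    DROP = {"the", "a", "an", "und", "der"}

    def word_cloud_terms(query="*:*", field="text_all", limit=200):
        """Fetch top facet terms and filter them after the fact."""
        params = {
            "q": query,
            "rows": 0,
            "facet": "true",
            "facet.field": field,
            "facet.limit": limit,
            "wt": "json",
        }
        data = requests.get(SOLR_URL, params=params).json()
        flat = data["facet_counts"]["facet_fields"][field]
        # Solr returns facets as a flat [term, count, term, count, ...] list
        pairs = zip(flat[::2], flat[1::2])
        return [(term, count) for term, count in pairs if term not in DROP]

Because the filtering happens on the response rather than in the
analysis chain, swapping the rule (or replacing it with the surprise
ranking above) never touches the index.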

If you are interested in a word cloud, I would suspect people have
already done a reasonable job of this on top of a Solr index.

tim

On Fri, May 15, 2020 at 1:48 PM A Adel <aa.0...@gmail.com> wrote:

> Yes, significant terms have been calculated but they have the anomaly or
> relative shift nature rather than the high frequency, as suggested also by
> the blog post. So, it looks that adding a preprocessing step upstream in an
> additional field makes more sense in this case. The text is intrinsically
> not straightforward to parse (short free text) using mainstream NLP though.
>
> A.
>
> On Fri, May 15, 2020, 8:43 PM Walter Underwood <wun...@wunderwood.org>
> wrote:
>
> > Right. I might use NLP to pull out noun phrases and entities. Entities
> > are essentially noun phrases with proper nouns.
> >
> > Put those in a separate field and build the word cloud from that.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On May 15, 2020, at 11:39 AM, Doug Turnbull <
> > > dturnb...@opensourceconnections.com> wrote:
> > >
> > > You may want something more like "significant terms" - terms
> > > statistically significant in a document. Possibly not just based on doc freq
> > >
> > > https://saumitra.me/blog/solr-significant-terms/
> > >
> > > On Fri, May 15, 2020 at 2:16 PM A Adel <aa.0...@gmail.com> wrote:
> > >
> > >> Hi Walter,
> > >>
> > >> Thank you for your explanation, I understand the point and agree with
> > >> you. However, the use case at hand is building a word cloud based on
> > >> faceting the multilingual text field (very simple) which in case of not
> > >> using stop words returns many generic terms, articles, etc. If stop words
> > >> filter is not used, is there any other/better technique to be used
> > >> instead to build a meaningful word cloud?
> > >>
> > >>
> > >> On Fri, May 15, 2020, 5:20 PM Walter Underwood <wun...@wunderwood.org>
> > >> wrote:
> > >>
> > >>> Just don’t use stop words. That will give much better relevance and
> > >>> works for all languages.
> > >>>
> > >>> Stop words are an obsolete hack from the days of search engines running
> > >>> on 16 bit CPUs. They save space by throwing away important information.
> > >>>
> > >>> The classic example is “to be or not to be”, which is made up entirely
> > >>> of stop words. Remove them and it is impossible to search for that
> > >>> phrase.
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wun...@wunderwood.org
> > >>> http://observer.wunderwood.org/  (my blog)
> > >>>
> > >>>> On May 14, 2020, at 10:47 PM, A Adel <aa.0...@gmail.com> wrote:
> > >>>>
> > >>>> Hi - Is there a way to configure stop words to be dynamic for each
> > >>>> document based on the language detected of a multilingual text field?
> > >>>> Combining all languages stop words in one set is a possibility however
> > >>>> it introduces false positives for some language combinations, such as
> > >>>> German and English.
> > >>>> Thanks, A.
> > >>>
> > >>>
> > >>
> > >
> > >
> > > --
> > > *Doug Turnbull **| CTO* | OpenSource Connections
> > > <http://opensourceconnections.com>, LLC | 240.476.9983
> > > Author: Relevant Search <http://manning.com/turnbull>; Contributor: *AI
> > > Powered Search <http://aipoweredsearch.com>*
> >
> >
>
