Re: Re: Protecting Tokens from Any Analysis

David Hastings Wed, 09 Oct 2019 07:56:10 -0700

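One more addition, since the previous two replies were pretty much spot on.
Compare the results for "drive in" and "drive on", which a typical English
stopword list would reduce to the same query: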

https://www.google.com/search?rlz=1C5CHFA_enUS814US819&sxsrf=ACYBGNTi2tQTQH6TycDKwRNEn9g2km9awg%3A1570632176627&ei=8PGdXa7tJeem_QaatJ_oAg&q=drive+in&oq=drive+in&gs_l=psy-ab.3..0l10.35669.36730..37042...0.4..1.434.1152.4j3j4-1......0....1..gws-wiz.......0i71j35i39j0i273j0i67j0i131j0i273i70i249.agjl1cqAyog&ved=0ahUKEwiupdfntI_lAhVnU98KHRraBy0Q4dUDCAs&uact=5


vs

https://www.google.com/search?rlz=1C5CHFA_enUS814US819&sxsrf=ACYBGNRFNjzWADDR7awohPfgg8qGXqOlmg%3A1570632182338&ei=9vGdXZ2VFKW8ggeuw73IDQ&q=drive+on&oq=drive+on&gs_l=psy-ab.3..0l10.35301.37396..37917...0.4..0.83.590.8....2..0....1..gws-wiz.......0i71j35i39j0i273j0i131j0i67j0i3.34FIDQtvfOE&ved=0ahUKEwid6LPqtI_lAhUlnuAKHa5hD9kQ4dUDCAs&uact=5

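To make the comparison concrete, here is a minimal sketch of what stopword removal does to those two queries. The stopword set and the `analyze` helper are hypothetical stand-ins for illustration, not Solr's actual StopFilterFactory or its default word list.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "in", "on", "it", "of", "to"}

def analyze(text, remove_stopwords):
    """Lowercase, split on whitespace, and optionally drop stopwords."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

# With stopword removal, both queries reduce to the same token list.
print(analyze("drive in", True), analyze("drive on", True))

# Without it, the two queries stay distinguishable.
print(analyze("drive in", False), analyze("drive on", False))
```

Under these assumptions the two queries become indistinguishable once "in" and "on" are dropped, which is the point of the comparison.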

On Wed, Oct 9, 2019 at 10:41 AM Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> Stopwords (this has been discussed on the mailing list several times, I recall):
> The idea is that removing them used to be one of the tricks to make the index
> as small as possible to allow faster search, stopwords being the most
> common words.
> These days, disk space is not an issue most of the time and there have
> been many optimizations that make stopword removal less necessary. Plus, like
> you said, sometimes the stopword management actively gets in the way.
> Here is an interesting - if old - article about it too:
>
> https://library.stanford.edu/blogs/digital-library-blog/2011/12/stopwords-searchworks-be-or-not-be
>
> Regards,
>    Alex.
>
> On Wed, 9 Oct 2019 at 09:39, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
> >
> > Hey Alex,
> >
> > Thank you!
> >
> > Re: stopwords being a thing of the past due to the affordability of
> > hardware...can you expand? I'm not sure I understand.
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/8/19, 1:01 PM, "David Hastings" <hastings.recurs...@gmail.com> wrote:
> >
> >     Another thing to add to the above,
> >     >
> >     > IT:ibm. In this case, we would want to maintain the colon and the
> >     > capitalization (otherwise “it” would be taken out as a stopword).
> >     >
> >     stopwords are a thing of the past at this point.  there is no benefit to
> >     using them now with hardware being so cheap.
> >
> >     On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <arafa...@gmail.com>
> >     wrote:
> >
> >     > If you don't want it to be touched by a tokenizer, how would the
> >     > protection step know that the sequence of characters you want to
> >     > protect is "IT:ibm" and not "this is an IT:ibm term I want to
> >     > protect"?
> >     >
> >     > It sounds to me like you may want to:
> >     > 1) copyField to a second field
> >     > 2) Apply a much lighter (whitespace?) tokenizer to that second field
> >     > 3) Run the results through something like KeepWordFilterFactory
> >     > 4) Search both fields with a boost on the second, higher-signal field
> >     >
> >     > The other option is to run a CharFilter
> >     > (PatternReplaceCharFilterFactory), which runs before the tokenizer, to map
> >     > known complex acronyms to non-tokenizable substitutions, e.g. "IT:ibm" ->
> >     > "term365". As long as it is done at both index and query time, they will
> >     > still match. You may have to have a bunch of them or write some sort
> >     > of lookup map.
> >     >
> >     > Regards,
> >     >    Alex.
> >     >
> >     > On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> >     > audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
> >     > >
> >     > > Hi All,
> >     > >
> >     > > This is likely a rudimentary question, but I can’t seem to find a
> >     > > straight-forward answer on forums or the documentation… is there a way to
> >     > > protect tokens from ANY analysis? I know things like the
> >     > > KeywordMarkerFilterFactory protect tokens from stemming, but we have some
> >     > > terms we don’t even want our tokenizer to touch. Mostly, these are
> >     > > IBM-specific acronyms, such as IT:ibm. In this case, we would want to
> >     > > maintain the colon and the capitalization (otherwise “it” would be taken
> >     > > out as a stopword).
> >     > >
> >     > > Any advice is appreciated!
> >     > >
> >     > > Thank you,
> >     > > Audrey
> >     > >
> >     > > --
> >     > > Audrey Lorberfeld
> >     > > Data Scientist, w3 Search
> >     > > IBM
> >     > > audrey.lorberf...@ibm.com
> >     > >
> >     >
> >
> >
>

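The pre-tokenizer substitution idea quoted above (mapping "IT:ibm" to a placeholder such as "term365" on both the index and query side) can be sketched roughly as follows. This is a plain-Python simulation under stated assumptions, not Solr configuration: the `PROTECTED` map, the stopword set, and the `char_filter`, `tokenize`, and `analyze` helpers are all hypothetical, and the tokenizer is only a crude stand-in for a standard analysis chain.

```python
import re

# Hypothetical lookup map; "term365" is the placeholder used in the thread.
PROTECTED = {"IT:ibm": "term365"}

# Longest keys first so overlapping acronyms match greedily.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(PROTECTED, key=len, reverse=True))
)

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"it", "is", "an", "i", "to", "this"}

def char_filter(text):
    """Replace protected acronyms before tokenization (case-sensitive)."""
    return _PATTERN.sub(lambda m: PROTECTED[m.group(0)], text)

def tokenize(text):
    """Crude tokenizer: lowercase, split on non-alphanumerics, drop stopwords."""
    parts = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in parts if t and t not in STOPWORDS]

def analyze(text):
    """Apply the same chain at index time and at query time."""
    return tokenize(char_filter(text))

# Without the char filter, the acronym is split and "it" is dropped.
print(tokenize("IT:ibm"))

# With it, the document and the query both yield the placeholder token.
print(analyze("this is an IT:ibm term I want to protect"))
print(analyze("IT:ibm"))
```

Because the substitution is case-sensitive and runs before tokenization, the colon and capitalization decide whether the placeholder is emitted, which is the behavior the original question asked for. The same map has to be applied at both index and query time for the terms to match.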