another add on, as the previous two were pretty much spot on: https://www.google.com/search?rlz=1C5CHFA_enUS814US819&sxsrf=ACYBGNTi2tQTQH6TycDKwRNEn9g2km9awg%3A1570632176627&ei=8PGdXa7tJeem_QaatJ_oAg&q=drive+in&oq=drive+in&gs_l=psy-ab.3..0l10.35669.36730..37042...0.4..1.434.1152.4j3j4-1......0....1..gws-wiz.......0i71j35i39j0i273j0i67j0i131j0i273i70i249.agjl1cqAyog&ved=0ahUKEwiupdfntI_lAhVnU98KHRraBy0Q4dUDCAs&uact=5
vs
https://www.google.com/search?rlz=1C5CHFA_enUS814US819&sxsrf=ACYBGNRFNjzWADDR7awohPfgg8qGXqOlmg%3A1570632182338&ei=9vGdXZ2VFKW8ggeuw73IDQ&q=drive+on&oq=drive+on&gs_l=psy-ab.3..0l10.35301.37396..37917...0.4..0.83.590.8....2..0....1..gws-wiz.......0i71j35i39j0i273j0i131j0i67j0i3.34FIDQtvfOE&ved=0ahUKEwid6LPqtI_lAhUlnuAKHa5hD9kQ4dUDCAs&uact=5

On Wed, Oct 9, 2019 at 10:41 AM Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> Stopwords (it was discussed on the mailing list several times, I recall):
> The idea is that it used to be part of the tricks to make the index
> as small as possible to allow faster search. Stopwords being the most
> common words....
> These days, disk space is not an issue most of the time and there have
> been many optimizations to make stopwords less relevant. Plus, like
> you said, sometimes the stopword management actively gets in the way.
> Here is an interesting - if old - article about it too:
>
> https://library.stanford.edu/blogs/digital-library-blog/2011/12/stopwords-searchworks-be-or-not-be
>
> Regards,
> Alex.
>
> On Wed, 9 Oct 2019 at 09:39, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
> >
> > Hey Alex,
> >
> > Thank you!
> >
> > Re: stopwords being a thing of the past due to the affordability of
> > hardware...can you expand? I'm not sure I understand.
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/8/19, 1:01 PM, "David Hastings" <hastings.recurs...@gmail.com>
> > wrote:
> >
> > Another thing to add to the above,
> > >
> > > IT:ibm. In this case, we would want to maintain the colon and the
> > > capitalization (otherwise “it” would be taken out as a stopword).
> > >
> > stopwords are a thing of the past at this point. there is no benefit to
> > using them now with hardware being so cheap.
> >
> > On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <arafa...@gmail.com>
> > wrote:
> >
> > > If you don't want it to be touched by a tokenizer, how would the
> > > protection step know that the sequence of characters you want to
> > > protect is "IT:ibm" and not "this is an IT:ibm term I want to
> > > protect"?
> > >
> > > What it sounds like to me is that you may want to:
> > > 1) copyField to a second field
> > > 2) Apply a much lighter (whitespace?) tokenizer to that second field
> > > 3) Run the results through something like KeepWordFilterFactory
> > > 4) Search both fields with a boost on the second, higher-signal field
> > >
> > > The other option is to run a CharacterFilter
> > > (PatternReplaceCharFilterFactory), which is pre-tokenizer, to map known
> > > complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> > > term365". As long as it is done on both indexing and query, they will
> > > still match. You may have to have a bunch of them or write some sort
> > > of lookup map.
> > >
> > > Regards,
> > > Alex.
> > >
> > > On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> > > audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
> > > >
> > > > Hi All,
> > > >
> > > > This is likely a rudimentary question, but I can’t seem to find a
> > > > straightforward answer on forums or the documentation…is there a way to
> > > > protect tokens from ANY analysis? I know things like the
> > > > KeywordMarkerFilterFactory protect tokens from stemming, but we have some
> > > > terms we don’t even want our tokenizer to touch. Mostly, these are
> > > > IBM-specific acronyms, such as IT:ibm. In this case, we would want to
> > > > maintain the colon and the capitalization (otherwise “it” would be taken
> > > > out as a stopword).
> > > >
> > > > Any advice is appreciated!
> > > >
> > > > Thank you,
> > > > Audrey
> > > >
> > > > --
> > > > Audrey Lorberfeld
> > > > Data Scientist, w3 Search
> > > > IBM
> > > > audrey.lorberf...@ibm.com
> > > >
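
For anyone reading the archive later: the pre-tokenizer mapping idea from Alex's reply (replace known acronyms such as "IT:ibm" with a placeholder like "term365" on both the index and query side) can be illustrated with a small plain-Python sketch. This is not Solr code and does not call any Solr API. The names PROTECTED, STOPWORDS, char_filter and analyze are hypothetical, and the toy analysis chain (lowercase, split on non-alphanumerics, drop stopwords) is only an assumption standing in for whatever analyzer the field really uses.

```python
import re

# Hypothetical lookup map: protected term -> placeholder that the
# tokenizer cannot split (only letters and digits).
PROTECTED = {"IT:ibm": "term365"}

# Hypothetical stopword list for the toy analyzer below.
STOPWORDS = {"it", "is", "an", "a", "the", "this", "to", "i"}


def char_filter(text):
    """Pre-tokenizer step: swap each protected term for its placeholder.

    The lookarounds only match a term delimited by whitespace or the
    ends of the string, and the match is case-sensitive on purpose.
    """
    for term, placeholder in PROTECTED.items():
        pattern = r"(?<!\S)" + re.escape(term) + r"(?!\S)"
        text = re.sub(pattern, placeholder, text)
    return text


def analyze(text, protect=True):
    """Toy analysis chain: optional char filter, lowercase, split, stopwords."""
    if protect:
        text = char_filter(text)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    doc = "this is an IT:ibm term I want to protect"
    print(analyze(doc, protect=False))  # "it" is lost as a stopword
    print(analyze(doc))                 # placeholder survives analysis
    print(analyze("IT:ibm"))            # query side gets the same mapping
```

Because the same mapping runs on the document and on the query, both sides end up with the same placeholder token and still match, which is the point made above about applying it at both indexing and query time.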