oh and by 'non stop' i mean close enough for me :)

On Wed, Oct 9, 2019 at 2:59 PM David Hastings <hastings.recurs...@gmail.com> wrote:

> if you have anything close to a decent server you won't notice it at all. i'm
> at about 21 million documents, index varies between 450gb and 800gb
> depending on merges, and about 60k searches a day and stay sub second non
> stop, and this is on a single core/non cloud environment
>
> On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>
>> Also, in terms of computational cost, it would seem that including most
>> terms/not having a stop list would take a toll on the system. For instance,
>> right now we have "ibm" as a stop word because it appears everywhere in our
>> corpus. If we did not include it in the stop words file, we would have to
>> retrieve every single document in our corpus and rank them. That's a high
>> computational cost, no?
>>
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> IBM
>> audrey.lorberf...@ibm.com
>>
>>
>> On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" <
>> audrey.lorberf...@ibm.com> wrote:
>>
>> Wow, thank you so much, everyone. This is all incredibly helpful insight.
>>
>> So, would it be fair to say that the majority of you all do NOT use
>> stop words?
>>
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> IBM
>> audrey.lorberf...@ibm.com
>>
>>
>> On 10/9/19, 11:14 AM, "David Hastings" <hastings.recurs...@gmail.com>
>> wrote:
>>
>> However, with all that said, stopwords CAN be useful in some situations. I
>> combine stopwords with the shingle factory to create "interesting phrases"
>> (not really) that i use in "my more like this" needs.
>> for example,
>> europe for vacation
>> europe on vacation
>> will create the shingle
>> europe_vacation
>> which i can then use to relate other documents that would be much
>> more similar in such regard, rather than just using the "interesting words"
>> europe, vacation
>>
>> with the stop words left in, the shingles would be
>> europe_for
>> for_vacation
>> and
>> europe_on
>> on_vacation
>>
>> just something to keep in mind, there's a lot of creative ways to use
>> stopwords depending on your needs. i use the above for a VERY basic ML
>> teacher and it works way better than using stopwords,
>>
>> On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>> > The theory behind stopwords is that they are “safe” to remove when
>> > calculating relevance, so we can squeeze every last bit of usefulness out
>> > of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve
>> > come a long way since then and the necessity of removing stopwords from the
>> > indexed tokens to conserve RAM and disk is much less relevant than it used
>> > to be in “the bad old days” when the idea of stopwords was invented.
>> >
>> > I’m not quite so confident as Alex that there is “no benefit”, but I’ll
>> > totally agree that you should remove stopwords only _after_ you have some
>> > evidence that removing them is A Good Thing in your situation.
>> >
>> > And removing stopwords leads to some interesting corner cases. Consider a
>> > search for “to be or not to be” if they’re all stopwords.
>> >
>> > Best,
>> > Erick
>> >
>> > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
>> > > audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>> > >
>> > > Hey Alex,
>> > >
>> > > Thank you!
>> > >
>> > > Re: stopwords being a thing of the past due to the affordability of
>> > > hardware...can you expand? I'm not sure I understand.
>> > >
>> > > --
>> > > Audrey Lorberfeld
>> > > Data Scientist, w3 Search
>> > > IBM
>> > > audrey.lorberf...@ibm.com
>> > >
>> > >
>> > > On 10/8/19, 1:01 PM, "David Hastings" <hastings.recurs...@gmail.com>
>> > > wrote:
>> > >
>> > > Another thing to add to the above,
>> > >>
>> > >> IT:ibm. In this case, we would want to maintain the colon and the
>> > >> capitalization (otherwise “it” would be taken out as a stopword).
>> > >>
>> > > stopwords are a thing of the past at this point. there is no benefit to
>> > > using them now with hardware being so cheap.
>> > >
>> > > On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <arafa...@gmail.com>
>> > > wrote:
>> > >
>> > >> If you don't want it to be touched by a tokenizer, how would the
>> > >> protection step know that the sequence of characters you want to
>> > >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
>> > >> protect"?
>> > >>
>> > >> What it sounds like to me is that you may want to:
>> > >> 1) copyField to a second field
>> > >> 2) Apply a much lighter (whitespace?) tokenizer to that second field
>> > >> 3) Run the results through something like KeepWordFilterFactory
>> > >> 4) Search both fields with a boost on the second, higher-signal field
>> > >>
>> > >> The other option is to run a CharFilter (PatternReplaceCharFilterFactory),
>> > >> which is pre-tokenizer, to map known complex acronyms to non-tokenizable
>> > >> substitutions. E.g. "IT:ibm -> term365". As long as it is done on both
>> > >> indexing and query, they will still match. You may have to have a bunch
>> > >> of them or write some sort of lookup map.
>> > >>
>> > >> Regards,
>> > >> Alex.
>> > >>
>> > >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
>> > >> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>> > >>>
>> > >>> Hi All,
>> > >>>
>> > >>> This is likely a rudimentary question, but I can’t seem to find a
>> > >>> straight-forward answer on forums or the documentation…is there a way to
>> > >>> protect tokens from ANY analysis? I know things like the
>> > >>> KeywordMarkerFilterFactory protect tokens from stemming, but we have some
>> > >>> terms we don’t even want our tokenizer to touch. Mostly, these are
>> > >>> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
>> > >>> maintain the colon and the capitalization (otherwise “it” would be taken
>> > >>> out as a stopword).
>> > >>>
>> > >>> Any advice is appreciated!
>> > >>>
>> > >>> Thank you,
>> > >>> Audrey
>> > >>>
>> > >>> --
>> > >>> Audrey Lorberfeld
>> > >>> Data Scientist, w3 Search
>> > >>> IBM
>> > >>> audrey.lorberf...@ibm.com
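
For anyone finding this thread later: the pre-tokenizer substitution Alex describes (map "IT:ibm" to something like "term365" at both index and query time) can be sketched outside Solr. What follows is only a rough Python illustration of the idea, not Solr code. The PROTECTED map, the toy STOPWORDS set, and the protect/analyze functions are all made up for the example; in Solr the substitution step would be a PatternReplaceCharFilterFactory placed in both the index and query analyzers.

```python
import re

# Hypothetical lookup map: protected acronym -> placeholder that survives
# tokenization. "term365" is the placeholder from Alex's example; the map
# itself is an assumption for illustration.
PROTECTED = {"IT:ibm": "term365"}

# Toy stop list, for illustration only.
STOPWORDS = {"it", "is", "an", "the", "this"}


def protect(text):
    """Pre-tokenizer step: swap each protected acronym for its placeholder.

    The match is case-sensitive on purpose, so "IT:ibm" is protected
    but a lowercase "it:ibm" is not.
    """
    for acronym, placeholder in PROTECTED.items():
        text = text.replace(acronym, placeholder)
    return text


def analyze(text, use_protection=True):
    """Toy analysis chain: optional char filter, lowercase, split, stop filter."""
    if use_protection:
        text = protect(text)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


doc = "this is an IT:ibm term I want to protect"
query = "IT:ibm"

# With the substitution on both sides, document and query share a token.
print(analyze(doc))
print(analyze(query))

# Without it, the colon splits the acronym and "it" is dropped as a stopword.
print(analyze(query, use_protection=False))
```

The point of running the same substitution on both sides is that the placeholder never has to be readable, it only has to be identical at index and query time. With many acronyms, the dictionary stands in for the "lookup map" Alex mentions.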