Re: Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
Yup. You're going to find Solr is WAY more efficient than you think when it comes to complex queries.

On Wed, Oct 9, 2019 at 3:17 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote:
> True...I guess another rub here is that we're using the edismax parser, so
> all of our queries are inherent

Re: Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
True...I guess another rub here is that we're using the edismax parser, so all of our queries are inherently OR queries. So for a query like 'the ibm way', the search engine would have to:

1) retrieve a document list for:
   --> "ibm" (this list is probably 80% of the documents)
   --> "the" (th
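To make the concern above concrete, here is a toy sketch of OR semantics over an inverted index. It is only an illustration under stated assumptions: the five-document corpus and the helper names `build_index` and `or_query` are hypothetical, and this is not how Solr or Lucene actually evaluates queries, which skips and scores far more cleverly.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def or_query(index, query):
    """Union the posting lists of every query term (OR semantics)."""
    result = set()
    for term in query.lower().split():
        # .get avoids creating empty entries for unseen terms
        result |= index.get(term, set())
    return result


# Hypothetical toy corpus; "ibm" and "the" are deliberately common.
docs = {
    1: "the ibm way of working",
    2: "ibm cloud pricing",
    3: "the quick brown fox",
    4: "ibm research the lab",
    5: "open source search",
}

index = build_index(docs)
posting_sizes = {t: len(index.get(t, set())) for t in "the ibm way".split()}
matches = or_query(index, "the ibm way")
```

In this toy version the common terms dominate the candidate set, which is the cost being worried about; whether it matters in a real deployment depends on the engine and the hardware.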

Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
If you have anything close to a decent server you won't notice it at all. I'm at about 21 million documents, the index varies between 450 GB and 800 GB depending on merges, with about 60k searches a day, and I stay sub-second non stop. This is on a single core/non-cloud environment.

On Wed, Oct 9, 2019 at 2:5

Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
Only in my More Like This tools, but they have a very specific purpose; otherwise, no.

On Wed, Oct 9, 2019 at 2:31 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote:
> Wow, thank you so much, everyone. This is all incredibly helpful insight.
>
> So, would it be fair to say that the majority o

Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
Oh, and by 'non stop' I mean close enough for me :)

On Wed, Oct 9, 2019 at 2:59 PM David Hastings wrote:
> If you have anything close to a decent server you won't notice it at all. I'm
> at about 21 million documents, index varies between 450 GB and 800 GB
> depending on merges, and about 60k searches a

Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Also, in terms of computational cost, it would seem that including most terms/not having a stop list would take a toll on the system. For instance, right now we have "ibm" as a stop word because it appears everywhere in our corpus. If we did not include it in the stop words file, we would have t
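As a rough illustration of the trade-off described above, the sketch below shows what a stop list does to the query 'the ibm way' at analysis time. The `analyze` helper and the two stop-word sets are assumptions made up for this example; it does not reproduce Solr's own stop filter or any real stopwords file.

```python
def analyze(text, stop_words):
    """Lowercase, split on whitespace, and drop any term in stop_words."""
    return [t for t in text.lower().split() if t not in stop_words]


# Hypothetical stop list that treats "ibm" as a stop word.
with_ibm_stopped = analyze("the IBM way", {"the", "ibm"})

# No stop list at all: every term survives and must be looked up.
without_stop_list = analyze("the IBM way", set())
```

With the stop list, only "way" reaches the index lookup; without it, all three terms do, so the engine has to walk the large posting lists for the common terms as well.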

Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Wow, thank you so much, everyone. This is all incredibly helpful insight.

So, would it be fair to say that the majority of you all do NOT use stop words?

--
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com

On 10/9/19, 11:14 AM, "David Hastings" wrote:

However,