Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Paras Lehana Sun, 10 Nov 2019 23:59:52 -0800

Hi

> So I don't think removing it completely is the way to go from the scenario
> we have



Removing stopwords is a separate question. I'm curious to find the cause
assuming that you keep using stopwords. In some cases, stopwords really are
necessary.


> Quite a considerable increase


If q.alt is giving you results, that confirms your stopword filter is
working as expected. The problem lies in the configuration of edismax.
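
One way to check this is to send the same term to both handlers with
debug=query and compare the parsedquery values. Below is a minimal sketch,
not a definitive script. The base URL, the core name and the helper names
(debug_url, parsed_query) are made up for illustration. It only builds the
URLs and reads an already-decoded JSON response; it does not contact Solr.

```python
from urllib.parse import urlencode

# Hypothetical base URL and core name; replace with your own.
SOLR_BASE = "http://localhost:8983/solr/mycore"


def debug_url(handler, query, use_q_alt=False):
    """Build a Solr URL that asks for the parsed query in the debug section."""
    params = {
        "q.alt" if use_q_alt else "q": query,
        "debug": "query",     # parsed query only, no score explanations
        "echoParams": "all",  # also echo the defaults set in solrconfig.xml
        "fl": "*",
        "wt": "json",
    }
    return f"{SOLR_BASE}{handler}?{urlencode(params)}"


def parsed_query(response):
    """Return debug.parsedquery from a decoded JSON response, or None."""
    return response.get("debug", {}).get("parsedquery")


term = "Lymphoid and a non-Lymphoid cell"
select_url = debug_url("/select", term)                 # default handler
search_url = debug_url("/search", term)                 # edismax handler, q
alt_url = debug_url("/search", term, use_q_alt=True)    # edismax handler, q.alt
```

Fetch each URL, decode the JSON, and put the three parsed_query() values
side by side; the difference between them should show where edismax changes
the query.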



> I am sorry but I didn't understand what do you want me to do exactly with
> the lst (??) and qf and bf.


What combinations did you try? I was referring to the field-level boosts
you have applied in your edismax config.

*Let me explain again:* In your solrconfig.xml, look at your /search
request handler. It has many qf entries and some bq boosts. Remove all of
them, check the response again (with q this time, not q.alt), then add them
back one by one, watching for the point where numFound changes drastically.
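
The procedure can be sketched like this. It is only an illustration under
assumptions: the function names are made up, count_hits stands for whatever
you use to run the query and read response.numFound, and the field names
other than "name" as well as all the numbers are invented.

```python
def qf_trace(qf_fields, count_hits):
    """Grow qf by one field per step and record numFound at each step.

    count_hits is any callable that takes a qf string and returns numFound.
    """
    trace = []
    for i in range(1, len(qf_fields) + 1):
        qf = " ".join(qf_fields[:i])
        trace.append((qf, count_hits(qf)))
    return trace


def biggest_jump(trace):
    """Return the (previous, current) pair of steps whose numFound differs most."""
    steps = zip(trace, trace[1:])
    return max(steps, key=lambda pair: abs(pair[1][1] - pair[0][1]), default=None)


# Invented counts standing in for real Solr responses.
fake_counts = {
    "name": 100,
    "name summation": 95,
    "name summation keywords": 0,
}
trace = qf_trace(["name", "summation", "keywords"], fake_counts.get)
before, after = biggest_jump(trace)
print("numFound went from", before[1], "to", after[1], "with qf =", after[0])
```

The same loop works for the bq boosts: keep qf fixed and add one boost per
step.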

On Fri, 8 Nov 2019 at 23:47, David Hastings <hastings.recurs...@gmail.com>
wrote:

> I use 3 word shingles with stopwords for my MLT ML trainer that worked
> pretty well for such a solution, but for a full index the size became
> prohibitive
>
> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <wun...@wunderwood.org>
> wrote:
>
> > If we had IDF for phrases, they would be super effective. The 2X weight
> is
> > a hack that mostly works.
> >
> > Infoseek had phrase IDF and it was a killer algorithm for relevance.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On Nov 8, 2019, at 11:08 AM, David Hastings <
> > hastings.recurs...@gmail.com> wrote:
> > >
> > > the pf and qf fields are REALLY nice for this
> > >
> > > On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <
> wun...@wunderwood.org>
> > > wrote:
> > >
> > >> I always enable phrase searching in edismax for exactly this reason.
> > >>
> > >> Something like:
> > >>
> > >>       <str name="qf">title^8 keywords^4 text</str>
> > >>       <str name="pf">title^16 keywords^8 text^2</str>
> > >>
> > >> To deal with concepts in queries, a classifier and/or named entity
> > >> extractor can be helpful. If you have a list of concepts (“controlled
> > >> vocabulary”) that includes “Lamin A”, and that shows up in a query,
> that
> > >> term can be queried against the field matching that vocabulary.
> > >>
> > >> This is how LinkedIn separates people, companies, and places, for
> > example.
> > >>
> > >> wunder
> > >> Walter Underwood
> > >> wun...@wunderwood.org
> > >> http://observer.wunderwood.org/  (my blog)
> > >>
> > >>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <erickerick...@gmail.com
> >
> > >> wrote:
> > >>>
> > >>> Look at the “mm” parameter, try setting it to 100%. Although that’s
> not
> > >> entirely likely to do what you want either since virtually every doc
> > will
> > >> have “a” in it. But at least you’d get docs that have both terms.
> > >>>
> > >>> you may also be able to search for things like “Lamin A” _only as a
> > >> phrase_ and have some luck. But this is a gnarly problem in general.
> > Some
> > >> people have been able to substitute synonyms and/or shingles to make
> > this
> > >> work at the expense of a larger index.
> > >>>
> > >>> This is a generic problem with context. “Lamin A” is really a
> > “concept”,
> > >> not just two words that happen to be near each other. Searching as a
> > phrase
> > >> is an OOB-but-naive way to try to make it more likely that the ranked
> > >> results refer to the _concept_ of “Lamin A”. The assumption here is
> “if
> > >> these two words appear next to each other, they’re more likely to be
> > what I
> > >> want”. I say “naive” because “Lamins: A new approach to...” would
> > _also_ be
> > >> found for a naive phrase search. (I have no idea whether such a title
> > makes
> > >> sense or not, but you figured that out already)...
> > >>>
> > >>> To do this well you’d have to dive in to NLP/Machine learning.
> > >>>
> > >>> I truly wish we could have the DWIM search algorithm (Do What I
> Mean)….
> > >>>
> > >>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <gvit...@ebi.ac.uk>
> > >> wrote:
> > >>>>
> > >>>> HI Walter and Paras
> > >>>>
> > >>>> I indexed it removing all the references to StopWordFilter and I
> went
> > >> from 121 results to near 20K as the search term q="Lymphoid and a
> > >> non-Lymphoid cell" is matching entities such as "IFT A" or  "Lamin A".
> > So I
> > >> don't think removing it completely is the way to go from the scenario
> we
> > >> have, but I appreciate the suggestion…
> > >>>>
> > >>>> Yes the response is using fl=*
> > >>>> I am trying some combinations at the moment, but yet no success.
> > >>>>
> > >>>> defType=edismax
> > >>>> q.alt=Lymphoid and a non-Lymphoid cell
> > >>>> Number of results=1599
> > >>>> Quite a considerable increase, even though reasonable meaningful
> > >> results.
> > >>>>
> > >>>> I am sorry but I didn't understand what do you want me to do exactly
> > >> with the lst (??) and qf and bf.
> > >>>>
> > >>>> Thanks everyone with their inputs
> > >>>>
> > >>>>
> > >>>>> On 8 Nov 2019, at 06:45, Paras Lehana <paras.leh...@indiamart.com>
> > >> wrote:
> > >>>>>
> > >>>>> Hi Guilherme
> > >>>>>
> > >>>>> By accident, I ended up querying the using the default handler
> > >> (/select) and it worked.
> > >>>>>
> > >>>>> You've just found the culprit. Thanks for giving the material I
> > >> requested. Your analysis chain is working as expected. I don't see any
> > >> issue in either StopWordFilter or your boosts. I also use a boost of
> 50
> > >> when boosting contextual suggestions (boosting "gold iphone" on a page
> > of
> > >> iphone) but I take Walter's suggestion and would try to optimize my
> > >> weights. I agree that this 50 thing was not researched much about by
> us
> > as
> > >> well (we never faced performance or relevance issues).
> > >>>>>
> > >>>>> See the major difference in both the handlers - edismax. I'm pretty
> > >> sure that your problem lies in the parsing of queries (you can confirm
> > that
> > >> from parsedquery key in debug of both JSON responses). I hope you have
> > >> provided the response with fl=*. Replace q with q.alt in your /search
> > >> handler query and I think you should start getting responses. That's
> > >> because q.alt uses standard parser. If you want to keep using
> edisMax, I
> > >> suggest you to test the responses removing some combination of lst
> (qf,
> > bf)
> > >> and find what's restricting the documents to come up. I'm out of
> office
> > >> today - would have certainly tried analyzing the field values of the
> > >> document in /select request and compare it with qf/bq in
> solrconfig.xml
> > >> /search. Do this for me and you'd certainly find something.
> > >>>>>
> > >>>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <
> wun...@wunderwood.org
> > >> <mailto:wun...@wunderwood.org>> wrote:
> > >>>>> I normally use a weight of 8 for the most important field, like
> > title.
> > >> Other fields might get a 4 or 2.
> > >>>>>
> > >>>>> I add a “pf” field with the weights doubled, so that phrase matches
> > >> have a higher weight.
> > >>>>>
> > >>>>> The weight of 8 comes from experience at Infoseek and Inktomi, two
> > >> early web search engines. With different relevance algorithms and
> > totally
> > >> different evaluation and tuning systems, they settled on weights of 8
> > and
> > >> 7.5 for HTML titles. With the two radically different systems
> getting
> > >> the same number, I decided that was a property of the documents, not
> of
> > the
> > >> search engines.
> > >>>>>
> > >>>>> wunder
> > >>>>> Walter Underwood
> > >>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> > >>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
> > >> (my blog)
> > >>>>>
> > >>>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <gvit...@ebi.ac.uk
> > >> <mailto:gvit...@ebi.ac.uk>> wrote:
> > >>>>>>
> > >>>>>> Hi Wunder,
> > >>>>>>
> > >>>>>> My indexer takes quite a few hours to be executed I am shortening
> it
> > >> to run faster, but I also need to make sure it gives what we are
> > expecting.
> > >> This implementation's been there for >4y, and massively used.
> > >>>>>>
> > >>>>>>> In your edismax handlers, weights of 20, 50, and 100 are
> extremely
> > >> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
> > years
> > >> of configuring Solr.
> > >>>>>> I've inherited that implementation and I am really keen to
> adequate
> > >> it, what would you recommend ?
> > >>>>>>
> > >>>>>> Cheers
> > >>>>>> Guilherme
> > >>>>>>
> > >>>>>>> On 7 Nov 2019, at 14:43, Walter Underwood <wun...@wunderwood.org
> > >> <mailto:wun...@wunderwood.org>> wrote:
> > >>>>>>>
> > >>>>>>> Thanks for posting the files. Looking at schema.xml, I see that
> you
> > >> still are using StopFilterFactory. The first advice we gave you was to
> > >> remove that.
> > >>>>>>>
> > >>>>>>> Remove StopFilterFactory everywhere and reindex.
> > >>>>>>>
> > >>>>>>> You will continue to have problems matching stopwords until you
> do
> > >> that.
> > >>>>>>>
> > >>>>>>> In your edismax handlers, weights of 20, 50, and 100 are
> extremely
> > >> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
> > years
> > >> of configuring Solr.
> > >>>>>>>
> > >>>>>>> wunder
> > >>>>>>> Walter Underwood
> > >>>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> > >>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/
> >
> > >> (my blog)
> > >>>>>>>
> > >>>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk
> > >> <mailto:gvit...@ebi.ac.uk>> wrote:
> > >>>>>>>>
> > >>>>>>>> Hi Paras, everyone
> > >>>>>>>>
> > >>>>>>>> Thank you again for your inputs and suggestions. I sorry to hear
> > >> you had trouble with the attachments I will host it somewhere and
> share
> > the
> > >> links.
> > >>>>>>>> I don't tweak my index, I get the data from the graph database,
> > >> create a document as they are and save to solr.
> > >>>>>>>>
> > >>>>>>>> So, I am sending the new analysis screen querying the way you
> > >> suggested. Also the results with params and solr query url.
> > >>>>>>>>
> > >>>>>>>> During the process of querying what you asked I found something
> > >> really weird (at least for me). By accident, I ended up querying the
> > using
> > >> the default handler (/select) and it worked. Then If I use the one I
> > must
> > >> use, then sadly doesn't work. I am posting both results and I will
> also
> > >> post the handlers as well.
> > >>>>>>>>
> > >>>>>>>> Here is the link with all the files mentioned before
> > >>>>>>>>
> > >>>>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> > >>>>>>>> If the link doesn't work www dot dropbox dot com slash sh slash
> > >> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
> > >>>>>>>>
> > >>>>>>>> Thanks
> > >>>>>>>>
> > >>>>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana <
> > paras.leh...@indiamart.com
> > >> <mailto:paras.leh...@indiamart.com>> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Hi Guilherme.
> > >>>>>>>>>
> > >>>>>>>>> I am sending they analysis result and the json result as
> > requested.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Thanks for the effort. Luckily, I can see your attachments (low
> > >> quality
> > >>>>>>>>> though).
> > >>>>>>>>>
> > >>>>>>>>> From the analysis screen, the analysis is working as expected.
> > One
> > >> of the
> > >>>>>>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not
> > matching
> > >>>>>>>>> document containing "Lymphoid and a non-Lymphoid cell" I can
> > >> initially
> > >>>>>>>>> think of is: the stopword "a" is probably present in
> > post-analysis
> > >> either
> > >>>>>>>>> of query or index. Did you tweak your index time analysis after
> > >> indexing?
> > >>>>>>>>>
> > >>>>>>>>> Do two things:
> > >>>>>>>>>
> > >>>>>>>>> 1. Post the analysis screen for and index=*"Immunoregulatory
> > >>>>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
> > >>>>>>>>> "query=*"lymphoid
> > >>>>>>>>> and a non-lymphoid cell"*. Try hosting the image and providing
> > the
> > >> link
> > >>>>>>>>> here.
> > >>>>>>>>> 2. Give the same JSON output as you have sent but this time
> with
> > >>>>>>>>> *"echoParams=all"*. Also, post the exact Solr query url.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <
> > >> erickerick...@gmail.com <mailto:erickerick...@gmail.com>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or
> some
> > >> such. The
> > >>>>>>>>>> Apache server is fairly aggressive about stripping attachments
> > >> though, so
> > >>>>>>>>>> it’s also possible they didn’t make it through.
> > >>>>>>>>>>
> > >>>>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <
> > gvit...@ebi.ac.uk
> > >> <mailto:gvit...@ebi.ac.uk>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks Erick.
> > >>>>>>>>>>>
> > >>>>>>>>>>>> First, your index and analysis chains are considerably
> > >> different, this
> > >>>>>>>>>> can easily be a source of problems. In particular, using two
> > >> different
> > >>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
> > >> this unless
> > >>>>>>>>>> you’re totally sure you understand the consequences.
> > >> Additionally, your use
> > >>>>>>>>>> of the length filter is suspicious, especially since your
> > problem
> > >> statement
> > >>>>>>>>>> is about the addition of a single letter term and the min
> length
> > >> allowed on
> > >>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that
> the
> > >> ’a’ is
> > >>>>>>>>>> filtered out in both cases, but maybe you’ve found something
> odd
> > >> about the
> > >>>>>>>>>> interactions.
> > >>>>>>>>>>> I will investigate the min length and post the results later.
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Second, I have no idea what this will do. Are the equal
> signs
> > >> typos?
> > >>>>>>>>>> Used by custom code?
> > >>>>>>>>>>> This the url in my application, not solr params. That's the
> > >> query string.
> > >>>>>>>>>>>
> > >>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s
> > likely
> > >> that
> > >>>>>>>>>> all the params with an equal-sign are totally ignored unless
> > it’s
> > >> just a
> > >>>>>>>>>> typo.
> > >>>>>>>>>>> This is part of the application. Species will be used later
> on
> > >> in solr
> > >>>>>>>>>> to filter out the result. That's not solr. That my app params.
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Third, the easiest way to see what’s happening under the
> > covers
> > >> is to
> > >>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
> > >> Ignore all the
> > >>>>>>>>>> relevance calculations for the nonce, or specify
> “&debug=query”
> > >> to skip
> > >>>>>>>>>> that part.
> > >>>>>>>>>>> The two json files i've sent, they are debugQuery=on and the
> > >> explain tag
> > >>>>>>>>>> is present.
> > >>>>>>>>>>> I will try the searching the way you mentioned.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thank for your inputs
> > >>>>>>>>>>>
> > >>>>>>>>>>> Guilherme
> > >>>>>>>>>>>
> > >>>>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <
> > >> erickerick...@gmail.com <mailto:erickerick...@gmail.com>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Fwd to another server
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> First, your index and analysis chains are considerably
> > >> different, this
> > >>>>>>>>>> can easily be a source of problems. In particular, using two
> > >> different
> > >>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
> > >> this unless
> > >>>>>>>>>> you’re totally sure you understand the consequences.
> > >> Additionally, your use
> > >>>>>>>>>> of the length filter is suspicious, especially since your
> > problem
> > >> statement
> > >>>>>>>>>> is about the addition of a single letter term and the min
> length
> > >> allowed on
> > >>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that
> the
> > >> ’a’ is
> > >>>>>>>>>> filtered out in both cases, but maybe you’ve found something
> odd
> > >> about the
> > >>>>>>>>>> interactions.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Second, I have no idea what this will do. Are the equal
> signs
> > >> typos?
> > >>>>>>>>>> Used by custom code?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s
> > likely
> > >> that
> > >>>>>>>>>> all the params with an equal-sign are totally ignored unless
> > it’s
> > >> just a
> > >>>>>>>>>> typo.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Third, the easiest way to see what’s happening under the
> > covers
> > >> is to
> > >>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
> > >> Ignore all the
> > >>>>>>>>>> relevance calculations for the nonce, or specify
> “&debug=query”
> > >> to skip
> > >>>>>>>>>> that part.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 90% + of the time, the question “why didn’t this query do
> > what I
> > >>>>>>>>>> expect” is answered by looking at the “&debug=query” output
> and
> > >> the
> > >>>>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be
> > >> sure to look
> > >>>>>>>>>> at _both_ the query and index output. Also, and very important
> > >> about the
> > >>>>>>>>>> analysis page (and this is confusing) is that this _assumes_
> > that
> > >> what you
> > >>>>>>>>>> put in the text boxes have made it through the query parser
> > >> intact and is
> > >>>>>>>>>> analyzed by the field selected. Consider the search
> > >> "q=field:word1 word2".
> > >>>>>>>>>> Now you type “word1 word2” into the analysis text box and it
> > >> looks like
> > >>>>>>>>>> what you expect. That’s misleading because the query is
> _parsed_
> > >> as
> > >>>>>>>>>> "field:word1 default_search_field:word2”. This is where
> > >> “&debug=query”
> > >>>>>>>>>> helps.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Erick
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
> > >> paras.leh...@indiamart.com <mailto:paras.leh...@indiamart.com>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Walter,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
> > >> Those words
> > >>>>>>>>>> will
> > >>>>>>>>>>>>>> not be in the index, so they can never match a query.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I think the OP's concern is different results when adding a
> > >> stopword. I
> > >>>>>>>>>>>>> think he's using the filter factory correctly - the query
> > chain
> > >>>>>>>>>> includes
> > >>>>>>>>>>>>> the filter as well so it should remove "a" while querying.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *@Guilherme*, please post results for both the query, the
> > >> document in
> > >>>>>>>>>>>>> result you are concerned about and post full result of
> > >> analysis screen
> > >>>>>>>>>> (for
> > >>>>>>>>>>>>> both query and index).
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
> > >> wun...@wunderwood.org <mailto:wun...@wunderwood.org>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> No.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
> > >> Those words
> > >>>>>>>>>>>>>> will not be in the index, so they can never match a query.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 1. Remove the lines with solr.StopFilter from every
> analysis
> > >> chain in
> > >>>>>>>>>>>>>> schema.xml.
> > >>>>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to
> read
> > >> the new
> > >>>>>>>>>> config.
> > >>>>>>>>>>>>>> 3. Reindex all of the documents.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> When indexed with the new analysis chain, the stopwords
> will
> > >> not be
> > >>>>>>>>>>>>>> removed and they will be searchable.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> wunder
> > >>>>>>>>>>>>>> Walter Underwood
> > >>>>>>>>>>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> > >>>>>>>>>>>>>> http://observer.wunderwood.org/ <
> > >> http://observer.wunderwood.org/>  (my blog)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
> > >> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Ok. I am kind a lost now.
> > >>>>>>>>>>>>>>> If I open up the console > analysis and perform it,
> that's
> > >> the final
> > >>>>>>>>>>>>>> result.
> > >>>>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt>
> in
> > >> the
> > >>>>>>>>>>>>>> schema.xml and during index phase replaceAll("in
> > >> stopwords.txt"," ")
> > >>>>>>>>>> then
> > >>>>>>>>>>>>>> add to solr. Is that correct ?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks David
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
> > >>>>>>>>>> hastings.recurs...@gmail.com <mailto:
> > hastings.recurs...@gmail.com
> > >>>
> > >>>>>>>>>>>>>> <mailto:hastings.recurs...@gmail.com <mailto:
> > >> hastings.recurs...@gmail.com>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Fwd to another server
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> no,
> > >>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory"
> > >> ignoreCase="true"
> > >>>>>>>>>>>>>>>> words="stopwords.txt"/>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> is still using stopwords and should be removed, in my
> > >> opinion of
> > >>>>>>>>>> course,
> > >>>>>>>>>>>>>>>> based on your use case may be different, but i generally
> > >> axe any
> > >>>>>>>>>>>>>> reference
> > >>>>>>>>>>>>>>>> to them at all
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
> > >> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>
> > >>>>>>>>>>>>>> <mailto:gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>>
> > wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks.
> > >>>>>>>>>>>>>>>>> Haven't I done this here ?
> > >>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
> > >>>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
> > >>>>>>>>>>>>>>>>> <analyzer type="index">
> > >>>>>>>>>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >>>>>>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
> > >>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2"
> > >>>>>>>>>>>>>> max="20"/>
> > >>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory"
> > >> ignoreCase="true"
> > >>>>>>>>>>>>>>>>> words="stopwords.txt"/>
> > >>>>>>>>>>>>>>>>> </analyzer>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
> > >>>>>>>>>> hastings.recurs...@gmail.com <mailto:
> > hastings.recurs...@gmail.com
> > >>>
> > >>>>>>>>>>>>>> <mailto:hastings.recurs...@gmail.com <mailto:
> > >> hastings.recurs...@gmail.com>>>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Fwd to another server
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The first thing you should do is remove any reference
> to
> > >> stop
> > >>>>>>>>>> words
> > >>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>> never use them, then re-index your data and try it
> > again.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
> > >>>>>>>>>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>
> > >>>>>>>>>>>>>> <mailto:gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> I am performing a search to match a name
> (text_field),
> > >> however
> > >>>>>>>>>> this
> > >>>>>>>>>>>>>> term
> > >>>>>>>>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any
> > >> records. If i
> > >>>>>>>>>> remove
> > >>>>>>>>>>>>>>>>> 'a'
> > >>>>>>>>>>>>>>>>>>> then it works.
> > >>>>>>>>>>>>>>>>>>> e.g
> > >>>>>>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
> > >>>>>>>>>>>>>>>>>>> doesn't work:
> > >>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
> > >>>>>>>>>>>>>>>>>>> works:
> > >>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> > >>>>>>>>>>>>>>>>>>> interested in the first result
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> schema.xml
> > >>>>>>>>>>>>>>>>>>> <field name="name"
> > >> type="text_field"
> > >>>>>>>>>>>>>>>>>>> indexed="true"  stored="true"   omitNorms="false"
> > >>>>>>>>>> required="true"
> > >>>>>>>>>>>>>>>>>>> multiValued="false"/>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> <analyzer type="query">
> > >>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2"
> > >>>>>>>>>>>>>>>>> max="20"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory"
> > >>>>>>>>>>>>>> ignoreCase="true"
> > >>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
> > >>>>>>>>>>>>>>>>>>> </analyzer>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
> > >>>>>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
> > >>>>>>>>>>>>>>>>>>> <analyzer type="index">
> > >>>>>>>>>>>>>>>>>>>     <tokenizer
> class="solr.StandardTokenizerFactory"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2"
> > >>>>>>>>>>>>>>>>> max="20"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory"
> > >>>>>>>>>>>>>> ignoreCase="true"
> > >>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
> > >>>>>>>>>>>>>>>>>>> </analyzer>
> > >>>>>>>>>>>>>>>>>>> <analyzer type="query">
> > >>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2"
> > >>>>>>>>>>>>>>>>> max="20"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory"
> > >>>>>>>>>>>>>> ignoreCase="true"
> > >>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
> > >>>>>>>>>>>>>>>>>>> </analyzer>
> > >>>>>>>>>>>>>>>>>>> </fieldType>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> stopwords.txt
> > >>>>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's
> > >> StopAnalyzer
> > >>>>>>>>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>>> b
> > >>>>>>>>>>>>>>>>>>> c
> > >>>>>>>>>>>>>>>>>>> ....
> > >>>>>>>>>>>>>>>>>>> an
> > >>>>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>> are
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Running SolR 6.6.2.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Thanks
> > >>>>>>>>>>>>>>>>>>> Guilherme
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> --
> > >>>>>>>>>>>>> --
> > >>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *Paras Lehana* [65871]
> > >>>>>>>>>>>>> Development Engineer, Auto-Suggest,
> > >>>>>>>>>>>>> IndiaMART Intermesh Ltd.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> > >>>>>>>>>>>>> Noida, UP, IN - 201303
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Mob.: +91-9560911996
> > >>>>>>>>>>>>> Work: 01203916600 | Extn:  *8173*
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> --
> > >>>>>>>>>>>>> IMPORTANT:
> > >>>>>>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> --
> > >>>>>>>>> Regards,
> > >>>>>>>>>
> > >>>>>>>>> *Paras Lehana* [65871]
> > >>>>>>>>> Development Engineer, Auto-Suggest,
> > >>>>>>>>> IndiaMART Intermesh Ltd.
> > >>>>>>>>>
> > >>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> > >>>>>>>>> Noida, UP, IN - 201303
> > >>>>>>>>>
> > >>>>>>>>> Mob.: +91-9560911996
> > >>>>>>>>> Work: 01203916600 | Extn:  *8173*
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> IMPORTANT:
> > >>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> --
> > >>>>> Regards,
> > >>>>>
> > >>>>> Paras Lehana [65871]
> > >>>>> Development Engineer, Auto-Suggest,
> > >>>>> IndiaMART Intermesh Ltd.
> > >>>>>
> > >>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> > >>>>> Noida, UP, IN - 201303
> > >>>>>
> > >>>>> Mob.: +91-9560911996 <tel:+91-9560911996>
> > >>>>> Work: 01203916600 | Extn:  8173
> > >>>>>
> > >>>>> IMPORTANT:
> > >>>>> NEVER share your IndiaMART OTP/ Password with anyone.
> > >>>
> > >>
> > >>
> >
> >
>


-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*

