Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Guilherme Viteri Mon, 11 Nov 2019 04:24:33 -0800

Thanks
> Removing stopwords is another story. I'm curious to find the reason
> assuming that you keep on using stopwords. In some cases, stopwords are
> really necessary.
Yes. Stopwords have always made sense for the way we've been using them.


> If q.alt is giving you responses, it's confirmed that your stopwords filter
> is working as expected. The problem definitely lies in the configuration of
> edismax.
I see.

> *Let me explain again:* In your solrconfig.xml, look at your /search
OK. Using q now, I removed all qf, performed the search, and got 23 results,
with the one I really want at the top.
As soon as I add dbId or stId (regardless of the boost, 1.0 or 100.0), I
don't get anything (which makes sense). However, if I query name_exact, I get
the 23 results again, and unfortunately if I query stId^1.0 name_exact^10.0 I
still don't get any results.

In summary:
- without qf - 23 results
- dbId - 0 results
- name_exact - 16 results
- name - 23 results
- dbId^1.0 name_exact^10.0 - 0 results
- 0 results whenever a key field (stId, dbId) is added on top of the
  name fields (name, name_exact, etc.).

Definitely lost here! :-/
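
For anyone who wants to reproduce the comparison, here is a minimal sketch of
how the combinations in the summary could be requested one at a time. It only
builds the request URLs. The core URL (http://localhost:8983/solr/mycore) is a
hypothetical placeholder, and passing qf, defType, rows and debug as request
parameters assumes the /search handler does not lock them as invariants.
debug=query is the option Erick suggested for inspecting the parsed query.

```python
from urllib.parse import urlencode

# Hypothetical core URL, for illustration only; replace with the real one.
SOLR_BASE = "http://localhost:8983/solr/mycore"
QUERY = "lymphoid and a non-lymphoid cell"

# The qf combinations listed in the summary ("" means no qf at all).
QF_VARIANTS = [
    "",
    "dbId",
    "name_exact",
    "name",
    "dbId^1.0 name_exact^10.0",
]


def build_search_url(base, q, qf):
    """Return a /search URL that asks for the hit count and the parsed query only."""
    params = {"q": q, "defType": "edismax", "rows": 0, "debug": "query"}
    if qf:
        params["qf"] = qf
    return f"{base}/search?{urlencode(params)}"


if __name__ == "__main__":
    for qf in QF_VARIANTS:
        print(f"qf={qf or '(none)'}")
        print("  " + build_search_url(SOLR_BASE, QUERY, qf))
```

Comparing the parsedquery entry of the debug output between a 23-result
variant and a 0-result variant may show which clause is required but cannot
match.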


> On 11 Nov 2019, at 07:59, Paras Lehana <paras.leh...@indiamart.com> wrote:
> 
> Hi
> 
> So I don't think removing it completely is the way to go from the scenario
>> we have
> 
> 
> Removing stopwords is another story. I'm curious to find the reason
> assuming that you keep on using stopwords. In some cases, stopwords are
> really necessary.
> 
> 
> Quite a considerable increase
> 
> 
> If q.alt is giving you responses, it's confirmed that your stopwords filter
> is working as expected. The problem definitely lies in the configuration of
> edismax.
> 
> 
> 
>> I am sorry but I didn't understand what do you want me to do exactly with
>> the lst (??) and qf and bf.
> 
> 
> What combinations did you try? I was referring to the field-level boosting
> you have applied in edismax config.
> 
> *Let me explain again:* In your solrconfig.xml, look at your /search
> request handler. There are many qf and some bq boosts. I want you to remove
> all of these, check response again (with q now) and keep on adding them
> again (one by one) while looking for when the numFound drastically changes.
> 
> On Fri, 8 Nov 2019 at 23:47, David Hastings <hastings.recurs...@gmail.com>
> wrote:
> 
>> I use 3 word shingles with stopwords for my MLT ML trainer that worked
>> pretty well for such a solution, but for a full index the size became
>> prohibitive
>> 
>> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <wun...@wunderwood.org>
>> wrote:
>> 
>>> If we had IDF for phrases, they would be super effective. The 2X weight
>> is
>>> a hack that mostly works.
>>> 
>>> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>>> On Nov 8, 2019, at 11:08 AM, David Hastings <
>>> hastings.recurs...@gmail.com> wrote:
>>>> 
>>>> the pf and qf fields are REALLY nice for this
>>>> 
>>>> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <
>> wun...@wunderwood.org>
>>>> wrote:
>>>> 
>>>>> I always enable phrase searching in edismax for exactly this reason.
>>>>> 
>>>>> Something like:
>>>>> 
>>>>>      <str name="qf">title^8 keywords^4 text</str>
>>>>>      <str name="pf">title^16 keywords^8 text^2</str>
>>>>> 
>>>>> To deal with concepts in queries, a classifier and/or named entity
>>>>> extractor can be helpful. If you have a list of concepts (“controlled
>>>>> vocabulary”) that includes “Lamin A”, and that shows up in a query,
>> that
>>>>> term can be queried against the field matching that vocabulary.
>>>>> 
>>>>> This is how LinkedIn separates people, companies, and places, for
>>> example.
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>> 
>>>>>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <erickerick...@gmail.com
>>> 
>>>>> wrote:
>>>>>> 
>>>>>> Look at the “mm” parameter, try setting it to 100%. Although that’s not
>>>>>> entirely likely to do what you want either, since virtually every doc will
>>>>>> have “a” in it. But at least you’d get docs that have both terms.
>>>>>> 
>>>>>> you may also be able to search for things like “Lamin A” _only as a
>>>>> phrase_ and have some luck. But this is a gnarly problem in general.
>>> Some
>>>>> people have been able to substitute synonyms and/or shingles to make
>>> this
>>>>> work at the expense of a larger index.
>>>>>> 
>>>>>> This is a generic problem with context. “Lamin A” is really a
>>> “concept”,
>>>>> not just two words that happen to be near each other. Searching as a
>>> phrase
>>>>> is an OOB-but-naive way to try to make it more likely that the ranked
>>>>> results refer to the _concept_ of “Lamin A”. The assumption here is
>> “if
>>>>> these two words appear next to each other, they’re more likely to be
>>> what I
>>>>> want”. I say “naive” because “Lamins: A new approach to...” would
>>> _also_ be
>>>>> found for a naive phrase search. (I have no idea whether such a title
>>> makes
>>>>> sense or not, but you figured that out already)...
>>>>>> 
>>>>>> To do this well you’d have to dive in to NLP/Machine learning.
>>>>>> 
>>>>>> I truly wish we could have the DWIM search algorithm (Do What I
>> Mean)….
>>>>>> 
>>>>>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <gvit...@ebi.ac.uk>
>>>>> wrote:
>>>>>>> 
>>>>>>> HI Walter and Paras
>>>>>>> 
>>>>>>> I reindexed after removing all the references to StopWordFilter and went
>>>>>>> from 121 results to nearly 20K, as the search term q="Lymphoid and a
>>>>>>> non-Lymphoid cell" is matching entities such as "IFT A" or "Lamin A". So I
>>>>>>> don't think removing it completely is the way to go for the scenario we
>>>>>>> have, but I appreciate the suggestion…
>>>>>>> 
>>>>>>> Yes the response is using fl=*
>>>>>>> I am trying some combinations at the moment, but yet no success.
>>>>>>> 
>>>>>>> defType=edismax
>>>>>>> q.alt=Lymphoid and a non-Lymphoid cell
>>>>>>> Number of results=1599
>>>>>>> Quite a considerable increase, even though reasonable meaningful
>>>>> results.
>>>>>>> 
>>>>>>> I am sorry, but I didn't understand what exactly you want me to do
>>>>>>> with the lst (??), qf and bf.
>>>>>>> 
>>>>>>> Thanks everyone with their inputs
>>>>>>> 
>>>>>>> 
>>>>>>>> On 8 Nov 2019, at 06:45, Paras Lehana <paras.leh...@indiamart.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hi Guilherme
>>>>>>>> 
>>>>>>>> By accident, I ended up querying the using the default handler
>>>>> (/select) and it worked.
>>>>>>>> 
>>>>>>>> You've just found the culprit. Thanks for giving the material I
>>>>> requested. Your analysis chain is working as expected. I don't see any
>>>>> issue in either StopWordFilter or your boosts. I also use a boost of
>> 50
>>>>> when boosting contextual suggestions (boosting "gold iphone" on a page
>>> of
>>>>> iphone) but I take Walter's suggestion and would try to optimize my
>>>>> weights. I agree that this 50 thing was not researched much about by
>> us
>>> as
>>>>> well (we never faced performance or relevance issues).
>>>>>>>> 
>>>>>>>> See the major difference in both the handlers - edismax. I'm pretty
>>>>> sure that your problem lies in the parsing of queries (you can confirm
>>> that
>>>>> from parsedquery key in debug of both JSON responses). I hope you have
>>>>> provided the response with fl=*. Replace q with q.alt in your /search
>>>>> handler query and I think you should start getting responses. That's
>>>>> because q.alt uses standard parser. If you want to keep using
>> edisMax, I
>>>>> suggest you to test the responses removing some combination of lst
>> (qf,
>>> bf)
>>>>> and find what's restricting the documents to come up. I'm out of
>> office
>>>>> today - would have certainly tried analyzing the field values of the
>>>>> document in /select request and compare it with qf/bq in
>> solrconfig.xml
>>>>> /search. Do this for me and you'd certainly find something.
>>>>>>>> 
>>>>>>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <
>> wun...@wunderwood.org
>>>>> <mailto:wun...@wunderwood.org>> wrote:
>>>>>>>> I normally use a weight of 8 for the most important field, like
>>> title.
>>>>> Other fields might get a 4 or 2.
>>>>>>>> 
>>>>>>>> I add a “pf” field with the weights doubled, so that phrase matches
>>>>> have a higher weight.
>>>>>>>> 
>>>>>>>> The weight of 8 comes from experience at Infoseek and Inktomi, two
>>>>>>>> early web search engines. With different relevance algorithms and totally
>>>>>>>> different evaluation and tuning systems, they settled on weights of 8 and
>>>>>>>> 7.5 for HTML titles. With the two radically different systems getting
>>>>>>>> the same number, I decided that was a property of the documents, not of
>>>>>>>> the search engines.
>>>>>>>> 
>>>>>>>> wunder
>>>>>>>> Walter Underwood
>>>>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
>>>>> (my blog)
>>>>>>>> 
>>>>>>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <gvit...@ebi.ac.uk
>>>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Wunder,
>>>>>>>>> 
>>>>>>>>> My indexer takes quite a few hours to run. I am shortening it
>>>>>>>>> to run faster, but I also need to make sure it gives what we are expecting.
>>>>>>>>> This implementation has been there for >4y and is massively used.
>>>>>>>>> 
>>>>>>>>>> In your edismax handlers, weights of 20, 50, and 100 are
>> extremely
>>>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
>>> years
>>>>> of configuring Solr.
>>>>>>>>> I've inherited that implementation and I am really keen to adjust
>>>>>>>>> it. What would you recommend?
>>>>>>>>> 
>>>>>>>>> Cheers
>>>>>>>>> Guilherme
>>>>>>>>> 
>>>>>>>>>> On 7 Nov 2019, at 14:43, Walter Underwood <wun...@wunderwood.org
>>>>> <mailto:wun...@wunderwood.org>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Thanks for posting the files. Looking at schema.xml, I see that
>> you
>>>>> still are using StopFilterFactory. The first advice we gave you was to
>>>>> remove that.
>>>>>>>>>> 
>>>>>>>>>> Remove StopFilterFactory everywhere and reindex.
>>>>>>>>>> 
>>>>>>>>>> You will continue to have problems matching stopwords until you
>> do
>>>>> that.
>>>>>>>>>> 
>>>>>>>>>> In your edismax handlers, weights of 20, 50, and 100 are
>> extremely
>>>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
>>> years
>>>>> of configuring Solr.
>>>>>>>>>> 
>>>>>>>>>> wunder
>>>>>>>>>> Walter Underwood
>>>>>>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>>>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/
>>> 
>>>>> (my blog)
>>>>>>>>>> 
>>>>>>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk
>>>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Paras, everyone
>>>>>>>>>>> 
>>>>>>>>>>> Thank you again for your inputs and suggestions. I am sorry to hear
>>>>>>>>>>> you had trouble with the attachments; I will host them somewhere and
>>>>>>>>>>> share the links.
>>>>>>>>>>> I don't tweak my index, I get the data from the graph database,
>>>>> create a document as they are and save to solr.
>>>>>>>>>>> 
>>>>>>>>>>> So, I am sending the new analysis screen querying the way you
>>>>> suggested. Also the results with params and solr query url.
>>>>>>>>>>> 
>>>>>>>>>>> During the process of querying what you asked, I found something
>>>>>>>>>>> really weird (at least for me). By accident, I ended up querying using
>>>>>>>>>>> the default handler (/select) and it worked. If I then use the one I
>>>>>>>>>>> must use, it sadly doesn't work. I am posting both results and I will
>>>>>>>>>>> also post the handlers.
>>>>>>>>>>> 
>>>>>>>>>>> Here is the link with all the files mentioned before
>>>>>>>>>>> 
>>>>>>>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>>>>>>>>>> 
>>>>>>>>>>> If the link doesn't work www dot dropbox dot com slash sh slash
>>>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> 
>>>>>>>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana <
>>> paras.leh...@indiamart.com
>>>>> <mailto:paras.leh...@indiamart.com>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Guilherme.
>>>>>>>>>>>> 
>>>>>>>>>>>> I am sending they analysis result and the json result as
>>> requested.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for the effort. Luckily, I can see your attachments (low
>>>>> quality
>>>>>>>>>>>> though).
>>>>>>>>>>>> 
>>>>>>>>>>>> From the analysis screen, the analysis is working as expected.
>>> One
>>>>> of the
>>>>>>>>>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not
>>> matching
>>>>>>>>>>>> document containing "Lymphoid and a non-Lymphoid cell" I can
>>>>> initially
>>>>>>>>>>>> think of is: the stopword "a" is probably present in
>>> post-analysis
>>>>> either
>>>>>>>>>>>> of query or index. Did you tweak your index time analysis after
>>>>> indexing?
>>>>>>>>>>>> 
>>>>>>>>>>>> Do two things:
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. Post the analysis screen for and index=*"Immunoregulatory
>>>>>>>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>>>>>>>>>>> "query=*"lymphoid
>>>>>>>>>>>> and a non-lymphoid cell"*. Try hosting the image and providing
>>> the
>>>>> link
>>>>>>>>>>>> here.
>>>>>>>>>>>> 2. Give the same JSON output as you have sent but this time
>> with
>>>>>>>>>>>> *"echoParams=all"*. Also, post the exact Solr query url.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <
>>>>> erickerick...@gmail.com <mailto:erickerick...@gmail.com>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or
>> some
>>>>> such. The
>>>>>>>>>>>>> Apache server is fairly aggressive about stripping attachments
>>>>> though, so
>>>>>>>>>>>>> it’s also possible they didn’t make it through.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <
>>> gvit...@ebi.ac.uk
>>>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks Erick.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> First, your index and analysis chains are considerably
>>>>> different, this
>>>>>>>>>>>>> can easily be a source of problems. In particular, using two
>>>>> different
>>>>>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
>>>>> this unless
>>>>>>>>>>>>> you’re totally sure you understand the consequences.
>>>>> Additionally, your use
>>>>>>>>>>>>> of the length filter is suspicious, especially since your
>>> problem
>>>>> statement
>>>>>>>>>>>>> is about the addition of a single letter term and the min
>> length
>>>>> allowed on
>>>>>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that
>> the
>>>>> ’a’ is
>>>>>>>>>>>>> filtered out in both cases, but maybe you’ve found something
>> odd
>>>>> about the
>>>>>>>>>>>>> interactions.
>>>>>>>>>>>>>> I will investigate the min length and post the results later.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Second, I have no idea what this will do. Are the equal
>> signs
>>>>> typos?
>>>>>>>>>>>>> Used by custom code?
>>>>>>>>>>>>>> This the url in my application, not solr params. That's the
>>>>> query string.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s
>>> likely
>>>>> that
>>>>>>>>>>>>> all the params with an equal-sign are totally ignored unless
>>> it’s
>>>>> just a
>>>>>>>>>>>>> typo.
>>>>>>>>>>>>>> This is part of the application. Species will be used later
>> on
>>>>> in solr
>>>>>>>>>>>>> to filter out the result. That's not solr. That my app params.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Third, the easiest way to see what’s happening under the
>>> covers
>>>>> is to
>>>>>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
>>>>> Ignore all the
>>>>>>>>>>>>> relevance calculations for the nonce, or specify
>> “&debug=query”
>>>>> to skip
>>>>>>>>>>>>> that part.
>>>>>>>>>>>>>> The two json files i've sent, they are debugQuery=on and the
>>>>> explain tag
>>>>>>>>>>>>> is present.
>>>>>>>>>>>>>> I will try the searching the way you mentioned.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thank for your inputs
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <
>>>>> erickerick...@gmail.com <mailto:erickerick...@gmail.com>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> First, your index and analysis chains are considerably
>>>>> different, this
>>>>>>>>>>>>> can easily be a source of problems. In particular, using two
>>>>> different
>>>>>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
>>>>> this unless
>>>>>>>>>>>>> you’re totally sure you understand the consequences.
>>>>> Additionally, your use
>>>>>>>>>>>>> of the length filter is suspicious, especially since your
>>> problem
>>>>> statement
>>>>>>>>>>>>> is about the addition of a single letter term and the min
>> length
>>>>> allowed on
>>>>>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that
>> the
>>>>> ’a’ is
>>>>>>>>>>>>> filtered out in both cases, but maybe you’ve found something
>> odd
>>>>> about the
>>>>>>>>>>>>> interactions.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Second, I have no idea what this will do. Are the equal
>> signs
>>>>> typos?
>>>>>>>>>>>>> Used by custom code?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s
>>> likely
>>>>> that
>>>>>>>>>>>>> all the params with an equal-sign are totally ignored unless
>>> it’s
>>>>> just a
>>>>>>>>>>>>> typo.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Third, the easiest way to see what’s happening under the
>>> covers
>>>>> is to
>>>>>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
>>>>> Ignore all the
>>>>>>>>>>>>> relevance calculations for the nonce, or specify
>> “&debug=query”
>>>>> to skip
>>>>>>>>>>>>> that part.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 90% + of the time, the question “why didn’t this query do
>>> what I
>>>>>>>>>>>>> expect” is answered by looking at the “&debug=query” output
>> and
>>>>> the
>>>>>>>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be
>>>>> sure to look
>>>>>>>>>>>>> at _both_ the query and index output. Also, and very important
>>>>> about the
>>>>>>>>>>>>> analysis page (and this is confusing) is that this _assumes_
>>> that
>>>>> what you
>>>>>>>>>>>>> put in the text boxes have made it through the query parser
>>>>> intact and is
>>>>>>>>>>>>> analyzed by the field selected. Consider the search
>>>>> "q=field:word1 word2".
>>>>>>>>>>>>> Now you type “word1 word2” into the analysis text box and it
>>>>> looks like
>>>>>>>>>>>>> what you expect. That’s misleading because the query is
>> _parsed_
>>>>> as
>>>>>>>>>>>>> "field:word1 default_search_field:word2”. This is where
>>>>> “&debug=query”
>>>>>>>>>>>>> helps.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Erick
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
>>>>> paras.leh...@indiamart.com <mailto:paras.leh...@indiamart.com>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Walter,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
>>>>> Those words
>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>> not be in the index, so they can never match a query.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I think the OP's concern is different results when adding a
>>>>> stopword. I
>>>>>>>>>>>>>>>> think he's using the filter factory correctly - the query
>>> chain
>>>>>>>>>>>>> includes
>>>>>>>>>>>>>>>> the filter as well so it should remove "a" while querying.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> *@Guilherme*, please post results for both the query, the
>>>>> document in
>>>>>>>>>>>>>>>> result you are concerned about and post full result of
>>>>> analysis screen
>>>>>>>>>>>>> (for
>>>>>>>>>>>>>>>> both query and index).
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> No.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
>>>>> Those words
>>>>>>>>>>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 1. Remove the lines with solr.StopFilter from every
>> analysis
>>>>> chain in
>>>>>>>>>>>>>>>>> schema.xml.
>>>>>>>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to
>> read
>>>>> the new
>>>>>>>>>>>>> config.
>>>>>>>>>>>>>>>>> 3. Reindex all of the documents.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> When indexed with the new analysis chain, the stopwords
>> will
>>>>> not be
>>>>>>>>>>>>>>>>> removed and they will be searchable.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wunder
>>>>>>>>>>>>>>>>> Walter Underwood
>>>>>>>>>>>>>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>>>>>>>>>>>>>>> http://observer.wunderwood.org/ <
>>>>> http://observer.wunderwood.org/>  (my blog)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
>>>>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Ok. I am kind a lost now.
>>>>>>>>>>>>>>>>>> If I open up the console > analysis and perform it,
>> that's
>>>>> the final
>>>>>>>>>>>>>>>>> result.
>>>>>>>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt>
>> in
>>>>> the
>>>>>>>>>>>>>>>>> schema.xml and during index phase replaceAll("in
>>>>> stopwords.txt"," ")
>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>>> add to solr. Is that correct ?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks David
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
>>>>>>>>>>>>> hastings.recurs...@gmail.com <mailto:
>>> hastings.recurs...@gmail.com
>>>>>> 
>>>>>>>>>>>>>>>>> <mailto:hastings.recurs...@gmail.com <mailto:
>>>>> hastings.recurs...@gmail.com>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> no,
>>>>>>>>>>>>>>>>>>>    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> is still using stopwords and should be removed, in my
>>>>> opinion of
>>>>>>>>>>>>> course,
>>>>>>>>>>>>>>>>>>> based on your use case may be different, but i generally
>>>>> axe any
>>>>>>>>>>>>>>>>> reference
>>>>>>>>>>>>>>>>>>> to them at all
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
>>>>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>
>>>>>>>>>>>>>>>>> <mailto:gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>>
>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>> Haven't I done this here ?
>>>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
>>>>>>>>>>>>>>>>>>>> <analyzer type="index">
>>>>>>>>>>>>>>>>>>>>    <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>>>>    <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
>>>>>>>>>>>>> hastings.recurs...@gmail.com <mailto:
>>> hastings.recurs...@gmail.com
>>>>>> 
>>>>>>>>>>>>>>>>> <mailto:hastings.recurs...@gmail.com <mailto:
>>>>> hastings.recurs...@gmail.com>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> The first thing you should do is remove any reference
>> to
>>>>> stop
>>>>>>>>>>>>> words
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> never use them, then re-index your data and try it
>>> again.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>>>>>>>>>>>>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>
>>>>>>>>>>>>>>>>> <mailto:gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I am performing a search to match a name
>> (text_field),
>>>>> however
>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>> term
>>>>>>>>>>>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any
>>>>> records. If i
>>>>>>>>>>>>> remove
>>>>>>>>>>>>>>>>>>>> 'a'
>>>>>>>>>>>>>>>>>>>>>> then it works.
>>>>>>>>>>>>>>>>>>>>>> e.g
>>>>>>>>>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>>>>>>>>>>>> works:
>>>>>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> schema.xml
>>>>>>>>>>>>>>>>>>>>>> <field name="name" type="text_field" indexed="true" stored="true" omitNorms="false" required="true" multiValued="false"/>
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> <analyzer type="query">
>>>>>>>>>>>>>>>>>>>>>>    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
>>>>>>>>>>>>>>>>>>>>>> <analyzer type="index">
>>>>>>>>>>>>>>>>>>>>>>    <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>> <analyzer type="query">
>>>>>>>>>>>>>>>>>>>>>>    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's
>>>>> StopAnalyzer
>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>> b
>>>>>>>>>>>>>>>>>>>>>> c
>>>>>>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Running SolR 6.6.2.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> *Paras Lehana* [65871]
>>>>>>>>>>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>>>>>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>>>>>>>>>>> Noida, UP, IN - 201303
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Mob.: +91-9560911996
>>>>>>>>>>>>>>>> Work: 01203916600 | Extn:  *8173*
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> IMPORTANT:
>>>>>>>>>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 
> 
> 
