What I can't understand is this: if I search for the exact term "Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell" I get no results, but if I search for "Immunoregulatory interactions between a Lymphoid and non-Lymphoid cell" (without the second "a") then it works.
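One possible explanation for the zero-result combinations, offered as a guess rather than something confirmed from the parsedquery output: edismax builds one clause per query term across all qf fields. If a stopword such as "a" is removed by the analyzer of name but survives in a field that is not analysed the same way (assumed here for dbId and stId), that term still produces a clause, and with a strict mm no document can satisfy it. The toy model below is plain Python, not Solr code; the field analysis, the sample document, the made-up dbId value and the 100% mm are all assumptions for illustration, and the hyphenated word is treated as a single token to keep it short.

```python
# Toy model of per-term clause building across qf fields. Not Solr code.
STOPWORDS = {"a", "and"}  # assumption: a small subset of stopwords.txt


def analyze_text(term):
    """Mimic the text field: lowercase, then drop short tokens and stopwords."""
    token = term.lower()
    if len(token) < 2 or token in STOPWORDS:
        return None
    return token


def analyze_key(term):
    """Mimic a key field with no stopword removal (assumed for dbId)."""
    return term


ANALYZERS = {"name": analyze_text, "dbId": analyze_key}


def build_clauses(query, qf):
    """Return one clause per query term; a clause lists (field, token) alternatives."""
    clauses = []
    for term in query.split():
        alternatives = []
        for field in qf:
            token = ANALYZERS[field](term)
            if token is not None:
                alternatives.append((field, token))
        if alternatives:  # a term removed in every field yields no clause
            clauses.append(alternatives)
    return clauses


def matches(doc, clauses):
    """Strict mm (100%): every clause needs at least one matching alternative."""
    return all(
        any(token in doc.get(field, set()) for field, token in alternatives)
        for alternatives in clauses
    )


doc = {
    "name": {"immunoregulatory", "interactions", "between",
             "lymphoid", "non-lymphoid", "cell"},
    "dbId": {"12345"},  # made-up identifier
}
query = "lymphoid and a non-lymphoid cell"

print(matches(doc, build_clauses(query, ["name"])))          # True
print(matches(doc, build_clauses(query, ["name", "dbId"])))  # False
```

With name alone the stopwords vanish and three clauses remain, all of which match. Once dbId is added, "and" and "a" each produce a clause that only dbId can satisfy, so the document is rejected. If this is what is happening in the real handler, the parsedquery from debug=query should show clauses for the stopwords that mention only the key fields.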
> On 11 Nov 2019, at 12:24, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>
> Thanks
>> Removing stopwords is another story. I'm curious to find the reason
>> assuming that you keep on using stopwords. In some cases, stopwords are
>> really necessary.
> Yes. It always make sense the way we've been using.
>
>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>> is working as expected. The problem definitely lies in the configuration of
>> edismax.
> I see.
>
>> *Let me explain again:* In your solrconfig.xml, look at your /search
> Ok, using q now, removed all qf, performed the search and I got 23 results,
> and the one I really want, on the top.
> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I
> don't get anything (which make sense). However if I query name_exact, I get
> the 23 results again, and unfortunately if I query stId^1.0 name_exact^10.0 I
> still don't get any results.
>
> In summary
> - without qf - 23 results
> - dbId - 0 results
> - name_exact - 16 results
> - name - 23 results
> - dbId^1.0 name_exact^10.0 - 0 results
> - 0 results if any other, stId, dbId (key) is added on top of the
>   name(name_exact, etc).
>
> Definitely lost here! :-/
>
>
>> On 11 Nov 2019, at 07:59, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>
>> Hi
>>
>>> So I don't think removing it completely is the way to go from the scenario
>>> we have
>>
>> Removing stopwords is another story. I'm curious to find the reason
>> assuming that you keep on using stopwords. In some cases, stopwords are
>> really necessary.
>>
>>> Quite a considerable increase
>>
>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>> is working as expected. The problem definitely lies in the configuration of
>> edismax.
>>
>>> I am sorry but I didn't understand what do you want me to do exactly with
>>> the lst (??) and qf and bf.
>>
>> What combinations did you try? I was referring to the field-level boosting
>> you have applied in edismax config.
>>
>> *Let me explain again:* In your solrconfig.xml, look at your /search
>> request handler. There are many qf and some bq boosts. I want you to remove
>> all of these, check response again (with q now) and keep on adding them
>> again (one by one) while looking for when the numFound drastically changes.
>>
>> On Fri, 8 Nov 2019 at 23:47, David Hastings <hastings.recurs...@gmail.com>
>> wrote:
>>
>>> I use 3 word shingles with stopwords for my MLT ML trainer that worked
>>> pretty well for such a solution, but for a full index the size became
>>> prohibitive
>>>
>>> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <wun...@wunderwood.org>
>>> wrote:
>>>
>>>> If we had IDF for phrases, they would be super effective. The 2X weight is
>>>> a hack that mostly works.
>>>>
>>>> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
>>>>
>>>>> On Nov 8, 2019, at 11:08 AM, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>>
>>>>> the pf and qf fields are REALLY nice for this
>>>>>
>>>>> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <wun...@wunderwood.org>
>>>>> wrote:
>>>>>
>>>>>> I always enable phrase searching in edismax for exactly this reason.
>>>>>>
>>>>>> Something like:
>>>>>>
>>>>>> <str name="qf">title^8 keywords^4 text</str>
>>>>>> <str name="pf">title^16 keywords^8 text^2</str>
>>>>>>
>>>>>> To deal with concepts in queries, a classifier and/or named entity
>>>>>> extractor can be helpful. If you have a list of concepts (“controlled
>>>>>> vocabulary”) that includes “Lamin A”, and that shows up in a query, that
>>>>>> term can be queried against the field matching that vocabulary.
>>>>>>
>>>>>> This is how LinkedIn separates people, companies, and places, for example.
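Walter's rule of thumb quoted above (pf uses the same fields as qf with the weights doubled) can be applied mechanically when experimenting with different qf settings. A small sketch, assuming a field without an explicit boost counts as weight 1; double_weights is a hypothetical helper written for this thread and not part of Solr, and params is only a dictionary of request parameters, nothing is sent anywhere.

```python
def double_weights(qf: str) -> str:
    """Build a pf string whose boosts are twice the boosts found in qf.

    A field without an explicit ^boost is treated as boost 1 (assumption).
    """
    parts = []
    for spec in qf.split():
        field, _, boost = spec.partition("^")
        weight = float(boost) if boost else 1.0
        parts.append(f"{field}^{weight * 2:g}")
    return " ".join(parts)


qf = "title^8 keywords^4 text"
params = {
    "defType": "edismax",
    "qf": qf,
    "pf": double_weights(qf),
}
print(params["pf"])  # title^16 keywords^8 text^2
```

The result matches the pf line in the quoted example, so the helper can be reused while fields are added back to qf one at a time.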
>>>>>> >>>>>> wunder >>>>>> Walter Underwood >>>>>> wun...@wunderwood.org >>>>>> http://observer.wunderwood.org/ (my blog) >>>>>> >>>>>>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <erickerick...@gmail.com >>>> >>>>>> wrote: >>>>>>> >>>>>>> Look at the “mm” parameter, try setting it to 100%. Although that’t >>> not >>>>>> entirely likely to do what you want either since virtually every doc >>>> will >>>>>> have “a” in it. But at least you’d get docs that have both terms. >>>>>>> >>>>>>> you may also be able to search for things like “Lamin A” _only as a >>>>>> phrase_ and have some luck. But this is a gnarly problem in general. >>>> Some >>>>>> people have been able to substitute synonyms and/or shingles to make >>>> this >>>>>> work at the expense of a larger index. >>>>>>> >>>>>>> This is a generic problem with context. “Lamin A” is really a >>>> “concept”, >>>>>> not just two words that happen to be near each other. Searching as a >>>> phrase >>>>>> is an OOB-but-naive way to try to make it more likely that the ranked >>>>>> results refer to the _concept_ of “Lamin A”. The assumption here is >>> “if >>>>>> these two words appear next to each other, they’re more likely to be >>>> what I >>>>>> want”. I say “naive” because “Lamins: A new approach to...” would >>>> _also_ be >>>>>> found for a naive phrase search. (I have no idea whether such a title >>>> makes >>>>>> sense or not, but you figured that out already)... >>>>>>> >>>>>>> To do this well you’d have to dive in to NLP/Machine learning. >>>>>>> >>>>>>> I truly wish we could have the DWIM search algorithm (Do What I >>> Mean)…. >>>>>>> >>>>>>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <gvit...@ebi.ac.uk> >>>>>> wrote: >>>>>>>> >>>>>>>> HI Walter and Paras >>>>>>>> >>>>>>>> I indexed it removing all the references to StopWordFilter and I >>> went >>>>>> from 121 results to near 20K as the search term q="Lymphoid and a >>>>>> non-Lymphoid cell" is matching entities such as "IFT A" or "Lamin A". 
>>>> So I >>>>>> don't think removing it completely is the way to go from the scenario >>> we >>>>>> have, but I appreciate the suggestion… >>>>>>>> >>>>>>>> Yes the response is using fl=* >>>>>>>> I am trying some combinations at the moment, but yet no success. >>>>>>>> >>>>>>>> defType=edismax >>>>>>>> q.alt=Lymphoid and a non-Lymphoid cell >>>>>>>> Number of results=1599 >>>>>>>> Quite a considerable increase, even though reasonable meaningful >>>>>> results. >>>>>>>> >>>>>>>> I am sorry but I didn't understand what do you want me to do exactly >>>>>> with the lst (??) and qf and bf. >>>>>>>> >>>>>>>> Thanks everyone with their inputs >>>>>>>> >>>>>>>> >>>>>>>>> On 8 Nov 2019, at 06:45, Paras Lehana <paras.leh...@indiamart.com> >>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi Guilherme >>>>>>>>> >>>>>>>>> By accident, I ended up querying the using the default handler >>>>>> (/select) and it worked. >>>>>>>>> >>>>>>>>> You've just found the culprit. Thanks for giving the material I >>>>>> requested. Your analysis chain is working as expected. I don't see any >>>>>> issue in either StopWordFilter or your boosts. I also use a boost of >>> 50 >>>>>> when boosting contextual suggestions (boosting "gold iphone" on a page >>>> of >>>>>> iphone) but I take Walter's suggestion and would try to optimize my >>>>>> weights. I agree that this 50 thing was not researched much about by >>> us >>>> as >>>>>> well (we never faced performance or relevance issues). >>>>>>>>> >>>>>>>>> See the major difference in both the handlers - edismax. I'm pretty >>>>>> sure that your problem lies in the parsing of queries (you can confirm >>>> that >>>>>> from parsedquery key in debug of both JSON responses). I hope you have >>>>>> provided the response with fl=*. Replace q with q.alt in your /search >>>>>> handler query and I think you should start getting responses. That's >>>>>> because q.alt uses standard parser. 
If you want to keep using >>> edisMax, I >>>>>> suggest you to test the responses removing some combination of lst >>> (qf, >>>> bf) >>>>>> and find what's restricting the documents to come up. I'm out of >>> office >>>>>> today - would have certainly tried analyzing the field values of the >>>>>> document in /select request and compare it with qf/bq in >>> solrconfig.xml >>>>>> /search. Do this for me and you'd certainly find something. >>>>>>>>> >>>>>>>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood < >>> wun...@wunderwood.org >>>>>> <mailto:wun...@wunderwood.org>> wrote: >>>>>>>>> I normally use a weight of 8 for the most important field, like >>>> title. >>>>>> Other fields might get a 4 or 2. >>>>>>>>> >>>>>>>>> I add a “pf” field with the weights doubled, so that phrase matches >>>>>> have a higher weight. >>>>>>>>> >>>>>>>>> The weight of 8 comes from experience at Infoseek and Inktomi, two >>>>>> early web search engines. With different relevance algorithms and >>>> totally >>>>>> different evaluation and tuning systems, they settled on weights of 8 >>>> and >>>>>> 7.5 for HTML titles. With the the two radically different system >>> getting >>>>>> the same number, I decided that was a property of the documents, not >>> of >>>> the >>>>>> search engines. >>>>>>>>> >>>>>>>>> wunder >>>>>>>>> Walter Underwood >>>>>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org> >>>>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/> >>>>>> (my blog) >>>>>>>>> >>>>>>>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <gvit...@ebi.ac.uk >>>>>> <mailto:gvit...@ebi.ac.uk>> wrote: >>>>>>>>>> >>>>>>>>>> Hi Wunder, >>>>>>>>>> >>>>>>>>>> My indexer takes quite a few hours to be executed I am shortening >>> it >>>>>> to run faster, but I also need to make sure it gives what we are >>>> expecting. >>>>>> This implementation's been there for >4y, and massively used. 
>>>>>>>>>> >>>>>>>>>>> In your edismax handlers, weights of 20, 50, and 100 are >>> extremely >>>>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen >>>> years >>>>>> of configuring Solr. >>>>>>>>>> I've inherited that implementation and I am really keen to >>> adequate >>>>>> it, what would you recommend ? >>>>>>>>>> >>>>>>>>>> Cheers >>>>>>>>>> Guilherme >>>>>>>>>> >>>>>>>>>>> On 7 Nov 2019, at 14:43, Walter Underwood <wun...@wunderwood.org >>>>>> <mailto:wun...@wunderwood.org>> wrote: >>>>>>>>>>> >>>>>>>>>>> Thanks for posting the files. Looking at schema.xml, I see that >>> you >>>>>> still are using StopFilterFactory. The first advice we gave you was to >>>>>> remove that. >>>>>>>>>>> >>>>>>>>>>> Remove StopFilterFactory everywhere and reindex. >>>>>>>>>>> >>>>>>>>>>> You will continue to have problems matching stopwords until you >>> do >>>>>> that. >>>>>>>>>>> >>>>>>>>>>> In your edismax handlers, weights of 20, 50, and 100 are >>> extremely >>>>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen >>>> years >>>>>> of configuring Solr. >>>>>>>>>>> >>>>>>>>>>> wunder >>>>>>>>>>> Walter Underwood >>>>>>>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org> >>>>>>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/ >>>> >>>>>> (my blog) >>>>>>>>>>> >>>>>>>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk >>>>>> <mailto:gvit...@ebi.ac.uk>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Paras, everyone >>>>>>>>>>>> >>>>>>>>>>>> Thank you again for your inputs and suggestions. I sorry to hear >>>>>> you had trouble with the attachments I will host it somewhere and >>> share >>>> the >>>>>> links. >>>>>>>>>>>> I don't tweak my index, I get the data from the graph database, >>>>>> create a document as they are and save to solr. >>>>>>>>>>>> >>>>>>>>>>>> So, I am sending the new analysis screen querying the way you >>>>>> suggested. Also the results with params and solr query url. 
>>>>>>>>>>>> >>>>>>>>>>>> During the process of querying what you asked I found something >>>>>> really weird (at least for me). By accident, I ended up querying the >>>> using >>>>>> the default handler (/select) and it worked. Then If I use the one I >>>> must >>>>>> use, then sadly doesn't work. I am posting both results and I will >>> also >>>>>> post the handlers as well. >>>>>>>>>>>> >>>>>>>>>>>> Here is the link with all the files mentioned before >>>>>>>>>>>> >>>>>> >>>> >>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0< >>>>>> >>>> >>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0> >>>>>> < >>>> >>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 >>>>>> < >>>> >>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 >>>>>>>> >>>>>>>>>>>> If the link doesn't work www dot dropbox dot com slash sh slash >>>>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0 >>>>>>>>>>>> >>>>>>>>>>>> Thanks >>>>>>>>>>>> >>>>>>>>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana < >>>> paras.leh...@indiamart.com >>>>>> <mailto:paras.leh...@indiamart.com>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Guilherme. >>>>>>>>>>>>> >>>>>>>>>>>>> I am sending they analysis result and the json result as >>>> requested. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks for the effort. Luckily, I can see your attachments (low >>>>>> quality >>>>>>>>>>>>> though). >>>>>>>>>>>>> >>>>>>>>>>>>> From the analysis screen, the analysis is working as expected. >>>> One >>>>>> of the >>>>>>>>>>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not >>>> matching >>>>>>>>>>>>> document containing "Lymphoid and a non-Lymphoid cell" I can >>>>>> initially >>>>>>>>>>>>> think of is: the stopword "a" is probably present in >>>> post-analysis >>>>>> either >>>>>>>>>>>>> of query or index. Did you tweak your index time analysis after >>>>>> indexing? 
>>>>>>>>>>>>> >>>>>>>>>>>>> Do two things: >>>>>>>>>>>>> >>>>>>>>>>>>> 1. Post the analysis screen for and index=*"Immunoregulatory >>>>>>>>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and >>>>>>>>>>>>> "query=*"lymphoid >>>>>>>>>>>>> and a non-lymphoid cell"*. Try hosting the image and providing >>>> the >>>>>> link >>>>>>>>>>>>> here. >>>>>>>>>>>>> 2. Give the same JSON output as you have sent but this time >>> with >>>>>>>>>>>>> *"echoParams=all"*. Also, post the exact Solr query url. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson < >>>>>> erickerick...@gmail.com <mailto:erickerick...@gmail.com>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or >>> some >>>>>> such. The >>>>>>>>>>>>>> Apache server is fairly aggressive about stripping attachments >>>>>> though, so >>>>>>>>>>>>>> it’s also possible they didn’t make it through. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri < >>>> gvit...@ebi.ac.uk >>>>>> <mailto:gvit...@ebi.ac.uk>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks Erick. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> First, your index and analysis chains are considerably >>>>>> different, this >>>>>>>>>>>>>> can easily be a source of problems. In particular, using two >>>>>> different >>>>>>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against >>>>>> this unless >>>>>>>>>>>>>> you’re totally sure you understand the consequences. >>>>>> Additionally, your use >>>>>>>>>>>>>> of the length filter is suspicious, especially since your >>>> problem >>>>>> statement >>>>>>>>>>>>>> is about the addition of a single letter term and the min >>> length >>>>>> allowed on >>>>>>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that >>> the >>>>>> ’a’ is >>>>>>>>>>>>>> filtered out in both cases, but maybe you’ve found something >>> odd >>>>>> about the >>>>>>>>>>>>>> interactions. 
>>>>>>>>>>>>>>> I will investigate the min length and post the results later. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Second, I have no idea what this will do. Are the equal >>> signs >>>>>> typos? >>>>>>>>>>>>>> Used by custom code? >>>>>>>>>>>>>>> This the url in my application, not solr params. That's the >>>>>> query string. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s >>>> likely >>>>>> that >>>>>>>>>>>>>> all the params with an equal-sign are totally ignored unless >>>> it’s >>>>>> just a >>>>>>>>>>>>>> typo. >>>>>>>>>>>>>>> This is part of the application. Species will be used later >>> on >>>>>> in solr >>>>>>>>>>>>>> to filter out the result. That's not solr. That my app params. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Third, the easiest way to see what’s happening under the >>>> covers >>>>>> is to >>>>>>>>>>>>>> add “&debug=true” to the query and look at the parsed query. >>>>>> Ignore all the >>>>>>>>>>>>>> relevance calculations for the nonce, or specify >>> “&debug=query” >>>>>> to skip >>>>>>>>>>>>>> that part. >>>>>>>>>>>>>>> The two json files i've sent, they are debugQuery=on and the >>>>>> explain tag >>>>>>>>>>>>>> is present. >>>>>>>>>>>>>>> I will try the searching the way you mentioned. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thank for your inputs >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Guilherme >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson < >>>>>> erickerick...@gmail.com <mailto:erickerick...@gmail.com>> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Fwd to another server >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> First, your index and analysis chains are considerably >>>>>> different, this >>>>>>>>>>>>>> can easily be a source of problems. In particular, using two >>>>>> different >>>>>>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against >>>>>> this unless >>>>>>>>>>>>>> you’re totally sure you understand the consequences. 
>>>>>> Additionally, your use >>>>>>>>>>>>>> of the length filter is suspicious, especially since your >>>> problem >>>>>> statement >>>>>>>>>>>>>> is about the addition of a single letter term and the min >>> length >>>>>> allowed on >>>>>>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that >>> the >>>>>> ’a’ is >>>>>>>>>>>>>> filtered out in both cases, but maybe you’ve found something >>> odd >>>>>> about the >>>>>>>>>>>>>> interactions. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Second, I have no idea what this will do. Are the equal >>> signs >>>>>> typos? >>>>>>>>>>>>>> Used by custom code? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>> >>>> >>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true >>>>>> < >>>>>> >>>> >>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true >>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s >>>> likely >>>>>> that >>>>>>>>>>>>>> all the params with an equal-sign are totally ignored unless >>>> it’s >>>>>> just a >>>>>>>>>>>>>> typo. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Third, the easiest way to see what’s happening under the >>>> covers >>>>>> is to >>>>>>>>>>>>>> add “&debug=true” to the query and look at the parsed query. >>>>>> Ignore all the >>>>>>>>>>>>>> relevance calculations for the nonce, or specify >>> “&debug=query” >>>>>> to skip >>>>>>>>>>>>>> that part. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 90% + of the time, the question “why didn’t this query do >>>> what I >>>>>>>>>>>>>> expect” is answered by looking at the “&debug=query” output >>> and >>>>>> the >>>>>>>>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be >>>>>> sure to look >>>>>>>>>>>>>> at _both_ the query and index output. 
Also, and very important >>>>>> about the >>>>>>>>>>>>>> analysis page (and this is confusing) is that this _assumes_ >>>> that >>>>>> what you >>>>>>>>>>>>>> put in the text boxes have made it through the query parser >>>>>> intact and is >>>>>>>>>>>>>> analyzed by the field selected. Consider the search >>>>>> "q=field:word1 word2". >>>>>>>>>>>>>> Now you type “word1 word2” into the analysis text box and it >>>>>> looks like >>>>>>>>>>>>>> what you expect. That’s misleading because the query is >>> _parsed_ >>>>>> as >>>>>>>>>>>>>> "field:word1 default_search_field:word2”. This is where >>>>>> “&debug=query” >>>>>>>>>>>>>> helps. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>> Erick >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana < >>>>>> paras.leh...@indiamart.com <mailto:paras.leh...@indiamart.com>> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Walter, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. >>>>>> Those words >>>>>>>>>>>>>> will >>>>>>>>>>>>>>>>>> not be in the index, so they can never match a query. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I think the OP's concern is different results when adding a >>>>>> stopword. I >>>>>>>>>>>>>>>>> think he's using the filter factory correctly - the query >>>> chain >>>>>>>>>>>>>> includes >>>>>>>>>>>>>>>>> the filter as well so it should remove "a" while querying. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> *@Guilherme*, please post results for both the query, the >>>>>> document in >>>>>>>>>>>>>>>>> result you are concerned about and post full result of >>>>>> analysis screen >>>>>>>>>>>>>> (for >>>>>>>>>>>>>>>>> both query and index). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood < >>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> No. 
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>>>>>>>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>>>>>>>>>>>>>>>>>    schema.xml.
>>>>>>>>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to read the new
>>>>>>>>>>>>>>>>>>    config.
>>>>>>>>>>>>>>>>>> 3. Reindex all of the documents.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When indexed with the new analysis chain, the stopwords will not be
>>>>>>>>>>>>>>>>>> removed and they will be searchable.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> wunder
>>>>>>>>>>>>>>>>>> Walter Underwood
>>>>>>>>>>>>>>>>>> wun...@wunderwood.org
>>>>>>>>>>>>>>>>>> http://observer.wunderwood.org/ (my blog)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Ok. I am kind a lost now.
>>>>>>>>>>>>>>>>>>> If I open up the console > analysis and perform it, that's the final
>>>>>>>>>>>>>>>>>>> result.
>>>>>>>>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
>>>>>>>>>>>>>>>>>>> schema.xml and during index phase replaceAll("in stopwords.txt"," ") then
>>>>>>>>>>>>>>>>>>> add to solr. Is that correct ?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks David
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> no,
>>>>>>>>>>>>>>>>>>>> <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> is still using stopwords and should be removed, in my opinion of course,
>>>>>>>>>>>>>>>>>>>> based on your use case may be different, but i generally axe any
>>>>>>>>>>>>>>>>>>>> reference to them at all
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>>> Haven't I done this here ?
>>>>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>>>>>>>>   <analyzer type="index">
>>>>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>>>>>>>>>>     words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The first thing you should do is remove any reference to stop words and
>>>>>>>>>>>>>>>>>>>>>> never use them, then re-index your data and try it again.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I am performing a search to match a name (text_field), however this term
>>>>>>>>>>>>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any records. If i remove 'a'
>>>>>>>>>>>>>>>>>>>>>>> then it works.
>>>>>>>>>>>>>>>>>>>>>>> e.g
>>>>>>>>>>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>>>>>>>>>>>>> works:
>>>>>>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> schema.xml
>>>>>>>>>>>>>>>>>>>>>>> <field name="name" type="text_field"
>>>>>>>>>>>>>>>>>>>>>>> indexed="true" stored="true" omitNorms="false" required="true"
>>>>>>>>>>>>>>>>>>>>>>> multiValued="false"/>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> <analyzer type="query">
>>>>>>>>>>>>>>>>>>>>>>>   <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>>>>>>>>>>>>   pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>>>   pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>>>   pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>>>   pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>>>>>>>>>>>>   words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>>>>>>>>>>   <analyzer type="index">
>>>>>>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>>>>>>>>>>>>     words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>>>>>>>>>>   <analyzer type="query">
>>>>>>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>>>>>>>>>>>>     pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>>>     pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>>>     pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>>>     pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>>>>>>>>>>>>     words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>> b
>>>>>>>>>>>>>>>>>>>>>>> c
>>>>>>>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Running SolR 6.6.2.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Paras Lehana* [65871]
>>>>>>>>>>>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>>>>>>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>>>>>>>>>>>> Noida, UP, IN - 201303
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Mob.: +91-9560911996
>>>>>>>>>>>>>>>>> Work: 01203916600 | Extn: *8173*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> IMPORTANT:
>>>>>>>>>>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.