Folks,

I am seeing very strange (bad) wildcard behavior (solr 8).  

"kinase" finds hits as expected.  

"kin*ase" and "kin*se" find 0 results.  "kinase*" matches only values like 
"kinase," and "kinase-", but not "kinase" itself.

I have done the analysis as Erick suggested (thanks!) but it is not helping me 
understand why we'd have this problem.

I have put together 12 screenshots from the Solr web UI that show in detail:
- the queries I ran to get the results above
- various analyses trying to understand why
- the schema for the fieldType in question

https://docs.google.com/presentation/d/10fIAesqkTnvmJBFaerEhnqWhSiaEvVW7u9jE1nX564Q/edit?usp=sharing

thanks,
steve

-----Original Message-----
From: Sotiris Fragkiskos <sfra...@gmail.com> 
Sent: Thursday, February 13, 2020 4:03 AM
To: solr-user@lucene.apache.org
Subject: [External] Re: wildcards match end-of-word?

Hi Erick,
thanks very much for this information, it was immensely useful, I always had 
the same question!
I'm now seeing the Analysis page and finally I don't have to rely on an 
external online stemmer to see what solr *probably* stemmed the term to!!
But I still can't make the asterisk and question mark work inside the term, 
even in the earlier parts of it.
e.g. tr?ining
I would expect it to match train. But it doesn't.
PSF at the end just shows t | ain;
every line before that actually shows t | aining (ST, SF, SF, LCF, EPF, SKMF).
Am I doing something very wrong??

thanks again!
Sotiri
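
For anyone following along, here is a minimal sketch of how the symptoms reported in the two messages above could arise. It is a toy model, not Solr: the whitespace tokenizer, the `toy_stem` function, and the sample text are all assumptions made for illustration, and the real fieldType in question may be configured differently. The only behavior taken from the thread is that plain terms are stemmed at index and query time while wildcard terms are compared against the indexed tokens unstemmed.

```python
from fnmatch import fnmatchcase


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer; NOT the Porter algorithm."""
    if token.endswith("ing"):
        return token[:-3]
    if token.endswith("e"):
        return token[:-1]
    return token


def index_tokens(text):
    # Assumed analysis chain: split on whitespace, lowercase, then stem.
    # Punctuation stays attached to the token, e.g. "kinase,".
    return {toy_stem(t) for t in text.lower().split()}


def term_hits(term, indexed):
    # A plain term goes through the same stemming as the indexed text.
    stemmed = toy_stem(term.lower())
    return sorted(t for t in indexed if t == stemmed)


def wildcard_hits(pattern, indexed):
    # A wildcard pattern is matched against the indexed tokens as-is,
    # with no stemming applied to the pattern.
    return sorted(t for t in indexed if fnmatchcase(t, pattern))


INDEXED = index_tokens("Kinase activity, kinase, kinase- training")

print(term_hits("kinase", INDEXED))       # stemmed to "kinas", which is indexed
print(wildcard_hits("kinase*", INDEXED))  # only the unstemmed "kinase," and "kinase-"
print(wildcard_hits("kin*ase", INDEXED))  # nothing indexed ends in "ase"
print(wildcard_hits("tr?ining", INDEXED)) # "training" was indexed as "train"
```

In this toy index "kinase" is stored as "kinas", so "kinase*" can only match tokens where trailing punctuation kept the stemmer from firing. A pattern such as "kinas*" would match all three kinase variants here, though whether that is acceptable for real users is a separate question.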

On Wed, Feb 12, 2020 at 1:44 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> Steve:
>
> You _really_ want to get acquainted with the admin UI/Analysis page ;).
> Choose a core/collection and you should see the choice. It shows you 
> exactly what transformations your data goes through. If you hover over 
> the light gray pairs of letters, you’ll get a tooltip showing you what 
> part of your analysis chain is responsible for a particular change. I 
> un-check the “verbose” box 95% of the time BTW.
>
> The critical bit is that what comes out of the end of the analysis 
> pipe are the tokens that are actually _in_ the index. From there, 
> problems like this make more sense.
>
> My bet is that, as Walter says, you have a stemmer in the analysis 
> chain and the actual token in the index is “kinas” so of course 
> “kinase*” won’t be found. By adding OR kinase to the query, that token 
> is stemmed to “kinas” and matches.
>
> Also, adding &debug=query to your URL will show you what the query 
> looks like after parsing and analysis, also a major tool for figuring 
> out what’s really happening.
>
> Wildcards are not stemmed, which can lead to surprising results. 
> There’s no perfect answer here. Let’s claim wildcards _were_ stemmed. 
> Then you’d have to try to explain why “running*” returned a doc with 
> only “run” or “runner” or “runs” or... in it, but searching for 
> “runnin*” did not, due to the stemmer not recognizing it as a stemmable word.
>
> Finally, one of my personal hot buttons is wildcards in general. 
> They’re very often over-used because people are used to simple search 
> capabilities.
> Something about “if your only tool is a hammer, every problem looks 
> like a nail”. That gets into training users too though...
>
> Best,
> Erick
>
> > On Feb 11, 2020, at 9:24 PM, Fischer, Stephen <sfisc...@pennmedicine.upenn.edu> wrote:
> >
> > Hi,
> >
> > I am a solr newbie.  I was surprised to discover that a search for
> > kinase* returned fewer results than kinase.
> >
> > Then I read the wildcard documentation
> > <https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-WildcardSearches>,
> > and saw why.  kinase* will not match the word "kinase".
> >
> > Our end-users won't expect this behavior.  Presumably the solution 
> > would be for them (actually us, on their behalf), to use kinase* OR kinase.
> >
> > But that is kind of a hack.
> >
> > Is there a way we can configure solr to have wildcards match on
> > end-of-word?
> >
> > Thanks,
> > Steve
>
>
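
As a small illustration of the &debug=query suggestion quoted above, the sketch below only builds such a request URL; it does not contact a server. The host `http://localhost:8983`, the core name `mycore`, and the field name `text` are assumptions for the example, so substitute your own values.

```python
from urllib.parse import urlencode


def debug_query_url(base, core, query):
    """Build a /select URL that asks Solr to report the parsed query."""
    params = {
        "q": query,        # the user query, e.g. a wildcard term
        "debug": "query",  # the parameter suggested in the thread
        "rows": 0,         # we only care about the debug section here
    }
    return f"{base}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and field names.
url = debug_query_url("http://localhost:8983", "mycore", "text:kinase*")
print(url)
```

Opening the resulting URL against a running instance should return a debug section showing the query after parsing, which makes it easier to compare the plain and wildcard forms side by side.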
