Hi,
I could be wrong, but I'm starting to think that it has to do with the
fieldType. In our case, wildcards don't seem to work at all with text_en
types, but they do work with string types.
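
To make the hunch concrete, here is a minimal Python sketch (not Solr code). It assumes that text_en ends in a stemmer, as the PSF step in the Analysis output quoted below suggests, and that a string field is indexed untouched. The names toy_stem, index_terms, wildcard_hits and term_hits, the sample values, and the whitespace tokenization are all made up for illustration; the toy stemmer only drops a trailing "e" and is not the real Porter algorithm.

```python
from fnmatch import fnmatchcase


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: drops one trailing 'e'."""
    return token[:-1] if token.endswith("e") else token


def index_terms(values, stemmed):
    """Build the set of indexed terms, stemming them only if asked to."""
    terms = set()
    for value in values:
        for token in value.lower().split():
            terms.add(toy_stem(token) if stemmed else token)
    return terms


def wildcard_hits(pattern, terms):
    """Wildcard terms are compared to the indexed terms as typed (no stemming)."""
    return sorted(t for t in terms if fnmatchcase(t, pattern))


def term_hits(word, terms, stemmed):
    """A plain term goes through the same analysis as the indexed text."""
    query = toy_stem(word.lower()) if stemmed else word.lower()
    return sorted(t for t in terms if t == query)


docs = ["kinase", "kinase-like"]          # assumed sample field values
stemmed_index = index_terms(docs, stemmed=True)
plain_index = index_terms(docs, stemmed=False)

for pattern in ("kinase*", "kin*ase", "kin*se"):
    print(pattern, wildcard_hits(pattern, stemmed_index),
          wildcard_hits(pattern, plain_index))
print("kinase", term_hits("kinase", stemmed_index, stemmed=True))
```

In this toy model the stemmed index holds "kinas", so "kinase*" and "kin*ase" miss it while the plain term "kinase" still matches, which is roughly the pattern we are seeing with text_en versus string.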

On Thu, Feb 13, 2020 at 1:52 PM Fischer, Stephen <
sfisc...@pennmedicine.upenn.edu> wrote:

> Folks,
>
> I am seeing very strange (bad) wildcard behavior (solr 8).
>
> "kinase" finds hits as expected.
>
> "kin*ase" and "kin*se" find 0 results.  "kinase*" matches only values like
> "kinase," and "kinase-" but not "kinase"
>
> I have done the analysis as Erick suggested (thanks!) but it is not
> helping me understand why we'd have this problem.
>
> I have put together 12 screenshots from the Solr web UI that show in
> detail:
> - the queries I ran to get the results above
> - various analyses trying to understand why
> - the schema for the fieldType in question
>
>
> https://docs.google.com/presentation/d/10fIAesqkTnvmJBFaerEhnqWhSiaEvVW7u9jE1nX564Q/edit?usp=sharing
>
> thanks,
> steve
>
> -----Original Message-----
> From: Sotiris Fragkiskos <sfra...@gmail.com>
> Sent: Thursday, February 13, 2020 4:03 AM
> To: solr-user@lucene.apache.org
> Subject: [External] Re: wildcards match end-of-word?
>
> Hi Erick,
> thanks very much for this information, it was immensely useful, I always
> had the same question!
> I'm now seeing the Analysis page and finally I don't have to rely on an
> external online stemmer to see what solr *probably* stemmed the term to!!
> But I still can't make the asterisk and question mark work inside the
> term, even in the earlier parts of it.
> e.g. tr?ining
> I would expect it to match train. But it doesn't.
> PSF at the end just shows t | ain
> every line before that actually shows t | aining (ST,SF,SF,LCF,EPF,SKMF)
> Am I doing something very wrong??
>
> thanks again!
> Sotiri
>
> On Wed, Feb 12, 2020 at 1:44 PM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > Steve:
> >
> > You _really_ want to get acquainted with the admin UI/Analysis page ;).
> > Choose a core/collection and you should see the choice. It shows you
> > exactly what transformations your data goes through. If you hover over
> > the light gray pairs of letters, you’ll get a tooltip showing you what
> > part of your analysis chain is responsible for a particular change. I
> > un-check the “verbose” box 95% of the time BTW.
> >
> > The critical bit is that what comes out of the end of the analysis
> > pipe are the tokens that are actually _in_ the index. From there,
> > problems like this make more sense.
> >
> > My bet is that, as Walter says, you have a stemmer in the analysis
> > chain and the actual token in the index is “kinas” so of course
> > “kinase*” won’t be found. By adding OR kinase to the query, that token
> > is stemmed to “kinas” and matches.
> >
> > Also, adding &debug=query to your URL will show you what the query
> > looks like after parsing and analysis, also a major tool for figuring
> > out what’s really happening.
> >
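
In case it helps anyone following along, this is a small sketch of building such a URL from Python. Only the debug=query parameter comes from Erick's note; the host, port, collection name ("mycollection") and the helper debug_query_url are assumptions for illustration, so adjust them to your setup.

```python
from urllib.parse import urlencode


def debug_query_url(base_url, collection, query):
    """Build a select URL that asks Solr to include query debug output."""
    # urlencode percent-escapes special characters such as '*'.
    params = urlencode({"q": query, "debug": "query"})
    return f"{base_url}/solr/{collection}/select?{params}"


# Hypothetical local instance and collection name.
url = debug_query_url("http://localhost:8983", "mycollection", "kinase*")
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return the normal results plus the debug information about the parsed query.
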
> > Wildcards are not stemmed, which can lead to surprising results.
> > There’s no perfect answer here. Let’s claim wildcards _were_ stemmed.
> > Then you’d have to try to explain why “running*” returned a doc with
> > only “run” or “runner” or “runs” or... in it, but searching for
> > “runnin*” did not, due to the stemmer not recognizing it as a stemmable word.
> >
> > Finally, one of my personal hot buttons is wildcards in general.
> > They’re very often over-used because people are used to simple search
> > capabilities.
> > Something about “if your only tool is a hammer, every problem looks
> > like a nail”. That gets into training users too though...
> >
> > Best,
> > Erick
> >
> > > On Feb 11, 2020, at 9:24 PM, Fischer, Stephen <
> > > sfisc...@pennmedicine.upenn.edu> wrote:
> > >
> > > Hi,
> > >
> > > I am a solr newbie.  I was surprised to discover that a search for
> > > kinase* returned fewer results than kinase.
> > >
> > > Then I read the wildcard documentation
> > > <https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-WildcardSearches>,
> > > and saw why.  kinase* will not match the word "kinase".
> > >
> > > Our end-users won't expect this behavior.  Presumably the solution
> > > would be for them (actually us, on their behalf), to use
> > > kinase* OR kinase.
> > >
> > > But that is kind of a hack.
> > >
> > > Is there a way we can configure solr to have wildcards match on
> > > end-of-word?
> > >
> > > Thanks,
> > > Steve
> >
> >
>
