Re: wildcards match end-of-word?

Walter Underwood Thu, 13 Feb 2020 09:43:12 -0800

Remove the stopword and stemmer filters from your schema and reindex.

With a stopword filter in place, “a” is dropped at index time, so you can never match “vitamin a”.


Stemming interferes with wildcard matches. Either stem or do wildcards on a 
field, not both.
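
To illustrate the clash, here is a minimal Python sketch, not Solr code. It assumes a toy suffix-stripping function (`toy_stem`, hypothetical, standing in for whatever stemmer the schema uses) and uses Python's `fnmatch` to imitate `*` and `?` wildcards. Indexed tokens pass through the stemmer, but wildcard patterns are compared against the index as typed, so `kinase*` finds nothing in the stemmed field.

```python
from fnmatch import fnmatchcase


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_tokens(text: str, stem: bool) -> set:
    """Lowercase and whitespace-split the text, optionally stemming each token."""
    tokens = text.lower().split()
    return {toy_stem(t) if stem else t for t in tokens}


def wildcard_hits(pattern: str, indexed: set) -> list:
    """Wildcard terms are NOT stemmed; the pattern is matched against indexed tokens as-is."""
    return sorted(t for t in indexed if fnmatchcase(t, pattern.lower()))


def term_hit(term: str, indexed: set, stem: bool) -> bool:
    """A plain term goes through the same analysis as the indexed text."""
    analyzed = toy_stem(term.lower()) if stem else term.lower()
    return analyzed in indexed


text = "kinase kinases training"
stemmed_field = index_tokens(text, stem=True)      # contains "kinas" and "train"
unstemmed_field = index_tokens(text, stem=False)   # contains the original words

print(wildcard_hits("kinase*", stemmed_field))     # [] : the index holds "kinas", not "kinase"
print(wildcard_hits("kinase*", unstemmed_field))   # ['kinase', 'kinases']
print(wildcard_hits("tr?ining", stemmed_field))    # [] : the index holds "train"
print(wildcard_hits("tr?ining", unstemmed_field))  # ['training']
print(term_hit("kinase", stemmed_field, stem=True))  # True : the query term is stemmed too
```

The usual way to get both behaviours is to keep two fields, one stemmed for ordinary term queries and one unstemmed for wildcard queries, which is the "not both on one field" advice above.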

Also, what do your users expect to get with wildcard matches? Those are a slow 
and imprecise way to search. There is almost always a better way.

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/  (my blog)

> On Feb 13, 2020, at 1:03 AM, Sotiris Fragkiskos <[email protected]> wrote:
> 
> Hi Erick,
> thanks very much for this information, it was immensely useful, I always
> had the same question!
> I'm now seeing the Analysis page and finally I don't have to rely on an
> external online stemmer to see what solr *probably* stemmed the term to!!
> But I still can't make the asterisk and question mark work inside the term,
> even in the earlier parts of it.
> e.g. tr?ining
> I would expect it to match train. But it doesn't.
> PSF at the end just shows t | ain
> every line before that actually shows t | aining (ST,SF,SF,LCF,EPF,SKMF)
> Am I doing something very wrong??
> 
> thanks again!
> Sotiri
> 
> On Wed, Feb 12, 2020 at 1:44 PM Erick Erickson <[email protected]>
> wrote:
> 
>> Steve:
>> 
>> You _really_ want to get acquainted with the admin UI/Analysis page ;).
>> Choose a core/collection and you should see the choice. It shows you
>> exactly what transformations your data goes through. If you hover over the
>> light gray pairs of letters, you’ll get a tooltip showing you what part of
>> your analysis chain is responsible for a particular change. I un-check the
>> “verbose” box 95% of the time BTW.
>> 
>> The critical bit is that what comes out of the end of the analysis pipe
>> are the tokens that are actually _in_ the index. From there, problems like
>> this make more sense.
>> 
>> My bet is that, as Walter says, you have a stemmer in the analysis chain
>> and the actual token in the index is “kinas” so of course “kinase*” won’t
>> be found. By adding OR kinase to the query, that token is stemmed to
>> “kinas” and matches.
>> 
>> Also, adding &debug=query to your URL will show you what the query looks
>> like after parsing and analysis, also a major tool for figuring out what’s
>> really happening.
>> 
>> Wildcards are not stemmed, which can lead to surprising results. There’s
>> no perfect answer here. Let’s claim wildcards _were_ stemmed. Then you’d
>> have to try to explain why “running*” returned a doc with only “run” or
>> “runner” or “runs” or... in it, but searching for “runnin*” did not, due to the
>> stemmer not recognizing it as a stemmable word.
>> 
>> Finally, one of my personal hot buttons is wildcards in general. They’re
>> very often over-used because people are used to simple search capabilities.
>> Something about “if your only tool is a hammer, every problem looks like a
>> nail”. That gets into training users too though...
>> 
>> Best,
>> Erick
>> 
>>> On Feb 11, 2020, at 9:24 PM, Fischer, Stephen <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> I am a solr newbie.  I was surprised to discover that a search for
>>> kinase* returned fewer results than kinase.
>>> 
>>> Then I read the wildcard documentation
>>> <https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-WildcardSearches>,
>>> and saw why.  kinase* will not match the word "kinase".
>>> 
>>> Our end-users won't expect this behavior.  Presumably the solution would
>>> be for them (actually us, on their behalf), to use kinase* OR kinase.
>>> 
>>> But that is kind of a hack.
>>> 
>>> Is there a way we can configure solr to have wildcards match on
>>> end-of-word?
>>> 
>>> Thanks,
>>> Steve
>> 
>>
