Re: Matching Queries with Wildcards and Numbers

Erick Erickson Wed, 17 Jun 2015 19:32:53 -0700

This one's going to be confusing to explain.....

The ability of filters to operate on wildcarded terms at query time is limited
to a specific set of filters. If you're going into the code, see the
MultiTermAware-derived filters (the factories that implement
MultiTermAwareComponent).


Most generally, a filter can only be MultiTermAware if it does _not_ produce
more than one output token for a given input token. Gibberish, I know, but
bear with me.

WordDelimiterFilterFactory is _NOT_ MultiTermAware because, you guessed it,
it can produce more than one token per input token at query time. Specifically
in your example, at index time it'll produce tokens "Sidem" and "2".

However, at query time the wildcarded term "Sidem2" is not run through WDFF;
the token is just passed through whole. And since that token is not in your
index, it's not found. Hmm, I wonder what the admin/analysis page would show
here....
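
To make the pass-through concrete, here is a rough sketch of what the
wildcard-time analysis for your field amounts to, written as an explicit
multiterm analyzer. This block is not in your schema, and I'm assuming Solr
builds something equivalent on its own from the MultiTermAware parts of your
query analyzer, so treat it as an illustration rather than something to paste
in:

```xml
<!-- Hypothetical illustration only: an explicit multiterm analyzer that
     approximates what is applied to wildcarded terms such as *Sidem2*.
     Note that WordDelimiterFilterFactory is absent, so the term is
     lowercased but never split into "Sidem" and "2". -->
<analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```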

Anyway, you can probably get what you want by changing the index-time
definition of WDFF from catenateAll="0" to catenateAll="1". That will put
Sidem, 2, and Sidem2 in your index. Then, because query-time processing for
wildcards does _not_ break things up, Sidem2 will go through intact at query
time and the doc should be found.
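
As a sketch, the index-time WDFF line from your schema would then look like
the following. Only catenateAll changes; every other attribute is copied from
the fieldType you posted, and the query-time analyzer stays as it is:

```xml
<!-- Index-time analyzer only: catenateAll changed from "0" to "1" so that
     the joined token (Sidem2) is indexed alongside the parts (Sidem, 2). -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
```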

Of course you have to reindex your docs after the change.

Trying to allow wildcards for filters at query time that emit multiple output
tokens per input token is an utter and complete disaster.

HTH,
Erick


On Wed, Jun 17, 2015 at 10:56 AM, Ellington Kirby
<ellingtonkirb...@gmail.com> wrote:
> Hi! I am a Solr user having an issue with matches on searches using the
> wildcard operators, specifically when the searches include a wildcard
> operator with a number. Here is an example.
> My query will look like (productTitle:*Sidem2*) and match nothing, when it
> should be matching the productTitle Sidem2. However, searching for Sidem
> will match the productTitle Sidem2. In addition, I have isolated it to only
> fail to match when the productTitle has a number in it, for example a query
> for (productTitle:*Cupx Collapsed*) will correctly match the product Cupx
> Collapsed. I need to use the wildcard operators around the query so that an
> auto-complete feature can be used, where if a user stops typing at a
> certain point, a search will be executed on their input so far and it will
> match the correct product titles. I have looked all over, through the
> excellent book Solr In Action by Grainger and Potter, through Stack
> Overflow and several blog posts and have not found anything on this
> specific issue. Common advice is to remove the stemmer, which I have done.
> I have also added the ReversedWildcardFilterFactory. Here is a copy of my
> schema for the specific fieldType if that is any help. Please let me know
> if anyone has any tips or clues! I am not a very experienced Solr user and
> would really appreciate any advice.
>
>
>   <fieldType name="text_en_splitting" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>         <analyzer type="index">
>             <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>             <!-- in this example, we will only use synonyms at query time
>         <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>         -->
>             <!-- Case insensitive stop word removal.
>         -->
>             <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="lang/stopwords_en.txt"
>                 />
>             <!-- Concatenate characters and numbers by setting catenateAll
> to 1 - this will avoid problems with alphabetical sort -->
>             <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>             <filter class="solr.LowerCaseFilterFactory"/>
>             <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
>             <filter class="solr.ReversedWildcardFilterFactory"
> withOriginal="true"
>              maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2"
> maxFractionAsterisk="0"/>
>         </analyzer>
>         <analyzer type="query">
>             <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>             <filter class="solr.SynonymFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>             <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="lang/stopwords_en.txt"
>                 />
>             <!-- Concatenate characters and numbers by setting catenateAll
> to 1 - this will avoid problems with alphabetical sort -->
>             <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
> preserveOriginal="1"/>
>             <filter class="solr.LowerCaseFilterFactory"/>
>             <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
>         </analyzer>
>     </fieldType>
>
>
> Thank you in advance!
> --From a sincerely puzzled Solr user, Ellington Kirby
