Hi,

I'm in the process of revising a schema for the search function of an
eCommerce platform.  One of the sticking points is a particular use case of
searching for "xx yy" where xx is any number and yy is an abbreviation for
a unit of measurement (mm, cc, ml, in, etc.).  The problem is that
searching for "xx yy" and "xxyy" return different results. One possible
solution I tried was applying a few PatternReplaceCharFilterFactories to
remove the whitespace between xx and yy if there was any (at both index-
and query-time).  These are the first few lines in the analyzer:

<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="(?i)(\d+)\s?(pounds?|lbs?)" replacement="$1lb" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="(?i)(\d+)\s?(inch[es]?|in?)" replacement="$1in" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="(?i)(\d+)\s?(ounc[es]?|oz)" replacement="$1oz" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="(?i)(\d+)\s?(quarts?|qts?)" replacement="$1qt" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="(?i)(\d+)\s?(gallons?|gal?)" replacement="$1gal" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="(?i)(\d+)\s?(mm|cc|ml)" replacement="$1$2" />
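
To make the intended normalization concrete, here is a rough sketch that applies the same six patterns, in the same order, using Python's `re` module. It is only an approximation of the char filter chain, not Solr itself: the helper name `normalize_units` is made up for illustration, and it assumes Python and Java regexes behave the same way for these particular patterns.

```python
import re

# (pattern, replacement) pairs copied from the charFilter chain above.
# Java's "$1" replacement syntax is written "\g<1>" in Python.
CHAR_FILTERS = [
    (r"(?i)(\d+)\s?(pounds?|lbs?)", r"\g<1>lb"),
    (r"(?i)(\d+)\s?(inch[es]?|in?)", r"\g<1>in"),
    (r"(?i)(\d+)\s?(ounc[es]?|oz)", r"\g<1>oz"),
    (r"(?i)(\d+)\s?(quarts?|qts?)", r"\g<1>qt"),
    (r"(?i)(\d+)\s?(gallons?|gal?)", r"\g<1>gal"),
    (r"(?i)(\d+)\s?(mm|cc|ml)", r"\g<1>\g<2>"),
]


def normalize_units(text):
    """Hypothetical helper: run each replacement in order over the raw text."""
    for pattern, replacement in CHAR_FILTERS:
        text = re.sub(pattern, replacement, text)
    return text


# Print the normalized form of a few sample inputs.
for sample in ["10 mm", "10mm", "5 lbs", "12 in", "5 inches", "5 items"]:
    print(repr(sample), "->", repr(normalize_units(sample)))
```

In this sketch "10 mm" and "10mm" both normalize to "10mm", which is the behavior I am after. It also shows two side effects of the patterns as written: `[es]?` is an optional single character (e or s) rather than an optional "es", so "5 inches" comes out as "5ins", and `in?` also matches a bare "i", so "5 items" comes out as "5intems".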

A few more lines down, I use a PatternCaptureGroupFilterFactory to emit the
tokens "xxyy", "xx", and "yy":

<filter class="solr.PatternCaptureGroupFilterFactory"
pattern="(\d+)(lb|oz|in|qt|gal|mm|cc|ml)" preserve_original="true" />
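
As a rough illustration of what I expect from this filter, the sketch below mimics it for a single token in Python. The function name `capture_group_tokens` is hypothetical, token positions and offsets are not modeled, and it assumes the token has already been normalized to the "xxyy" form by the char filters.

```python
import re

# Same pattern as the PatternCaptureGroupFilterFactory above (case-sensitive).
UNIT_PATTERN = re.compile(r"(\d+)(lb|oz|in|qt|gal|mm|cc|ml)")


def capture_group_tokens(token, preserve_original=True):
    """Hypothetical helper: return the original token plus each capture group."""
    tokens = [token] if preserve_original else []
    for match in UNIT_PATTERN.finditer(token):
        for group in match.groups():
            # Skip empty groups and anything already emitted.
            if group and group not in tokens:
                tokens.append(group)
    return tokens


print(capture_group_tokens("10mm"))
print(capture_group_tokens("widget"))
```

A token with no number-plus-unit match passes through unchanged, so ordinary words are unaffected in this sketch.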

In the Solr admin analysis tool for the field type this applies to, both
"xx yy" and "xxyy" are tokenized and filtered down identically (at both
index- and query-time).

The platform I'm working on searches many different fields by default, but
even when I rig up the query to only search in this one field, I still get
different results for "xxyy" and "xx yy".  I'm wondering why this is.

Attached is a screenshot from Solr analysis.

Thanks,
John
