Hi,

I'm revising a schema for the search function of an eCommerce platform. One of the sticking points is searching for "xx yy", where xx is any number and yy is an abbreviation for a unit of measurement (mm, cc, ml, in, etc.). The problem is that searching for "xx yy" and searching for "xxyy" return different results.

One solution I tried was applying a few PatternReplaceCharFilterFactories to remove any whitespace between xx and yy (at both index and query time). These are the first few lines in the analyzer:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?i)(\d+)\s?(pounds?|lbs?)" replacement="$1lb" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?i)(\d+)\s?(inch[es]?|in?)" replacement="$1in" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?i)(\d+)\s?(ounc[es]?|oz)" replacement="$1oz" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?i)(\d+)\s?(quarts?|qts?)" replacement="$1qt" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?i)(\d+)\s?(gallons?|gal?)" replacement="$1gal" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?i)(\d+)\s?(mm|cc|ml)" replacement="$1$2" />

A few lines further down, I use a PatternCaptureGroupFilterFactory to emit the tokens "xxyy", "xx", and "yy":

<filter class="solr.PatternCaptureGroupFilterFactory" pattern="(\d+)(lb|oz|in|qt|gal|mm|cc|ml)" preserve_original="true" />

In the Solr admin analysis tool, for the field type this applies to, "xx yy" and "xxyy" are tokenized and filtered down identically (at both index and query time).

The platform I'm working on searches many different fields by default, but even when I restrict the query to this one field, I still get different results for "xxyy" and "xx yy". Why would the results differ when the analysis output is the same?

Attached is a screenshot from Solr analysis.

Thanks,
John