Say there is an index of business names (fairly short text snippets)
containing Walmart, Walmart Bakery and Mini Mart, and we need a query
for 'wal mart' to match all three, in an appropriate ranking order. We
also need 'walmart', 'walmart bakery' and 'bakery' to find the right
things in the right order.

Here is the solution we came up with:

1. Use the edismax query parser (we don't need it for this, but do for a
number of other requirements).

2. On the index side, apply ShingleFilter, then remove the word separators
in the shingles, so that "walmart bakery" is indexed as "walmart",
"bakery", "walmartbakery".
    The schema for this index looks like this:
      <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="'+" replacement=""/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" minShingleSize="2"
                maxShingleSize="3" outputUnigrams="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="\W+"
                replacement=""/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>

3. Before sending the original query to Solr, modify it by appending a
whitespace-stripped version of itself. Thus 'wal mart' becomes 'wal mart
walmart' and 'walmart bakery' becomes 'walmart bakery walmartbakery'.
Don't modify the query if it has only one word in it, or if it contains
any edismax syntax (double quotes, or a plus or minus at the beginning of
the query or after whitespace).

4. ... profit.
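
For illustration, the query rewriting in step 3 could be sketched in Python
roughly as below. The function name expand_query and the regular expression
are hypothetical, written only from the rules stated in step 3; our actual
client code is not shown here.

```python
import re

# Assumption: "edismax syntax" means exactly what step 3 lists: a double
# quote anywhere, or a + or - at the start of the query or after whitespace.
_EDISMAX_SYNTAX = re.compile(r'"|(?:^|\s)[+-]')


def expand_query(query: str) -> str:
    """Append a whitespace-stripped copy of a multi-word query.

    Single-word queries and queries containing edismax syntax are
    returned unchanged.
    """
    words = query.split()
    if len(words) < 2 or _EDISMAX_SYNTAX.search(query):
        return query
    return f"{query} {''.join(words)}"


if __name__ == "__main__":
    for q in ["wal mart", "walmart bakery", "walmart", '"wal mart"', "wal -mart"]:
        print(repr(q), "->", repr(expand_query(q)))
```

A hyphen inside a word (as in 'wal-mart') is not treated as edismax syntax
by this sketch, since it is neither at the start of the query nor preceded
by whitespace.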

The reason we have to shingle the query before it reaches Solr is that the
edismax parser treats 'wal mart' as two separate queries, 'wal' OR 'mart',
and analyzes each one on its own. A ShingleFilter in the query analyzer
therefore never sees two adjacent words, so it doesn't do anything.
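
To see why the pre-expanded query 'wal mart walmart' reaches all three
names, the index-side analyzer chain can be emulated roughly in Python. This
is only an approximation for reasoning about terms, not Solr itself: the
function names are hypothetical, ASCII folding is approximated with Unicode
decomposition, the terms come out in a different order than Solr would emit
them, and the query terms are assumed to be simply split on whitespace and
lowercased.

```python
import re
import unicodedata


def index_terms(text, min_size=2, max_size=3):
    """Approximate the index analyzer: strip apostrophes, split on
    whitespace, fold to ASCII, add 2- and 3-word shingles, remove
    non-word characters, lowercase."""
    text = re.sub(r"'+", "", text)                 # PatternReplaceCharFilter
    folded = (unicodedata.normalize("NFKD", text)  # rough ASCII folding
              .encode("ascii", "ignore").decode("ascii"))
    tokens = folded.split()                        # WhitespaceTokenizer
    terms = list(tokens)                           # outputUnigrams="true"
    for size in range(min_size, max_size + 1):     # ShingleFilter
        for i in range(len(tokens) - size + 1):
            terms.append(" ".join(tokens[i:i + size]))
    # PatternReplaceFilter (\W+ -> "") followed by LowerCaseFilter
    return [re.sub(r"\W+", "", t).lower() for t in terms]


def matching_terms(query, name):
    """Return the query terms that also occur among the name's index terms."""
    query_terms = {t.lower() for t in query.split()}
    return sorted(query_terms & set(index_terms(name)))


if __name__ == "__main__":
    for name in ["Walmart", "Walmart Bakery", "Mini Mart"]:
        print(name, "->", matching_terms("wal mart walmart", name))
```

In this emulation, "Walmart" and "Walmart Bakery" match on the appended
'walmart' term, while "Mini Mart" matches only on 'mart'.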

This works, but feels a little dirty. Is there a more elegant way to solve
this problem?

-- 
Alex Verkhovsky
