Hi Alexey,

Lucene's QueryParser, and at least some of Solr's query parsers - I'm not 
familiar with all of them - have the problem you mention: analyzers are fed 
queries word-by-word, instead of whole strings between operators.  There is a 
JIRA issue for fixing this, but no work done yet: 
<https://issues.apache.org/jira/browse/LUCENE-2605>.

Separately, do you know about the "raw" query parser[2]?  I'm not sure if it 
would help, but you may be able to use it in an alternate solution.

One small simplification I can think of for your current setup: 
ShingleFilterFactory[1] takes an option called "tokenSeparator" - if you set 
this to the empty string (""), you can eliminate your whitespace-stripping 
filter.
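
For illustration, the index analyzer might then look something like the 
sketch below. It is untested: it is just your analyzer with tokenSeparator="" 
added and the PatternReplaceFilterFactory dropped. One caveat: your \W+ 
pattern also strips non-word characters inside single tokens, not only the 
separator in shingles, so if you depend on that you would still need the 
filter.

```xml
<analyzer type="index">
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="'+" replacement=""/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <!-- tokenSeparator="" joins the words of each shingle with nothing
       between them, so "walmart bakery" yields "walmartbakery" -->
  <filter class="solr.ShingleFilterFactory" minShingleSize="2"
          maxShingleSize="3" outputUnigrams="true" tokenSeparator=""/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```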

Steve

[1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory
[2] http://wiki.apache.org/solr/SolrQuerySyntax#Other_built-in_useful_query_parsers

> -----Original Message-----
> From: Alexey Verkhovsky [mailto:alexey.verkhov...@gmail.com]
> Sent: Monday, February 27, 2012 1:26 PM
> To: solr-user@lucene.apache.org
> Subject: Combining ShingleFilter and DisMaxParser, with a twist
> 
> Say, there is an index of business names (fairly short text snippets),
> containing: Walmart, Walmart Bakery and Mini Mart. And say we need a query
> for 'wal mart' to match all three, with an appropriate ranking order. Also
> need 'walmart', 'walmart bakery' and 'bakery' to find the right things in
> the right order.
> 
> Here is the solution we came up with:
> 
> 1. edismax query parser (we don't need it for this, but do for a number of
> other requirements)
> 
> 2. On the index, apply ShingleFilter, then remove word separators in the
> shingles, so that "walmart bakery" is indexed as  "walmart", "bakery",
> "walmartbakery"
>     Schema for this index looks like this:
>       <analyzer type="index">
>         <charFilter class="solr.PatternReplaceCharFilterFactory"
>                     pattern="'+" replacement=""/>
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.ASCIIFoldingFilterFactory"/>
>         <filter class="solr.ShingleFilterFactory" minShingleSize="2"
>                 maxShingleSize="3" outputUnigrams="true"/>
>         <filter class="solr.PatternReplaceFilterFactory" pattern="\W+"
>                 replacement=""/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
> 
> 3. Before sending the original query to Solr, modify it by adding a
> whitespace-stripped version of it. Thus, 'wal mart' becomes 'wal mart
> walmart' and 'walmart bakery' becomes 'walmart bakery walmartbakery'. Don't
> modify the query if it only has one word in it, or contains any edismax
> syntax (double quotes; pluses and minuses in the beginning of a query or
> after whitespace).
> 
> 4. ... profit.
> 
> The reason we have to shingle the query before Solr is that edismax parser
> treats 'wal mart' as two queries - 'wal' OR 'mart', so applying the
> ShingleFilter in the query analyzer doesn't do anything.
> 
> This works, but feels a little dirty. Is there a more elegant way to solve
> this problem?
> 
> --
> Alex Verkhovsky
