Is there a good way of handling a minimum match value greater than 1 with token 
filters that add tokens to the stream?

Say you have a field with the DoubleMetaphone filter for phonetic matching:

<filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="6"/>

This adds up to two tokens to the stream alongside the original token: one for 
the primary phonetic code and one for the secondary.  If I have the min match 
set to 2 (mm=2) and my query only has a single token in it, then I only get 
results where at least 2 of the expanded tokens match.  This means that 
documents that only match on a phonetic token aren't included.

Example:

Field:
<fieldType name="name" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="6"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Document:
{ id: 1, lastName: "meneghini" }
(The index token stream for the lastName field is {meneghini, MNKN}.)

Searching (using edismax) with q=meneghini&mm=2 returns document 1, as 
expected, but searching q=menegini&mm=2 does not.  However, q=menegini&mm=1 
does.  The first query works because, after the phonetic filter, the query 
token stream has 2 tokens (meneghini, MNKN), and both of them match the index 
tokens, satisfying the mm parameter.  With the phonetic misspelling menegini, 
the query token stream is {menegini, MNJN, MNKN}; only one of those 3 tokens 
matches, so it falls below the mm threshold.  The third query only needs one 
match, which it gets on the phonetic code MNKN.

This seems like counter-intuitive behavior for mm (at least for my use case), 
since I'm only interested in the original query terms being subject to the mm 
limitation, not the expanded token set.  I would imagine this would be an issue 
with synonym expansion and any other filter that might add tokens at query time 
as well.

Possible solutions I've thought of:

- Just use the regular PhoneticFilterFactory with inject="false" in a 
  separate copy field since it will only emit one token per input token.  :(

- Subclass the DoubleMetaphoneFilterFactory to add a parameter to specify if 
  only the primary or secondary token should be emitted.  Then have a 
  separate field type and copy field for each and search the original field, 
  the primary phonetic token field, and the secondary token field with each 
  query.  This only solves for this specific case with the double metaphone 
  filter, since it will add at most 2 tokens.  Other filters like 
  BeiderMorseFilterFactory or SynonymFilterFactory might add an arbitrary 
  number.

- Change {lots of things} to allow filters to set a flag on a token that the 
  query parser can use to determine that it should not count it against the 
  minimum match requirement.

- ?
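
For reference, a minimal sketch of the first option.  The names name_phonetic 
and lastName_phonetic are made up for illustration, and it assumes lastName 
itself is indexed with a type that has no phonetic filter:

```xml
<!-- Hypothetical field type: each input token is replaced by one phonetic code -->
<fieldType name="name_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- inject="false" drops the original token, so one token in gives one token out -->
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
</fieldType>

<!-- Hypothetical phonetic copy of lastName, searched alongside the original field -->
<field name="lastName_phonetic" type="name_phonetic" indexed="true" stored="false"/>
<copyField source="lastName" dest="lastName_phonetic"/>
```

The query would then list both fields, e.g. qf=lastName lastName_phonetic.  
Since each field emits one token per query term, mm should count the original 
query terms rather than injected tokens.  The catch is that, as far as I know, 
this encoder only emits the primary code, so a match that depends on the 
secondary code (MNKN for menegini in the example above) would still be missed.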

Any thoughts?

Matt
