Hi mck,

On 09/09/2008 at 12:58 PM, Mck wrote:
> > *ShortVersion*
> >  is there a way to make the ShingleFilter perform exact matching via
> > inserting ^ $ begin/end markers?
> 
> Reading through the mailing list i see how exact matching can
> be done, a la STFW to myself...
> 
> So the ShortVersion now stands:
> 
> For my query "abcd efgh ijkl"
> Why does a (perfect looking) MultiPhraseQuery with
>       termArrays[0] = { list_entry_shingles:abcd
>                         list_entry_shingles:abcd efgh
>                         list_entry_shingles:abcd efgh ijkl
>                       }
>       termArrays[1] = { list_entry_shingles:efgh
>                         list_entry_shingles:efgh ijkl
>                       }
>       termArrays[2] = { list_entry_shingles:ijkl }
> 
> return only "abcd efgh ijkl" !?
> 
> (when the field is indexed as TextField this is the only hit i get)
> (when the field is indexed as StrField i get zero hits!)
> 
> When the index contains 9 entries:
>  "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh",
> "ijkl", "ijkl efgh", "efgh abcd", and "ijkl efgh abcd".
> 
> Does this MultiPhraseQuery actually require a match against
> *every* item in each termArray on any document?

I've never used MultiPhraseQuery, but I *think* (based on the Javadocs) that it 
requires one  match from each termArrays[] entry, contiguously, in the same 
sequence as the termArrays[] entries (unless you add slop, which I don't think 
you're doing).

A TextField index would have ("abcd", "efgh", "ijkl") for the "abcd efgh ijkl" 
document (assuming you used WhitespaceAnalyzer, which I believe you showed in 
one of your emails); unlike all of the other documents, one member from each of 
your query's termArrays[] entries is sequentially present, so I think that the 
behavior you're seeing is expected.  If you add "abcd efgh ijkl mnop" as a 
document, it should also be matched.

Looks to me like MultiPhraseQuery is getting in the way.  Shingles that begin 
at the same word are given the same position by ShingleFilter, and Solr's 
FieldQParserPlugin creates a MultiPhraseQuery when it encounters tokens in a 
query with the same position.  I think what you want is to convert queries into 
shingle disjunctions (*any* matching shingle results in a hit),  right?

Any Solr cognoscenti know how to arrange for Solr's query parser to avoid 
invoking MultiPhraseQuery?

Steve

Reply via email to