Hello Solr users,

I’m quite puzzled about how shingles work. The way tokens are analysed looks 
fine to me, but the query seems too restrictive.

Here’s the sample use-case. I have three documents:

mona lisa smile
mona lisa
mona

I have a shingle filter set up like this (both index- and query-time):

> <filter class="solr.ShingleFilterFactory" minShingleSize="2" 
> maxShingleSize="4"/>

When I query for “Mona Lisa smile” (no quotes), I expect to get all three 
documents back, in that order, because the first document matches all the terms:

mona
mona lisa
mona lisa smile
lisa
lisa smile
smile

The second document matches only some of them, and the third matches only one.
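
To make that expectation concrete, here is a rough Python sketch of the term overlap I have in mind. It is not Solr's code: the `shingles` helper is hypothetical and only imitates a whitespace tokenizer plus lowercasing, with the shingle sizes from my filter and unigrams kept in the output (which I assume is the default, since the terms listed above include single words).

```python
def shingles(text, min_size=2, max_size=4, output_unigrams=True):
    """Imitate the terms a shingle filter would emit (hypothetical helper)."""
    words = text.lower().split()
    terms = []
    for start in range(len(words)):
        if output_unigrams:
            terms.append(words[start])
        # Emit every shingle of min_size..max_size words that fits at this position.
        for size in range(min_size, max_size + 1):
            if start + size <= len(words):
                terms.append(" ".join(words[start:start + size]))
    return terms


docs = ["mona lisa smile", "mona lisa", "mona"]
query_terms = set(shingles("Mona Lisa smile"))

for doc in docs:
    overlap = query_terms & set(shingles(doc))
    print(f"{doc!r}: {len(overlap)} of {len(query_terms)} query terms match")
```

With an “any of the terms” query, this overlap is what I would expect to drive both the matching and the ordering of the three documents.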

Instead, I only get the first document back. That’s because the query expects 
all the “words” to match:

> "parsedquery":"+DisjunctionMaxQuery((((+shingle_field:mona 
> +usage_query_view_tags:lisa +shingle_field:smile) (+shingle_field:mona 
> +shingle_field:lisa smile) (+shingle_field:mona lisa +shingle_field:smile) 
> shingle_field:mona lisa smile)))",

The query above is generated by the Edismax query parser, when I’m using 
“shingle_field” as “df”.

Is there a way to get “any of the words” to match? I’ve tried all the options I 
can think of:
- different query parsers
- q.op=OR
- mm=0 (or 1 or 0% or 10% or…)

Nothing seems to change the parsed query from the above.
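
For reference, the requests I tried look roughly like the ones built by this Python sketch. The host, port and collection name in `BASE_URL` are placeholders, not my real setup, and the `build_query_url` helper is only there for illustration; the parameters themselves are the ones mentioned above, plus debugQuery=true to get the parsed query back.

```python
from urllib.parse import urlencode

# Placeholder endpoint: host, port and collection name are assumptions.
BASE_URL = "http://localhost:8983/solr/mycollection/select"


def build_query_url(mm="1", q_op="OR"):
    """Build a select URL with the parameters tried above (hypothetical helper)."""
    params = {
        "q": "Mona Lisa smile",
        "defType": "edismax",
        "df": "shingle_field",
        "q.op": q_op,
        "mm": mm,
        "debugQuery": "true",  # includes "parsedquery" in the response
    }
    return BASE_URL + "?" + urlencode(params)


# One URL per mm value from the list above.
for mm in ["0", "1", "0%", "10%"]:
    print(build_query_url(mm=mm))
```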

I’ve compared this to the behaviour of Elasticsearch. There, I get “OR” by 
default, and minimum_should_match works as expected. The only difference I see 
between the two, on the analysis side, is that token positions start at 0 in 
Elasticsearch and at 1 in Solr. I doubt that’s the problem, because I see that 
the default “text_en”, for example, also starts at position 1.

Is it just a bug that mm doesn’t work in the context of shingles? Or is there a 
workaround?

Thanks and best regards,
Radu
