Hello Solr users, I’m quite puzzled about how shingles work. The way tokens are analysed looks fine to me, but the query seems too restrictive.
Here’s the sample use-case. I have three documents:

  mona lisa smile
  mona lisa
  mona

I have a shingle filter set up like this (both index- and query-time):

  <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4"/>

When I query for “Mona Lisa smile” (no quotes), I expect to get all three documents back, in that order, because the first document matches all the terms:

  mona
  mona lisa
  mona lisa smile
  lisa
  lisa smile
  smile

The second document matches only some of them, and the third matches only one.

Instead, I only get the first document back. That’s because the query expects all the “words” to match:

  "parsedquery":"+DisjunctionMaxQuery((((+shingle_field:mona +usage_query_view_tags:lisa +shingle_field:smile) (+shingle_field:mona +shingle_field:lisa smile) (+shingle_field:mona lisa +shingle_field:smile) shingle_field:mona lisa smile)))",

The query above is generated by the Edismax query parser, when I’m using “shingle_field” as “df”.

Is there a way to get “any of the words” to match? I’ve tried all the options I can think of:

- different query parsers
- q.OP=OR
- mm=0 (or 1 or 0% or 10% or…)

Nothing seems to change the parsed query from the above.

I’ve compared this to the behaviour of Elasticsearch. There, I get “OR” by default, and minimum_should_match works as expected. The only difference I see between the two, on the analysis side, is that tokens start at 0 in Elasticsearch and at 1 in Solr. I doubt that’s the problem, because I see that the default “text_en”, for example, also starts at position 1.

Is it just a bug that mm doesn’t work in the context of shingles? Or is there a workaround?

Thanks and best regards,
Radu