Hi Alex, long time no see :)

I tried with sow, and that basically invalidates query-time shingles (it only matches mona OR lisa OR smile).
I'm using shingles at both index and query time as a substitute for pf2 and
pf3: the more shingles I match, the more relevant the document. Also,
higher-order shingles naturally have lower frequencies, so they get a
"natural" boost.

Best regards,
Radu

On Thu, 21 May 2020, 00:28 Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> Did you try it with the 'sow' parameter both ways? I am not sure I fully
> understand the question, especially with shingling on both passes
> rather than just the indexing one. But at least it is something to try and
> is one of the difference areas between Solr and ES.
>
> Regards,
>    Alex.
>
> On Tue, 19 May 2020 at 05:59, Radu Gheorghe <radu.gheor...@sematext.com>
> wrote:
> >
> > Hello Solr users,
> >
> > I’m quite puzzled about how shingles work. The way tokens are analysed
> > looks fine to me, but the query seems too restrictive.
> >
> > Here’s the sample use-case. I have three documents:
> >
> > mona lisa smile
> > mona lisa
> > mona
> >
> > I have a shingle filter set up like this (both index- and query-time):
> >
> > <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4"/>
> >
> > When I query for “Mona Lisa smile” (no quotes), I expect to get all
> > three documents back, in that order, because the first document matches
> > all the terms:
> >
> > mona
> > mona lisa
> > mona lisa smile
> > lisa
> > lisa smile
> > smile
> >
> > The second document matches only some of them, and the third matches
> > only one.
> >
> > Instead, I only get the first document back. That’s because the query
> > expects all the “words” to match:
> >
> > "parsedquery":"+DisjunctionMaxQuery((((+shingle_field:mona +usage_query_view_tags:lisa +shingle_field:smile) (+shingle_field:mona +shingle_field:lisa smile) (+shingle_field:mona lisa +shingle_field:smile) shingle_field:mona lisa smile)))",
> >
> > The query above is generated by the Edismax query parser, when I’m using
> > “shingle_field” as “df”.
> >
> > Is there a way to get “any of the words” to match? I’ve tried all the
> > options I can think of:
> > - different query parsers
> > - q.op=OR
> > - mm=0 (or 1 or 0% or 10% or…)
> >
> > Nothing seems to change the parsed query from the above.
> >
> > I’ve compared this to the behaviour of Elasticsearch. There, I get “OR”
> > by default, and minimum_should_match works as expected. The only difference
> > I see between the two, on the analysis side, is that token positions start
> > at 0 in Elasticsearch and at 1 in Solr. I doubt that’s the problem, because
> > I see that the default “text_en”, for example, also starts at position 1.
> >
> > Is it just a bug that mm doesn’t work in the context of shingles? Or is
> > there a workaround?
> >
> > Thanks and best regards,
> > Radu
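
For reference, the "any of the words" behaviour asked about in this thread can be sketched outside Solr. The Python below is only an illustration and makes several assumptions: whitespace tokenization, lowercasing, unigrams emitted alongside shingles, and a plain overlap count standing in for relevance. It is not Solr's or Lucene's scoring, and the names shingles and overlap are made up for this sketch.

```python
def shingles(text, min_size=2, max_size=4, output_unigrams=True):
    """Return unigrams plus word shingles of min_size..max_size tokens.

    Hypothetical helper: it mimics the term list given in the thread,
    not the exact token stream produced by solr.ShingleFilterFactory.
    """
    tokens = text.lower().split()  # assumption: whitespace tokenizer + lowercase
    sizes = list(range(min_size, max_size + 1))
    if output_unigrams:
        sizes.insert(0, 1)
    terms = []
    for start in range(len(tokens)):
        for size in sizes:
            if start + size <= len(tokens):
                terms.append(" ".join(tokens[start:start + size]))
    return terms


docs = ["mona lisa smile", "mona lisa", "mona"]
query_terms = set(shingles("Mona Lisa smile"))


def overlap(doc):
    """Count how many query terms also occur among the document's terms."""
    return len(query_terms & set(shingles(doc)))


# "OR" semantics: any document sharing at least one term matches,
# and documents sharing more terms rank higher.
ranked = sorted((d for d in docs if overlap(d) > 0), key=overlap, reverse=True)
for doc in ranked:
    print(overlap(doc), doc)
```

Under these assumptions all three documents match, with "mona lisa smile" first, then "mona lisa", then "mona", which is the ordering the original question expected.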