Re: Shingles behavior

Radu Gheorghe Thu, 21 May 2020 21:45:54 -0700

Turns out, it’s down to setting enableGraphQueries=false in the field 
definition. I completely missed that :(
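
For the archives, here is a minimal sketch of what such a definition might look like. Only the shingle filter settings and enableGraphQueries come from this thread. The type name "text_shingles", the tokenizer, the lowercase filter and the field attributes are assumptions for illustration, so adapt them to your own schema:

```xml
<!-- Hypothetical field type: the name and the tokenizer/lowercase choices are assumed, not taken from the thread -->
<fieldType name="text_shingles" class="solr.TextField"
           positionIncrementGap="100" enableGraphQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Shingle filter as quoted in the original question -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="4"/>
  </analyzer>
</fieldType>

<!-- Field name as used in the thread; indexed/stored values are assumed -->
<field name="shingle_field" type="text_shingles" indexed="true" stored="true"/>
```

Note that enableGraphQueries is set on the fieldType element, and it should only matter for queries that are not split on whitespace (sow=false).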


> On 21 May 2020, at 07:49, Radu Gheorghe <radu.gheor...@sematext.com> wrote:
> 
> Hi Alex, long time no see :)
> 
> I tried with sow, and that basically invalidates query-time shingles (it only 
> matches mona OR lisa OR smile).
> 
> I'm using shingles at both index and query time as a substitute for pf2 and 
> pf3: the more shingles I match, the more relevant the document. Also, higher 
> order shingles naturally get lower frequencies, meaning they get a "natural" 
> boost.
> 
> Best regards,
> Radu
> 
> On Thu, 21 May 2020, 00:28 Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> Did you try it with the 'sow' parameter both ways? I am not sure I fully
> understand the question, especially with shingling on both passes
> rather than just the indexing one. But at least it is something to try,
> and it is one of the areas where Solr and ES differ.
> 
> Regards,
>    Alex.
> 
> On Tue, 19 May 2020 at 05:59, Radu Gheorghe <radu.gheor...@sematext.com> 
> wrote:
> >
> > Hello Solr users,
> >
> > I’m quite puzzled about how shingles work. The way tokens are analysed 
> > looks fine to me, but the query seems too restrictive.
> >
> > Here’s the sample use-case. I have three documents:
> >
> > mona lisa smile
> > mona lisa
> > mona
> >
> > I have a shingle filter set up like this (both index- and query-time):
> >
> > > <filter class="solr.ShingleFilterFactory" minShingleSize="2" 
> > > maxShingleSize="4"/>
> >
> > When I query for “Mona Lisa smile” (no quotes), I expect to get all three 
> > documents back, in that order. Because the first document matches all the 
> > terms:
> >
> > mona
> > mona lisa
> > mona lisa smile
> > lisa
> > lisa smile
> > smile
> >
> > And the second one matches only some, and the third document only matches 
> > one.
> >
> > Instead, I only get the first document back. That’s because the query 
> > expects all the “words” to match:
> >
> > > "parsedquery":"+DisjunctionMaxQuery((((+shingle_field:mona 
> > > +shingle_field:lisa +shingle_field:smile) (+shingle_field:mona 
> > > +shingle_field:lisa smile) (+shingle_field:mona lisa 
> > > +shingle_field:smile) shingle_field:mona lisa smile)))",
> >
> > The query above is generated by the Edismax query parser, when I’m using 
> > “shingle_field” as “df”.
> >
> > Is there a way to get “any of the words” to match? I’ve tried all the 
> > options I can think of:
> > - different query parsers
> > - q.op=OR
> > - mm=0 (or 1 or 0% or 10% or…)
> >
> > Nothing seems to change the parsed query from the above.
> >
> > I’ve compared this to the behaviour of Elasticsearch. There, I get “OR” by 
> > default, and minimum_should_match works as expected. The only difference I 
> > see between the two, on the analysis side, is that token positions start at 0 in 
> > Elasticsearch and at 1 in Solr. I doubt that’s the problem, because I see 
> > that the default “text_en”, for example, also starts at position 1.
> >
> > Is it just a bug that mm doesn’t work in the context of shingles? Or is 
> > there a workaround?
> >
> > Thanks and best regards,
> > Radu
