Re: Shingles behavior

Radu Gheorghe Wed, 20 May 2020 21:50:09 -0700

Hi Alex, long time no see :)

I tried with sow, and that basically invalidates query-time shingles (it
only matches mona OR lisa OR smile).


I'm using shingles at both index and query time as a substitute for pf2 and
pf3: the more shingles I match, the more relevant the document. Also,
higher-order shingles naturally have lower frequencies, meaning they get a
"natural" boost.
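
To make both points concrete, here's a rough Python sketch. It is not Solr
code: the function names are made up for illustration, analysis is reduced
to lowercasing plus whitespace splitting, and "sow" is modelled simply as
analyzing each whitespace-separated word on its own (which is why no
shingles can form in that mode).

```python
def shingles(tokens, min_size=2, max_size=4):
    """Unigrams plus word n-grams of min_size..max_size, in position order.

    Simplified stand-in for a shingle filter that also outputs unigrams.
    """
    sizes = [1] + list(range(min_size, max_size + 1))
    return [
        " ".join(tokens[start:start + size])
        for start in range(len(tokens))
        for size in sizes
        if start + size <= len(tokens)
    ]


def analyze_query(text, sow):
    """Hypothetical query analysis: sow=True analyzes each word in isolation."""
    words = text.lower().split()
    if sow:
        # Each word goes through analysis alone, so only unigrams come out.
        return [term for word in words for term in shingles([word])]
    return shingles(words)


def matched_shingles(query, doc):
    """Query shingles that also occur among the document's shingles."""
    doc_terms = set(shingles(doc.lower().split()))
    return [term for term in analyze_query(query, sow=False) if term in doc_terms]


docs = ["mona lisa smile", "mona lisa", "mona"]

print(analyze_query("Mona Lisa smile", sow=True))
# ['mona', 'lisa', 'smile']

for doc in docs:
    print(doc, "->", len(matched_shingles("Mona Lisa smile", doc)))
# mona lisa smile -> 6
# mona lisa -> 3
# mona -> 1
```

The 6 / 3 / 1 counts are the ordering I'd like the score to reflect, if
every shingle were an optional clause.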

Best regards,
Radu

On Thu, 21 May 2020, 00:28 Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> Did you try it with the 'sow' parameter both ways? I am not sure I fully
> understand the question, especially with shingling on both passes
> rather than just the indexing one. But at least it is something to try,
> and it is one of the areas where Solr and ES differ.
>
> Regards,
>    Alex.
>
> On Tue, 19 May 2020 at 05:59, Radu Gheorghe <radu.gheor...@sematext.com>
> wrote:
> >
> > Hello Solr users,
> >
> > I’m quite puzzled about how shingles work. The way tokens are analysed
> > looks fine to me, but the query seems too restrictive.
> >
> > Here’s the sample use-case. I have three documents:
> >
> > mona lisa smile
> > mona lisa
> > mona
> >
> > I have a shingle filter set up like this (both index- and query-time):
> >
> > > <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4"/>
> >
> > When I query for “Mona Lisa smile” (no quotes), I expect to get all
> > three documents back, in that order, because the first document matches
> > all the terms:
> >
> > mona
> > mona lisa
> > mona lisa smile
> > lisa
> > lisa smile
> > smile
> >
> > The second one matches only some of them, and the third document
> > matches only one.
> >
> > Instead, I only get the first document back. That’s because the query
> > expects all the “words” to match:
> >
> > > "parsedquery":"+DisjunctionMaxQuery((((+shingle_field:mona
> > > +usage_query_view_tags:lisa +shingle_field:smile) (+shingle_field:mona
> > > +shingle_field:lisa smile) (+shingle_field:mona lisa +shingle_field:smile)
> > > shingle_field:mona lisa smile)))",
> >
> > The query above is generated by the Edismax query parser, when I’m using
> > “shingle_field” as “df”.
> >
> > Is there a way to get “any of the words” to match? I’ve tried all the
> > options I can think of:
> > - different query parsers
> > - q.op=OR
> > - mm=0 (or 1 or 0% or 10% or…)
> >
> > Nothing seems to change the parsed query from the above.
> >
> > I’ve compared this to the behaviour of Elasticsearch. There, I get “OR”
> > by default, and minimum_should_match works as expected. The only
> > difference I see between the two, on the analysis side, is that token
> > positions start at 0 in Elasticsearch and at 1 in Solr. I doubt that’s
> > the problem, because I see that the default “text_en”, for example, also
> > starts at position 1.
> >
> > Is it just a bug that mm doesn’t work in the context of shingles? Or is
> > there a workaround?
> >
> > Thanks and best regards,
> > Radu
>
