OK, I thought that this was somehow expected, but what bothers me is that if
I use min and max = 2, or min and max = 3, it grows linearly, but when I
change to min = 2 and max = 3, the number of tokens explodes.

What I expected was that it would first build the 2-shingle clauses and
then the 3-shingle ones, producing something like:

text_shingles:word1_word2 text_shingles:word2_word3 text_shingles:word3_word4
text_shingles:word1_word2_word3 text_shingles:word2_word3_word4

Actually, if I analyze the field, the output is OK, but when Solr uses
that information to build the query, it creates a lot of groups and the
query explodes with many clauses.

For example, the term "text_shingles:word4 word5" appears 4 times, and as
the query grows the same term repeats even more, when I thought that each
term should appear only once in each query.

5 words:
"parsedquery":"+DisjunctionMaxQuery((((+text_shingles:word1
+text_shingles:word2 +text_shingles:word3 +text_shingles:word4
+text_shingles:word5) (+text_shingles:word1 +text_shingles:word2
+text_shingles:word3 +text_shingles:word4 word5) (+text_shingles:word1
+text_shingles:word2 +text_shingles:word3 word4 +text_shingles:word5)
(+text_shingles:word1 +text_shingles:word2 +text_shingles:word3 word4
word5) (+text_shingles:word1 +text_shingles:word2 word3
+text_shingles:word4 +text_shingles:word5) (+text_shingles:word1
+text_shingles:word2 word3 +text_shingles:word4 word5)
(+text_shingles:word1 +text_shingles:word2 word3 word4
+text_shingles:word5) (+text_shingles:word1 word2 +text_shingles:word3
+text_shingles:word4 +text_shingles:word5) (+text_shingles:word1 word2
+text_shingles:word3 +text_shingles:word4 word5) (+text_shingles:word1
word2 +text_shingles:word3 word4 +text_shingles:word5)
(+text_shingles:word1 word2 +text_shingles:word3 word4 word5)
(+text_shingles:word1 word2 word3 +text_shingles:word4
+text_shingles:word5) (+text_shingles:word1 word2 word3 +text_shingles:word4
word5))))",
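
Counting the groups in the parsed queries gives 2, 4, 7 and 13 for 2, 3, 4
and 5 words. That matches the number of ways to split the query into
consecutive runs of 1, 2 or 3 words, which would explain the growth. Below
is a minimal sketch of that counting in Python. It is only my model of the
parsed output, not taken from the Solr source; the function names are
hypothetical, and the sizes (1, 2, 3) assume outputUnigrams is left at its
default of "true", as Erick pointed out.

```python
def segmentations(words, sizes=(1, 2, 3)):
    """Yield every split of `words` into consecutive runs whose lengths
    are in `sizes`. Each run stands for one term (unigram or shingle)."""
    if not words:
        yield []
        return
    for size in sizes:
        if size <= len(words):
            head = " ".join(words[:size])
            for rest in segmentations(words[size:], sizes):
                yield [head] + rest


def count_groups(n, sizes=(1, 2, 3)):
    """Count the splits of an n-word query without enumerating them."""
    counts = [1] + [0] * n  # counts[0] = 1: the empty query has one split
    for i in range(1, n + 1):
        counts[i] = sum(counts[i - s] for s in sizes if s <= i)
    return counts[n]


if __name__ == "__main__":
    words = ["word1", "word2", "word3", "word4", "word5"]
    groups = list(segmentations(words))
    print(len(groups))  # number of groups for the 5-word query
    # How many groups contain the shingle "word4 word5"
    print(sum("word4 word5" in group for group in groups))
    for n in range(2, 8):
        print(n, count_groups(n))
```

For 5 words this gives 13 groups, and the shingle "word4 word5" shows up in
4 of them, which is the repetition I described above. Each count is the sum
of the previous three, so the number of groups grows exponentially with the
query length.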



On Fri, Jul 27, 2018 at 1:38 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> This is doing exactly what it should. It'd be a little clearer if you
> used a tokenSeparator other than the default space. Then this line:
>
> text_shingles:word1 word2 word3+text_shingles:word4 word5
>
> would look more like this:
> text_shingles:word1_word2_word3+text_shingles:word4_word5
>
> It's building a query from all of the 1, 2 and 3 grams. You're getting
> the single tokens because outputUnigrams defaults to "true".
>
> So of course as the number of terms in the query grows the number of
> clauses in the parsed query grows non-linearly.
>
> Best,
> Erick
>
> On Thu, Jul 26, 2018 at 12:44 PM, Jokin C <joki...@jokincuadrado.com>
> wrote:
> > Hi, I have a problem and I don't know if it's something that I am doing
> > wrong or if it's maybe a bug. I want to query a field with shingles; the
> > field and type definitions are these:
> >
> > <field name="text_shingles" type="text_en_shingles" indexed="true"
> > stored="false"/>
> >
> > <fieldType name="text_en_shingles" class="solr.TextField"
> > positionIncrementGap="100">
> >     <analyzer >
> >       <tokenizer class="solr.StandardTokenizerFactory"/>
> >       <filter class="solr.LowerCaseFilterFactory"/>
> >       <filter class="solr.ShingleFilterFactory" minShingleSize="2"
> > maxShingleSize="3" />
> >     </analyzer>
> >   </fieldType>
> >
> >
> > I'm using Solr 7.2.1.
> >
> > I just wanted to have different min and max shingle sizes to test how it
> > works, but if the query is long, Solr gives timeouts, high CPU and OOM.
> >
> > the query I'm using is this:
> >
> > http://localhost:8983/solr/ntnx/select?debugQuery=on&q={!edismax%20%20qf=%22text_shingles%22%20}%22%20word1%20word2%20word3%20word4%20word5%20word6%20word7
> >
> > and the parsed query grows like this with just 4 words; when I use a
> > query with a lot of words it fails.
> >
> > 2 words:
> > "parsedquery":"+DisjunctionMaxQuery((((+text_shingles:word1
> > +text_shingles:word2) text_shingles:word1 word2)))",
> >
> > 3 words:
> > "parsedquery":"+DisjunctionMaxQuery((((+text_shingles:word1
> > +text_shingles:word2 +text_shingles:word3) (+text_shingles:word1
> > +text_shingles:word2 word3) (+text_shingles:word1 word2
> > +text_shingles:word3) text_shingles:word1 word2 word3)))",
> >
> > 4 words:
> > "parsedquery":"+DisjunctionMaxQuery((((+text_shingles:word1
> > +text_shingles:word2 +text_shingles:word3 +text_shingles:word4)
> > (+text_shingles:word1 +text_shingles:word2 +text_shingles:word3 word4)
> > (+text_shingles:word1 +text_shingles:word2 word3 +text_shingles:word4)
> > (+text_shingles:word1 +text_shingles:word2 word3 word4)
> > (+text_shingles:word1 word2 +text_shingles:word3 +text_shingles:word4)
> > (+text_shingles:word1 word2 +text_shingles:word3 word4)
> > (+text_shingles:word1 word2 word3 +text_shingles:word4))))",
> >
> > 5 words:
> > "parsedquery":"+DisjunctionMaxQuery((((+text_shingles:word1
> > +text_shingles:word2 +text_shingles:word3 +text_shingles:word4
> > +text_shingles:word5) (+text_shingles:word1 +text_shingles:word2
> > +text_shingles:word3 +text_shingles:word4 word5) (+text_shingles:word1
> > +text_shingles:word2 +text_shingles:word3 word4 +text_shingles:word5)
> > (+text_shingles:word1 +text_shingles:word2 +text_shingles:word3 word4
> > word5) (+text_shingles:word1 +text_shingles:word2 word3
> > +text_shingles:word4 +text_shingles:word5) (+text_shingles:word1
> > +text_shingles:word2 word3 +text_shingles:word4 word5)
> > (+text_shingles:word1 +text_shingles:word2 word3 word4
> > +text_shingles:word5) (+text_shingles:word1 word2 +text_shingles:word3
> > +text_shingles:word4 +text_shingles:word5) (+text_shingles:word1 word2
> > +text_shingles:word3 +text_shingles:word4 word5) (+text_shingles:word1
> > word2 +text_shingles:word3 word4 +text_shingles:word5)
> > (+text_shingles:word1 word2 +text_shingles:word3 word4 word5)
> > (+text_shingles:word1 word2 word3 +text_shingles:word4
> > +text_shingles:word5) (+text_shingles:word1 word2 word3
> > +text_shingles:word4 word5))))",
> >
> >
> > So something bad is happening. Is it because I'm doing something wrong,
> > or is it maybe a bug that I should report on the team issue tracker?
>
