OK, I thought that this was somehow expected, but what bothers me is that if I use min and max = 2, or min and max = 3, it grows linearly, but when I change to min = 2 and max = 3, the number of tokens explodes.
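As a rough sanity check on the growth: the clause counts in the parsed queries quoted in this thread (2, 4, 7 and 13 clauses for 2, 3, 4 and 5 words) match the number of ways to cut the query into consecutive chunks of 1, 2 or 3 words. The sketch below only reproduces that counting. It is not Solr's query-building code, and the function names `splits` and `clause_count` are made up for illustration.

```python
def splits(words, sizes=(1, 2, 3)):
    """Yield every way to cut `words` into consecutive chunks.

    Chunk lengths come from `sizes`: 1 stands for the unigrams
    (outputUnigrams defaults to "true"), 2 and 3 for the shingles.
    This is an assumed model of the observed output, not Solr code.
    """
    if not words:
        yield []
        return
    for size in sizes:
        if size <= len(words):
            head = " ".join(words[:size])
            for rest in splits(words[size:], sizes):
                yield [head] + rest


def clause_count(n, sizes=(1, 2, 3)):
    """Count the splits of n words without enumerating them."""
    counts = [1] + [0] * n  # counts[k] = number of ways to split k words
    for k in range(1, n + 1):
        counts[k] = sum(counts[k - s] for s in sizes if s <= k)
    return counts[n]


if __name__ == "__main__":
    for n in range(2, 8):
        print(n, "words ->", clause_count(n), "clauses")
```

Under this model each count is the sum of the previous three, so the number of clauses grows exponentially with the number of words rather than linearly.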
What I expected it to do was to build the 2-shingle clauses first and then the 3-shingle ones, giving something like:

text_shingles:word1_word2 text_shingles:word2_word3 text_shingles:word3_word4 text_shingles:word1_word2_word3 text_shingles:word2_word3_word4

Actually, if I analyze the field, the output is OK, but when that information is used to create the query it creates a lot of groups. When the query gets built it explodes with so many clauses. For example, the term "text_shingles:word4 word5" appears 4 times, and as the query grows the same term repeats even more, when I thought that each term should appear once in each query.

5 words:
"parsedquery":"+DisjunctionMaxQuery((((+text_shingles:word1 +text_shingles:word2 +text_shingles:word3 +text_shingles:word4 +text_shingles:word5) (+text_shingles:word1 +text_shingles:word2 +text_shingles:word3 +text_shingles:word4 word5) (+text_shingles:word1 +text_shingles:word2 +text_shingles:word3 word4 +text_shingles:word5) (+text_shingles:word1 +text_shingles:word2 +text_shingles:word3 word4 word5) (+text_shingles:word1 +text_shingles:word2 word3 +text_shingles:word4 +text_shingles:word5) (+text_shingles:word1 +text_shingles:word2 word3 +text_shingles:word4 word5) (+text_shingles:word1 +text_shingles:word2 word3 word4 +text_shingles:word5) (+text_shingles:word1 word2 +text_shingles:word3 +text_shingles:word4 +text_shingles:word5) (+text_shingles:word1 word2 +text_shingles:word3 +text_shingles:word4 word5) (+text_shingles:word1 word2 +text_shingles:word3 word4 +text_shingles:word5) (+text_shingles:word1 word2 +text_shingles:word3 word4 word5) (+text_shingles:word1 word2 word3 +text_shingles:word4 +text_shingles:word5) (+text_shingles:word1 word2 word3 +text_shingles:word4 word5))))",

On Fri, Jul 27, 2018 at 1:38 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> This is doing exactly what it should. It'd be a little clearer if you
> used a tokenSeparator other than the default space.
> Then this line:
>
> text_shingles:word1 word2 word3+text_shingles:word4 word5
>
> would look more like this:
> text_shingles:word1_word2_word3+text_shingles:word4_word5
>
> It's building a query from all of the 1, 2 and 3 grams. You're getting
> the single tokens because outputUnigrams defaults to "true".
>
> So of course as the number of terms in the query grows the number of
> clauses in the parsed query grows non-linearly.
>
> Best,
> Erick
>
> On Thu, Jul 26, 2018 at 12:44 PM, Jokin C <joki...@jokincuadrado.com>
> wrote:
> > Hi, I have a problem and I don't know if it's something that I am doing
> > wrong or if it's maybe a bug. I want to query a field with shingles; the
> > field and type definitions are these:
> >
> > <field name="text_shingles" type="text_en_shingles" indexed="true"
> > stored="false"/>
> >
> > <fieldType name="text_en_shingles" class="solr.TextField"
> > positionIncrementGap="100">
> > <analyzer>
> > <tokenizer class="solr.StandardTokenizerFactory"/>
> > <filter class="solr.LowerCaseFilterFactory"/>
> > <filter class="solr.ShingleFilterFactory" minShingleSize="2"
> > maxShingleSize="3" />
> > </analyzer>
> > </fieldType>
> >
> > I'm using Solr 7.2.1.
> >
> > I just wanted to have different min and max shingle sizes to test how it
> > works, but if the query is long Solr gives timeouts, high CPU and OOM.
> >
> > The query I'm using is this:
> >
> > http://localhost:8983/solr/ntnx/select?debugQuery=on&q={!edismax%20%20qf=%22text_shingles%22%20}%22%20word1%20word2%20word3%20word4%20word5%20word6%20word7
> >
> > and the parsed query grows like this with just 4 words; when I use a query
> > with a lot of words it fails.
> >
> > 2 words:
> > "parsedquery":"+DisjunctionMaxQuery((((+text_shingles:word1
> > +text_shingles:word2) text_shingles:word1 word2)))",
> >
> > 3 words:
> > "parsedquery":"+DisjunctionMaxQuery((((+text_shingles:word1
> > +text_shingles:word2 +text_shingles:word3) (+text_shingles:word1
> > +text_shingles:word2 word3) (+text_shingles:word1 word2
> > +text_shingles:word3) text_shingles:word1 word2 word3)))",
> >
> > 4 words:
> > "parsedquery":"+DisjunctionMaxQuery((((+text_shingles:word1
> > +text_shingles:word2 +text_shingles:word3 +text_shingles:word4)
> > (+text_shingles:word1 +text_shingles:word2 +text_shingles:word3 word4)
> > (+text_shingles:word1 +text_shingles:word2 word3 +text_shingles:word4)
> > (+text_shingles:word1 +text_shingles:word2 word3 word4)
> > (+text_shingles:word1 word2 +text_shingles:word3 +text_shingles:word4)
> > (+text_shingles:word1 word2 +text_shingles:word3 word4)
> > (+text_shingles:word1 word2 word3 +text_shingles:word4))))",
> >
> > 5 words:
> > "parsedquery":"+DisjunctionMaxQuery((((+text_shingles:word1
> > +text_shingles:word2 +text_shingles:word3 +text_shingles:word4
> > +text_shingles:word5) (+text_shingles:word1 +text_shingles:word2
> > +text_shingles:word3 +text_shingles:word4 word5) (+text_shingles:word1
> > +text_shingles:word2 +text_shingles:word3 word4 +text_shingles:word5)
> > (+text_shingles:word1 +text_shingles:word2 +text_shingles:word3 word4
> > word5) (+text_shingles:word1 +text_shingles:word2 word3
> > +text_shingles:word4 +text_shingles:word5) (+text_shingles:word1
> > +text_shingles:word2 word3 +text_shingles:word4 word5)
> > (+text_shingles:word1 +text_shingles:word2 word3 word4
> > +text_shingles:word5) (+text_shingles:word1 word2 +text_shingles:word3
> > +text_shingles:word4 +text_shingles:word5) (+text_shingles:word1 word2
> > +text_shingles:word3 +text_shingles:word4 word5) (+text_shingles:word1
> > word2 +text_shingles:word3 word4 +text_shingles:word5)
> > (+text_shingles:word1 word2 +text_shingles:word3 word4 word5)
> > (+text_shingles:word1 word2 word3 +text_shingles:word4
> > +text_shingles:word5) (+text_shingles:word1 word2 word3
> > +text_shingles:word4 word5))))",
> >
> > So, something bad is happening. Is it because I'm doing something wrong, or
> > is it maybe a bug that I should report on the team issue tracker?