sow was introduced in Solr 6, so it's just ignored in 4.x.
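
For reference, a sketch of how you'd set it on a request in 6.x and later (host, port, and core name are just placeholders; the field name is from your thread):

  http://localhost:8983/solr/your_core/select?q=9611444530%209611444530&df=productdetails_tokens_en&sow=true&debugQuery=true

debugQuery=true shows the parsedquery in the response, which is the easiest way to compare sow=true and sow=false side by side. And if I remember right, the default flipped from true to false in 7.0, which is one reason query-time shingle behavior can change across an upgrade.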

bq. Surely the tokenizer splits on white space anyway, or it wouldn't work?

I didn't work on that code, so I don't have the details off the top of my head, 
but I'll take a stab at it as far as my understanding goes. The evidence is in 
your parsed queries.

Note that in the better-behaved case, you have a bunch of individual tokens 
OR'd together like:
productdetails_tokens_en:9611444530
productdetails_tokens_en:9611444530

and that's all. IOW, the query parser has split the query on whitespace into 
individual tokens that are fed one at a time into the analysis chain.

In the bad case you have a bunch of single tokens as well, but then something 
that looks like multiple tokens but is not:
+productdetails_tokens_en:9611444500
+productdetails_tokens_en:9612194002 9612194002 9612194002)

which is where the explosion is coming from. It's deceptive, because when 
shingling this is a single token "9612194002 9612194002 9612194002", for all 
that it looks like something that'd be split on whitespace.
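
To make that concrete, here's roughly what a shingle filter produces from three 
input tokens (assuming maxShingleSize=3 and outputUnigrams=true, which is what 
your parsed query suggests; check your actual fieldType):

  input tokens:   9612194002   9612194002   9612194002
  terms emitted:  9612194002
                  9612194002 9612194002
                  9612194002 9612194002 9612194002

Each emitted line is a single term, embedded spaces and all, which is why the 
clause above looks like three whitespace-separated tokens but isn't.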

If you take a look at the admin UI >> your_core >> schema, select 
productdetails_tokens_en from the drop-down, and then "load terms", you'll see 
the shingled terms that were actually indexed. If you want to experiment, you 
can add a tokenSeparator character other than a space to the ShingleFilter; 
that'll make it clearer. Then the clause above that looks like multiple, 
whitespace-separated tokens would look like what it really is, a single token:

+productdetails_tokens_en:9612194002_9612194002_9612194002)
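
A sketch of what that might look like in the fieldType definition (the 
tokenizer and the min/max shingle sizes here are assumptions, adjust to match 
your actual analysis chain):

  <fieldType name="text_shingles" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- tokenSeparator replaces the default single space between shingled tokens -->
      <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="3"
              outputUnigrams="true" tokenSeparator="_"/>
    </analyzer>
  </fieldType>

Note you'd need to reindex after changing the separator, or the query-time 
shingles won't match what's in the index.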

Best,
Erick

> On Mar 21, 2019, at 3:10 PM, Hubert-Price, Neil <neil.hubert-pr...@sap.com> 
> wrote:
> 
> Surely the tokenizer splits on white space anyway, or it wouldn't work?
