Re: [CAUTION] Re: Use of ShingleFilter causing very large BooleanQuery structures in Solr 7.1

Hubert-Price, Neil Fri, 22 Mar 2019 01:02:53 -0700

One other question ....

Is there a system-level configuration that can change the default for the sow 
parameter, so that it defaults to true?


Many Thanks,
Neil

On 22/03/2019, 08:36, "Hubert-Price, Neil" <neil.hubert-pr...@sap.com> wrote:

    Thanks Erick, that makes sense.
    
    However it does lead me to another conclusion: in Solr prior to 6.0, or 
with sow=true on Solr 6.0+ .... that would mean that the ShingleFilter is 
totally ineffective within query analysers. It would be logically equivalent to 
not having the ShingleFilter configured at all.
    
    The point of the ShingleFilter as I understand it is to create 
combinations/permutations, but there are none possible surely if it receives 
only one pre-split token at a time.
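
A minimal sketch of that reasoning in plain Python (not Solr code). The helper `shingles` and its `max_size=3` default are assumptions made for illustration; the real ShingleFilter has its own defaults and options. It only shows why a shingler that is handed one pre-split token at a time cannot emit any combinations:

```python
def shingles(tokens, max_size=3, separator=" "):
    """Hypothetical helper: return each token plus every run of
    2..max_size adjacent tokens, joined with `separator`."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(separator.join(tokens[i:i + n]))
    return out


query = "9611444500 9612194002 9612194002"

# Like sow=true: the query is split on whitespace first, so the
# shingler only ever sees a one-token stream and adds nothing.
split_first = [s for word in query.split() for s in shingles([word])]

# Like sow=false: the whole token stream reaches the shingler,
# which emits multi-word terms in addition to the single tokens.
whole = shingles(query.split())

# A non-space separator makes it obvious that each shingle is one term.
underscored = shingles(query.split(), separator="_")
```

Here `split_first` is just the original three tokens, while `whole` also contains the multi-word shingles that turn into extra query clauses.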
    
    Going back to my original configuration: to achieve the same result as in 
Solr 4.6, I think I would need to remove ShingleFilterFactory from the query 
analyser config for that field type. Is that right?
    
    Many Thanks,
    Neil
    
    Sent from my iPhone
    
    > On 22 Mar 2019, at 02:38, Erick Erickson <erickerick...@gmail.com> wrote:
    > 
    > sow was introduced in Solr 6, so it’s just ignored in 4.x.
    > 
    > bq. Surely the tokenizer splits on white space anyway, or it wouldn't 
work?
    > 
    > I didn’t work on that code, so I don’t have the details off the top of my 
head, but I’ll take a stab at it as far as my understanding goes. The result is 
in your parsed queries.
    > 
    > Note that in the better-behaved case, you have a bunch of individual 
tokens ORed together like:
    > productdetails_tokens_en:9611444530
    > productdetails_tokens_en:9611444530
    > 
    > and that’s all. IOW, the query parser has split them into individual 
tokens that are fed one at a time into the analysis chain.
    > 
    > In the bad case you have a bunch of single tokens as well, but then what 
look like multiple tokens, but are not:
    > +productdetails_tokens_en:9611444500
    > +productdetails_tokens_en:9612194002 9612194002 9612194002)
    > 
    > which is where the explosion is coming from. It’s deceptive, because when 
shingling, this is a single token “9612194002 9612194002 9612194002”, for all it 
looks like something that’d be split by whitespace.
    > 
    > If you take a look at your admin UI>>your_core>>schema and select your 
productdetails_tokens_en from the drop down and then “load terms” you’ll see. 
If you want to experiment, you can add a tokenSeparator character other than a 
space to the ShingleFilter, which will make it clearer. Then the clause above that 
looks like multiple, whitespace-separated tokens would look like what it really 
is, a single token:
    > 
    > +productdetails_tokens_en:9612194002_9612194002_9612194002)
    > 
    > Best,
    > Erick
    > 
    >> On Mar 21, 2019, at 3:10 PM, Hubert-Price, Neil 
<neil.hubert-pr...@sap.com> wrote:
    >> 
    >> Surely the tokenizer splits on white space anyway, or it wouldn't work?
    >
