Thanks Erick, that makes sense.

However it does lead me to another conclusion: in Solr prior to 6.0, or with 
sow=true on Solr 6.0+, the ShingleFilter would be totally ineffective within 
query analysers. It would be logically equivalent to not having the 
ShingleFilter configured at all.

The point of the ShingleFilter, as I understand it, is to create 
combinations/permutations of adjacent tokens, but surely none are possible if 
it receives only one pre-split token at a time.
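
To illustrate what I mean (a conceptual sketch in Python only, not the actual
Lucene implementation; the function name and defaults are my own), a shingle
filter that sees the whole token stream can emit n-gram combinations, but
given a single token there is nothing to combine:

```python
def shingles(tokens, min_size=2, max_size=2, sep=" "):
    """Conceptual sketch of shingling: emit the original tokens
    (like outputUnigrams=true) plus adjacent n-gram combinations."""
    out = list(tokens)
    for n in range(min_size, max_size + 1):
        for i in range(len(tokens) - n + 1):
            out.append(sep.join(tokens[i:i + n]))
    return out

# Whole phrase analysed in one pass: shingles are produced.
print(shingles(["9611444500", "9612194002", "9612194002"]))
# -> ['9611444500', '9612194002', '9612194002',
#     '9611444500 9612194002', '9612194002 9612194002']

# With sow=true each pre-split token is analysed alone: no shingles possible.
print(shingles(["9611444500"]))
# -> ['9611444500']
```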

Going back to my original configuration: to achieve the same result as in 
Solr 4.6, I think I would need to remove ShingleFilterFactory from the query 
analyser config for that field type?
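
For reference, the change I'm describing would look something like this (a 
sketch only; the field type name and the rest of the analysis chain are 
placeholders, not my actual config):

```xml
<fieldType name="text_shingles" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
  <!-- Query analyser with ShingleFilterFactory removed -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```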

Many Thanks,
Neil

Sent from my iPhone

> On 22 Mar 2019, at 02:38, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> sow was introduced in Solr 6, so it’s just ignored in 4.x.
> 
> bq. Surely the tokenizer splits on white space anyway, or it wouldn't work?
> 
> I didn’t work on that code, so I don’t have the details off the top of my 
> head, but I’ll take a stab at it as far as my understanding goes. The result 
> is in your parsed queries.
> 
> Note that in the better-behaved case, you have a bunch of individual 
> tokens OR'd together like:
> productdetails_tokens_en:9611444530
> productdetails_tokens_en:9611444530
> 
> and that’s all. IOW, the query parser has split them into individual tokens 
> that are fed one at a time into the analysis chain.
> 
> In the bad case you have a bunch of single tokens as well, but then what look 
> like multiple tokens, but are not:
> +productdetails_tokens_en:9611444500
> +productdetails_tokens_en:9612194002 9612194002 9612194002)
> 
> which is where the explosion is coming from. It’s deceptive: when 
> shingling, this is a single token "9612194002 9612194002 9612194002", even 
> though it looks like something that would be split on whitespace.
> 
> If you take a look at the admin UI>>your_core>>schema, select your 
> productdetails_tokens_en field from the drop-down, and then "load terms", 
> you’ll see this. If you want to experiment, you can add a tokenSeparator 
> character other than a space to the ShingleFilter; that will make it 
> clearer. Then the clause above that looks like multiple whitespace-separated 
> tokens would look like what it really is, a single token:
> 
> +productdetails_tokens_en:9612194002_9612194002_9612194002)
> 
> Best,
> Erick
> 
>> On Mar 21, 2019, at 3:10 PM, Hubert-Price, Neil <neil.hubert-pr...@sap.com> 
>> wrote:
>> 
>> Surely the tokenizer splits on white space anyway, or it wouldn't work?
> 
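
For anyone following along, the tokenSeparator change Erick describes would 
look something like this on the ShingleFilterFactory (my sketch; the attribute 
names are from the Solr filter reference):

```xml
<filter class="solr.ShingleFilterFactory" minShingleSize="2"
        maxShingleSize="2" outputUnigrams="true" tokenSeparator="_"/>
```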
