Use of ShingleFilter causing very large BooleanQuery structures in Solr 7.1

Hubert-Price, Neil Wed, 20 Mar 2019 02:01:10 -0700

Hello All,

We have a system, used as part of an ecommerce application, that was recently 
upgraded from Solr 4.6 to Solr 7.1.  In the upgraded version we are seeing 
frequent issues with very high Solr memory usage for certain types of query, 
whereas the older 4.6 version does not show the same behaviour.


Having taken a heap dump and investigated, we can see instances of individual 
Solr threads where the retained set is 4GB to 5GB in size.  Drilling into this, 
we can see a particular subquery with over 500,000 clauses.  The screenshots 
below are from Eclipse MAT viewing a heap dump from the Solr process. On the 
4.6 version we see memory increments of 100-200 MB for the same query, rather 
than 4-5 GB.

In both systems the index has around 2 million documents, with average size 
around 8KB.


[Screenshot: image001.png, Eclipse MAT view of the heap dump]

[Screenshot: image002.png, Eclipse MAT view of the heap dump]


The subquery with the very large set of clauses relates to a particular field 
that is set up to use ShingleFilter (with maxShingleSize=30 and 
outputUnigrams=true). The schema.xml definitions for this field are:

<fieldType name="lowercase_tokens" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StandardFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.ShingleFilterFactory" maxShingleSize="30" outputUnigrams="true"/>
  </analyzer>
</fieldType>

<field name="productdetails_tokens_en" type="lowercase_tokens" indexed="true" stored="false" multiValued="true"/>

<copyField source="supercategoryname_text_en" dest="productdetails_tokens_en" />
<copyField source="supercategorydescription_text_en" dest="productdetails_tokens_en" />
<copyField source="productNameAndDescription_text_en" dest="productdetails_tokens_en" />
<copyField source="code_string" dest="productdetails_tokens_en" />
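
To make the effect of that filter configuration concrete, here is a minimal 
Python sketch of what a shingle filter with outputUnigrams=true produces: every 
contiguous run of 1 to maxShingleSize tokens. It is an illustration only, not 
Lucene's implementation; the function name `shingles` and its parameters are 
made up for this example.

```python
def shingles(tokens, max_size=30, output_unigrams=True, sep=" "):
    """Return every contiguous token run of length 1..max_size.

    Illustrative only: mimics the kind of output a shingle filter
    produces, it is not the Lucene ShingleFilter code.
    """
    min_size = 1 if output_unigrams else 2
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break  # run would extend past the end of the input
            out.append(sep.join(tokens[start:start + size]))
    return out


if __name__ == "__main__":
    for s in shingles(["a", "b", "c"]):
        print(s)
    # A 20-token input with max_size=30 yields 20 * 21 / 2 runs.
    print(len(shingles([str(i) for i in range(20)])))
```

For n tokens with n no larger than maxShingleSize this gives n(n+1)/2 distinct 
runs, so the number of shingles grows quadratically with the input length.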

The issue happens when the user search contains a large number of tokens.  In 
the example screenshots above the user search text had 20 tokens. The Solr 
query for that thread was as below (formatting/indentation added by me; the 
original is one long string).  This specific query contains tabs, but the 
same behaviour occurs when spaces are used instead:
(
+(
  fulltext_en:(9611444500            9611444520       9611444530       
9611444540       9611414550 9612194002                9612194002       
9612194002       9612194003       9612194007 9611416470             9611416470  
     9611416470                9611416480       9611416480 9613484402           
  9613484402       9613484402       9613484402       9613484402)
  OR productdetails_tokens_en:(9611444500       9611444520       9611444530     
  9611444540       9611414550 9612194002       9612194002       9612194002      
 9612194003       9612194007 9611416470             9611416470                
9611416470       9611416480       9611416480 9613484402             9613484402  
     9613484402       9613484402                9613484402)
  OR codePartial:(9611444500     9611444520       9611444530       9611444540   
    9611414550 9612194002                9612194002       9612194002       
9612194003       9612194007 9611416470             9611416470       9611416470  
              9611416480       9611416480 9613484402             9613484402     
  9613484402       9613484402       9613484402)
)
)
AND
(
(
  (
   (productChannelVisibility_string_mv:ALL OR 
productChannelVisibility_string_mv:EBUSINESS OR 
productChannelVisibility_string_mv:INTERNET OR 
productChannelVisibility_string_mv:INTRANET)
   AND
   !productChannelVisibility_string_mv:NOTVISIBLE
  )
  AND
  (
   +(
    fulltext_en:(9611444500          9611444520       9611444530       
9611444540       9611414550 9612194002                9612194002       
9612194002       9612194003       9612194007 9611416470             9611416470  
     9611416470                9611416480       9611416480 9613484402           
  9613484402       9613484402       9613484402       9613484402)
    OR productdetails_tokens_en:(9611444500     9611444520       9611444530     
  9611444540       9611414550 9612194002       9612194002       9612194002      
 9612194003       9612194007 9611416470             9611416470                
9611416470       9611416480       9611416480 9613484402             9613484402  
     9613484402       9613484402                9613484402)
    OR codePartial:(9611444500  9611444520       9611444530       9611444540    
   9611414550 9612194002                9612194002       9612194002       
9612194003       9612194007 9611416470             9611416470       9611416470  
              9611416480       9611416480 9613484402             9613484402     
  9613484402       9613484402       9613484402)
   )
  )
)
)

In the heap dump we can see that the subqueries relating to the fulltext_en and 
codePartial fields both have just 20 clauses.  However, the two subqueries 
relating to productdetails_tokens_en both have 524288 clauses, and each of 
those clauses is itself a subquery with up to 20 clauses (each of which seems 
to be a different shingled combination of the original tokens). For example, 
selecting an arbitrary single entry from the 524288 clauses, we can see a 
subquery with the following clauses:

Occur.MUST, productdetails_tokens_en: 9611444500
Occur.MUST, productdetails_tokens_en: 9611416470 9611416480
Occur.MUST, productdetails_tokens_en: 9611444520
Occur.MUST, productdetails_tokens_en: 9611444540
Occur.MUST, productdetails_tokens_en: 9612194007
Occur.MUST, productdetails_tokens_en: 9611444530
Occur.MUST, productdetails_tokens_en: 9612194002 9612194002
Occur.MUST, productdetails_tokens_en: 9612194002
Occur.MUST, productdetails_tokens_en: 9611416480
Occur.MUST, productdetails_tokens_en: 9611416470
Occur.MUST, productdetails_tokens_en: 9613484402
Occur.MUST, productdetails_tokens_en: 9612194003
Occur.MUST, productdetails_tokens_en: 9611414550
Occur.MUST, productdetails_tokens_en: 9613484402 9613484402 9613484402
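
A note on the arithmetic: 524288 is exactly 2^19, which is the number of ways 
to split a sequence of 20 tokens into contiguous groups (each of the 19 gaps 
between tokens is either a break or not). The Python sketch below enumerates 
those splits for a small input and computes the count for 20 tokens. It assumes 
the clause count comes from one clause per split; that matches the number seen 
in the heap dump but has not been confirmed against Solr's query-building code. 
The function names are hypothetical.

```python
from itertools import product


def segmentations(tokens, sep=" "):
    """Yield every way to split tokens into contiguous groups.

    Each gap between adjacent tokens is either a cut or a join, so a
    sequence of n tokens has 2 ** (n - 1) segmentations.
    """
    if not tokens:
        return
    for cuts in product((False, True), repeat=len(tokens) - 1):
        groups, current = [], [tokens[0]]
        for tok, cut in zip(tokens[1:], cuts):
            if cut:
                groups.append(sep.join(current))
                current = [tok]
            else:
                current.append(tok)
        groups.append(sep.join(current))
        yield groups


def count_segmentations(n_tokens):
    """Closed-form count of segmentations for n_tokens tokens."""
    return 2 ** (n_tokens - 1) if n_tokens > 0 else 0


if __name__ == "__main__":
    for groups in segmentations(["a", "b", "c"]):
        print(groups)
    print(count_segmentations(20))
```

Each split has between 1 and 20 groups, which is consistent with the "up to 20 
clauses" seen inside each of the 524288 entries.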


So the question has two parts:

-          Is the observed behaviour expected in Solr 7.1 given the 
setup/query described above? (It seems to me that the answer is probably yes, 
because this is the purpose of the ShingleFilter.)

-          Why is the same behaviour not in evidence in Solr 4.6?  Are there 
major differences in the way that the query is constructed in the earlier 
version?  If so, can we change the Solr 7.1 config to behave more like Solr 4.6?

Many Thanks,
Neil




Neil Hubert-Price
Senior Consultant, SAP CX Success and Services, Northern Europe

neil.hubert-pr...@sap.com
M: +44 7788 368767


SAP (UK) Limited, Registered in England No. 2152073. Registered Office: 
Clockhouse Place, Bedfont Road, Feltham, Middlesex, TW14 8HD
