Re: Use of ShingleFilter causing very large BooleanQuery structures in Solr 7.1

Hubert-Price, Neil Thu, 21 Mar 2019 15:11:25 -0700

Hi Erick,

I've run a series of tests using debug=true, the same original query, and 
variations around sow=true/sow=false/not set.  See links below for .txt files 
containing the output.  I have removed any genuine document content and 
replaced it with ...... because I don't have the customer's permission to post 
their data.  However the debug info, etc should still be usable.


Points to note:
 - All queries that completed returned the same set of documents
 - Solr 7.1 on the original configuration: the query succeeds only if sow=true 
is passed.
 - Solr 7.1 with the config change mentioned earlier: all 3 succeed, however 
both the original query and sow=false have higher QTime and longer parsed 
queries
 - With Solr 7.1 sow=true the behaviour seems to be the same with/without the 
reconfiguration
 - The Solr 4.6 output seems to be much the same for all 3 attempts, except for 
variations in QTime.  However that may be because the server is older + mostly 
unused currently.  I assume this sow parameter isn't supported in 4.6?
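
For anyone wanting to repeat the comparison, here is a minimal sketch of how 
the three variants could be requested and summarised.  The base URL and 
collection name are hypothetical placeholders, and the helper names are 
invented for illustration.  The request parameters (q, debug, wt, sow) and the 
response keys (responseHeader.QTime, debug.parsedquery) are the standard Solr 
ones.  The sketch only builds URLs and summarises an already-fetched JSON 
response; it does not contact a server.

```python
from urllib.parse import urlencode


def build_debug_urls(base_url, query):
    """Build one select URL per sow variant, each asking for query debug output."""
    # None means "do not send sow at all", i.e. use the server default.
    variants = {"default": None, "sow=false": "false", "sow=true": "true"}
    urls = {}
    for label, sow in variants.items():
        params = {"q": query, "debug": "query", "wt": "json"}
        if sow is not None:
            params["sow"] = sow
        urls[label] = base_url + "?" + urlencode(params)
    return urls


def summarise(response):
    """Pull QTime and the parsed-query length out of a decoded JSON response."""
    header = response.get("responseHeader", {})
    parsed = response.get("debug", {}).get("parsedquery", "")
    return {"qtime_ms": header.get("QTime"), "parsed_len": len(parsed)}


if __name__ == "__main__":
    # Hypothetical host and collection; substitute your own.
    base = "http://localhost:8983/solr/products/select"
    for label, url in build_debug_urls(base, "9611444500 9611444520").items():
        print(label, url)
```

Comparing parsed_len across the three variants gives a quick signal of whether 
the parsed query has grown, without having to read the full debug output.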

Solr 4.6 Original Query: 
https://drive.google.com/open?id=1vRn-2NabuKoJshqxXpQ-kOeJ8G-gcZlu
Solr 4.6 sow=false: 
https://drive.google.com/open?id=1nAvMvm9LNb-gA3UIDFI-eJzqaOhToQPV
Solr 4.6 sow=true: 
https://drive.google.com/open?id=14PRJG459poLe634E75T68wClJLg0tXWp
Solr 7.1 Original Config sow=true: 
https://drive.google.com/open?id=1q1iNfef6-LmqNjI7gTWxUJLsNx9C2U6v
Solr 7.1 Reconfigured Original Query: 
https://drive.google.com/open?id=138KYW7MCobU_3MZhC4lAhWgvWTaspK2N
Solr 7.1 Reconfigured sow=false: 
https://drive.google.com/open?id=127ZIKtSvivn5SJ4sLR25iu-mUCMW8bCu
Solr 7.1 Reconfigured sow=true: 
https://drive.google.com/open?id=1UJVHzQjgeF4fJ4ILnf4YYag5wdmWi3uS


So this sow=true setting has a very definite effect in Solr 7.1, for us at 
least.  I'm unclear how it affects the behaviour of the query though.  Surely 
the tokenizer splits on white space anyway, or it wouldn't work?  Can you 
explain any more about the purpose of this parameter & when it was introduced?
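
One observation that may be relevant, offered as a guess rather than a 
confirmed explanation: the 524288 clauses seen in the heap dump is exactly 
2^19, and the query had 20 tokens.  If the parser, when it does not split on 
whitespace first, receives the whole shingled token stream and builds one 
subquery per way of cutting the 20 tokens into consecutive shingles, then with 
maxShingleSize=30 every one of the 19 gaps between tokens is either a cut or a 
join, which gives 2^19 combinations.  The sketch below just counts those 
segmentations; the function name is made up and nothing here calls Solr or 
Lucene.

```python
def count_segmentations(num_tokens, max_shingle_size):
    """Count ways to split num_tokens consecutive tokens into shingles.

    Each shingle covers between 1 and max_shingle_size adjacent tokens.
    ways[i] holds the number of ways to cover the first i tokens.
    """
    ways = [0] * (num_tokens + 1)
    ways[0] = 1  # one way to cover zero tokens: use no shingles
    for end in range(1, num_tokens + 1):
        for size in range(1, min(max_shingle_size, end) + 1):
            ways[end] += ways[end - size]
    return ways[num_tokens]


if __name__ == "__main__":
    print(count_segmentations(20, 30))  # 524288, matches the heap dump
    print(count_segmentations(8, 8))    # 128, the reconfigured query analyzer
```

If that reading is right, it would also fit the sow=true result: splitting on 
whitespace before analysis would hand the analyzer one token at a time, so 
there would be nothing for the ShingleFilter to combine.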

Many Thanks,
Neil


On 21/03/2019, 16:06, "Erick Erickson" <erickerick...@gmail.com> wrote:

    Neil:
    
    Yeah, the attachment-stripping catches everyone the first time, we’re so 
used to just adding anything we want to an e-mail…
    
    I don’t know enough about the query parsing to answer off the top of my 
head. I do know one thing that’s changed is “Split on Whitespace” has changed 
from true to false by default, so it’d be interesting to add &sow=false to the 
query.
    
    Beyond that, take a look at what &debug=query added to the URL returns. My 
guess is that it’ll be identical but it’s worth a look.
    
    Sorry I can’t be more help here
    Erick
    
    > On Mar 21, 2019, at 1:11 AM, Hubert-Price, Neil 
<neil.hubert-pr...@sap.com> wrote:
    > 
    > Hello Erick,
    > 
    > This is the first time I've had reason to use the mailing list, so I 
wasn't aware of the behaviour around attachments.  See below, links to the 
images that I originally sent as attachments, both are screenshots from within 
Eclipse MAT looking at a SOLR heap dump.
    > 
    > LargeQueryStructure.png - 
https://drive.google.com/open?id=1SkRYav2iV6Z1znmzr4KKJzMcXzNF0_Wg 
    > LargeNumberClauses.png - 
https://drive.google.com/open?id=1CaySU2HzyvHsdbIW_n0190ofjPS3hAeN
    > 
    > The LargeQueryStructure image shows a single thread with a retained set 
of 4.8GB, with the biggest items being a BooleanWeight object of just over 
1.8GB and a BooleanQuery object of just under 1.8GB
    > 
    > The LargeNumberClauses image shows a drilldown into the BooleanQuery 
object, where a subquery is taking around 0.9GB and contains a 
BooleanClause[524288] array of clauses (not shown: each of these 524288 is 
actually a subquery with multiple clauses).  The array is taking 0.6GB, and 
there is a second instance of the same array in another subquery (also not 
shown).
    > 
    > 
    > Since the last email we have had some success with a reconfiguration of 
the fieldType that I referenced in my original email below.  Where it was 
originally:
    > 
    > <fieldType name="lowercase_tokens" class="solr.TextField" 
positionIncrementGap="100">
    >   <analyzer type="index">
    >           <tokenizer class="solr.WhitespaceTokenizerFactory" />
    >           <filter class="solr.StandardFilterFactory" />
    >           <filter class="solr.LowerCaseFilterFactory" />
    >           <filter class="solr.ShingleFilterFactory" maxShingleSize="30" 
outputUnigrams="true"/>
    >   </analyzer>
    > </fieldType>
    > 
    > We have now reconfigured to:
    > 
    > <fieldType name="lowercase_tokens" class="solr.TextField" 
positionIncrementGap="100">
    >   <analyzer type="index">
    >           <tokenizer class="solr.WhitespaceTokenizerFactory" />
    >           <filter class="solr.StandardFilterFactory" />
    >           <filter class="solr.LowerCaseFilterFactory" />
    >           <filter class="solr.ShingleFilterFactory" maxShingleSize="30" 
outputUnigrams="true"/>
    >   </analyzer>
    >   <analyzer type="query">
    >           <tokenizer class="solr.WhitespaceTokenizerFactory" />
    >           <filter class="solr.StandardFilterFactory" />
    >           <filter class="solr.LowerCaseFilterFactory" />
    >           <filter class="solr.LimitTokenCountFilterFactory" 
maxTokenCount="8" consumeAllTokens="false" />
    >           <filter class="solr.ShingleFilterFactory" maxShingleSize="8" 
outputUnigrams="true"/>
    >   </analyzer>
    > </fieldType>
    > 
    > After the reconfiguration, the huge memory effect of the queries in Solr 
7.1 is gone.  We could kill test instances of Solr with a single query in the 
original configuration. After reconfiguration we can run multiple similar 
queries in parallel, and the Solr process responds in 50-150ms with only 
approx. 100MB added to the heap.
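
To illustrate what the reconfigured query analyzer does to a long search 
string, here is a rough pure-Python imitation of the chain: whitespace 
tokenizing, lowercasing, truncating to 8 tokens, then shingling up to size 8 
with unigrams kept.  It is only a sketch: StandardFilter is left out, the 
function names are invented, and joining shingles with a single space is an 
assumption based on the shingled terms visible in the heap dump.

```python
def shingles(tokens, max_size):
    """Return unigrams plus every run of up to max_size adjacent tokens."""
    out = []
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            if start + size > len(tokens):
                break  # shingle would run past the last token
            out.append(" ".join(tokens[start:start + size]))
    return out


def query_analyze(text, max_tokens=8, max_shingle=8):
    """Imitate the reconfigured query-time chain on a raw search string."""
    tokens = text.lower().split()   # whitespace tokenizer + lowercase filter
    tokens = tokens[:max_tokens]    # token-count limit, as with maxTokenCount=8
    return shingles(tokens, max_shingle)


if __name__ == "__main__":
    text = " ".join("t%d" % i for i in range(20))  # stand-in 20-token search
    print(len(shingles(text.split(), 30)))  # original chain: 210 shingles
    print(len(query_analyze(text)))         # reconfigured chain: 36 shingles
```

The drop from 210 to 36 emitted terms is modest on its own; the larger saving 
would come from how few ways 8 tokens can be combined compared with 20.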
    > 
    > This may well be sufficient for our purposes, as I don't think end users 
will notice the difference in practice & queries that were previously failing 
now return normally.
    > 
    > However I am still curious as to how this performs so differently in Solr 
4.6 - the performance in 4.6 without reconfiguration is very similar to Solr 
7.1 after the reconfiguration.  It is almost as if something within Solr 4.6 is 
causing it to behave as though the number of tokens is limited (although I can 
see in the admin pages for Solr 4.6 that the query and index analyser setup 
both have original config with maxShingleSize=30 setting).  Do you have any 
thoughts about this?
    > 
    > 
    > Many Thanks,
    > Neil
    > 
    > On 20/03/2019, 16:13, "Erick Erickson" <erickerick...@gmail.com> wrote:
    > 
    >    The Apache mail server aggressively strips attachments, so yours 
didn’t come through. People often provide links to images stored somewhere 
else....
    > 
    >    As to why this is behaving this way, I’m pretty clueless. A _complete_ 
shot in the dark is the query parsing changed its default for split on 
whitespace from true to false, perhaps try specifying "&sow=true". Here’s some 
background: 
https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/
    > 
    >    I have no actual, you know, _knowledge_ that it’s related but it’d be 
super-easy to try and might give a clue.
    > 
    >    Best,
    >    Erick
    > 
    >> On Mar 20, 2019, at 2:00 AM, Hubert-Price, Neil 
<neil.hubert-pr...@sap.com> wrote:
    >> 
    >> Hello All,
    >> 
    >> We have a recently upgraded system that went from Solr 4.6 to Solr 7.1 
(used as part of an ecommerce application).  In the upgraded version we are 
seeing frequent issues with very high Solr memory usage for certain types of 
query, but the older 4.6 version does not produce the same response.
    >> 
    >> Having taken a heap dump and investigated, we can see instances of 
individual Solr threads where the retained set is 4GB to 5GB in size.  Drilling 
into this we can see a particular subquery with over 500,000 clauses.  
Screenshots below are from Eclipse MAT viewing a heap dump from the SOLR 
process. In observations of the 4.6 version we can see memory increments of 
100-200 MB for the same query, rather than 4-5 GB.
    >> 
    >> In both systems the index has around 2 million documents, with average 
size around 8KB.
    >> 
    >> The subquery with a very large set of clauses relates to a particular 
field setup to use ShingleFilter (with maxShingleSize=30, and 
outputUnigrams=true). Schema.xml definitions for this field are:
    >> 
    >> <fieldType name="lowercase_tokens" class="solr.TextField" positionIncrementGap="100">
    >>   <analyzer type="index">
    >>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
    >>     <filter class="solr.StandardFilterFactory" />
    >>     <filter class="solr.LowerCaseFilterFactory" />
    >>     <filter class="solr.ShingleFilterFactory" maxShingleSize="30" outputUnigrams="true"/>
    >>   </analyzer>
    >> </fieldType>
    >> 
    >> <field name="productdetails_tokens_en" type="lowercase_tokens" indexed="true" stored="false" multiValued="true"/>
    >> 
    >> <copyField source="supercategoryname_text_en" dest="productdetails_tokens_en" />
    >> <copyField source="supercategorydescription_text_en" dest="productdetails_tokens_en" />
    >> <copyField source="productNameAndDescription_text_en" dest="productdetails_tokens_en" />
    >> <copyField source="code_string" dest="productdetails_tokens_en" />
    >> 
    >> The issue happens when the user search contains large numbers of tokens. 
 In the example screenshots above the user search text had 20 tokens. The Solr 
query for that thread was as below (formatting/indentation added by me, the 
original is one long string).  This specific query contains tabs, however the 
same behaviour happens when spaces are used as well:
    >> (
    >> +(
    >>  fulltext_en:(9611444500            9611444520       9611444530       
9611444540       9611414550 9612194002                9612194002       
9612194002       9612194003       9612194007 9611416470             9611416470  
     9611416470                9611416480       9611416480 9613484402           
  9613484402       9613484402       9613484402       9613484402)
    >>  OR productdetails_tokens_en:(9611444500       9611444520       
9611444530       9611444540       9611414550 9612194002       9612194002       
9612194002       9612194003       9612194007 9611416470             9611416470  
              9611416470       9611416480       9611416480 9613484402           
  9613484402       9613484402       9613484402                9613484402)
    >>  OR codePartial:(9611444500     9611444520       9611444530       
9611444540       9611414550 9612194002                9612194002       
9612194002       9612194003       9612194007 9611416470             9611416470  
     9611416470                9611416480       9611416480 9613484402           
  9613484402       9613484402       9613484402       9613484402)
    >> )
    >> )
    >> AND
    >> (
    >> (
    >>  (
    >>   (productChannelVisibility_string_mv:ALL OR 
productChannelVisibility_string_mv:EBUSINESS OR 
productChannelVisibility_string_mv:INTERNET OR 
productChannelVisibility_string_mv:INTRANET)
    >>   AND
    >>   !productChannelVisibility_string_mv:NOTVISIBLE
    >>  )
    >>  AND
    >>  (
    >>   +(
    >>    fulltext_en:(9611444500          9611444520       9611444530       
9611444540       9611414550 9612194002                9612194002       
9612194002       9612194003       9612194007 9611416470             9611416470  
     9611416470                9611416480       9611416480 9613484402           
  9613484402       9613484402       9613484402       9613484402)
    >>    OR productdetails_tokens_en:(9611444500     9611444520       
9611444530       9611444540       9611414550 9612194002       9612194002       
9612194002       9612194003       9612194007 9611416470             9611416470  
              9611416470       9611416480       9611416480 9613484402           
  9613484402       9613484402       9613484402                9613484402)
    >>    OR codePartial:(9611444500  9611444520       9611444530       
9611444540       9611414550 9612194002                9612194002       
9612194002       9612194003       9612194007 9611416470             9611416470  
     9611416470                9611416480       9611416480 9613484402           
  9613484402       9613484402       9613484402       9613484402)
    >>   )
    >>  )
    >> )
    >> )
    >> 
    >> In the heap dump we can see the subqueries relating to 
fulltext_en/codePartial fields both have just 20 clauses.  However the two 
subqueries relating to productdetails_tokens_en both have 524288 clauses & each 
of those clauses is a subquery with up to 20 clauses (each of which seems to be 
a different shingled combination of the original tokens). For example, 
selecting an arbitrary single entry from the 524288 clauses, we can see a 
subquery with the following clauses:
    >> 
    >> Occur.MUST, productdetails_tokens_en: 9611444500
    >> Occur.MUST, productdetails_tokens_en: 9611416470 9611416480
    >> Occur.MUST, productdetails_tokens_en: 9611444520
    >> Occur.MUST, productdetails_tokens_en: 9611444540
    >> Occur.MUST, productdetails_tokens_en: 9612194007
    >> Occur.MUST, productdetails_tokens_en: 9611444530
    >> Occur.MUST, productdetails_tokens_en: 9612194002 9612194002
    >> Occur.MUST, productdetails_tokens_en: 9612194002
    >> Occur.MUST, productdetails_tokens_en: 9611416480
    >> Occur.MUST, productdetails_tokens_en: 9611416470
    >> Occur.MUST, productdetails_tokens_en: 9613484402
    >> Occur.MUST, productdetails_tokens_en: 9612194003
    >> Occur.MUST, productdetails_tokens_en: 9611414550
    >> Occur.MUST, productdetails_tokens_en: 9613484402 9613484402 9613484402   
         
    >> 
    >> 
    >> So the question has two parts:
    >> -          Is the observed behaviour expected in Solr 7.1 given the 
setup/query described above? (It seems to me that the answer is probably yes, 
because this is the purpose of the ShingleFilter)
    >> -          Why is the same behaviour not in evidence in Solr 4.6?  Are 
there major differences in the way that the query is constructed in the 
earlier version?  If so, can we change Solr 7.1 config to behave more like Solr 
4.6?
    >> 
    >> Many Thanks,
    >> Neil
    >> 
    >> 
    >> 
    >> 
    >> Neil Hubert-Price
    >> Senior Consultant, SAP CX Success and Services, Northern Europe
    >> 
    >> neil.hubert-pr...@sap.com
    >> M: +44 7788 368767
    >> 
    >> 
    >> SAP (UK) Limited, Registered in England No. 2152073. Registered Office: 
Clockhouse Place, Bedfont Road, Feltham, Middlesex, TW14 8HD
    > 
    > 
    >
