We are still having serious problems with our SolrCloud failing because of this 
issue.
The problem is clearly data related.
How can I determine which documents are being searched? Is it possible to get 
Solr/Lucene to output the docids being searched?
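For the docids, one option (a sketch only -- host, port, collection, and field 
name below are placeholders for your environment) is the built-in [docid] 
document transformer, which adds each hit's internal Lucene docid to the 
response:

```shell
# Placeholders: adjust host/port/collection/field to your environment.
SOLR="http://localhost:8983/solr/my-collection"

# The [docid] transformer in the fl parameter returns each hit's internal
# Lucene docid alongside the stored fields you request.
URL="$SOLR/select?q=your_ja_field:TERM&fl=id,[docid]&rows=50"
echo "$URL"
# curl -g "$URL"    # run against a live node (-g keeps curl from globbing [])
```

Note that internal docids are only stable for a single searcher snapshot; they 
can change after segment merges, so use them for one-off debugging only.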

I believe this is a Lucene bug, but I need to narrow the focus to a smaller 
number of records, and I'm not certain how to do that efficiently. Are there 
debug parameters that could help?
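Solr's standard debug options may help narrow it down: debug=query shows how 
the query string was parsed (useful for seeing how many unigram/bigram terms a 
Japanese string expands into), and debug=timing breaks QTime down per search 
component. A sketch, with placeholder host/collection/field names:

```shell
SOLR="http://localhost:8983/solr/my-collection"   # placeholder

# debug=query  -> the parsed/expanded query
# debug=timing -> per-component timings (query, facet, highlight, ...)
# debug=all    -> everything, including per-document score explanations
URL="$SOLR/select?q=your_ja_field:TERM&debug=query&debug=timing&rows=0"
echo "$URL"
# curl "$URL"    # run against a live node
```

rows=0 keeps the response small while still exercising the full search, so the 
timing numbers reflect the query itself rather than document retrieval.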

-----Original Message-----
From: Webster Homer <webster.ho...@milliporesigma.com> 
Sent: Thursday, December 20, 2018 3:45 PM
To: solr-user@lucene.apache.org
Subject: Query kills Solrcloud

We are experiencing almost nightly Solr crashes due to Japanese queries. I've 
been able to determine that one of our field types seems to be the culprit. 
When I run a much reduced version of the query against our DEV SolrCloud, I see 
the memory usage jump from less than 1 GB to 5 GB using only a single field in 
the query. The collection is fairly small, ~411,000 documents, of which only 
~25,000 have searchable Japanese fields. I have been able to simplify the query 
to run against a single Japanese field in the schema. The JVM memory jumps from 
less than 1 GB to close to 5 GB, and back down. The QTime is 36959, which seems 
high for 2,500 documents; indeed, the single field I'm using in my test case 
has only 2,031 documents.

I extended the query to 5 fields and watched the memory usage in the Solr admin 
console. The memory usage goes to almost 6 GB with a QTime of 100909. The 
console shows connection errors, and when I look at the Cloud graph, all the 
replicas on the node where I submitted the query are down. In DEV the replicas 
eventually recover. In production, with the full query, which has many more 
fields in the qf parameter, the SolrCloud dies.
One example query term:
ジエチルアミノヒドロキシベンゾイル安息香酸ヘキシル

This is the field type that we have defined:
   <fieldtype name="text_deep_cjk" class="solr.TextField"
              positionIncrementGap="10000" autoGeneratePhraseQueries="false">
     <analyzer type="index">
       <!-- remove spaces between CJK characters -->
       <charFilter class="solr.PatternReplaceCharFilterFactory"
                   pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])"
                   replacement="$1"/>
       <tokenizer class="solr.ICUTokenizerFactory"/>
       <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
       <filter class="solr.CJKWidthFilterFactory"/>
       <!-- Transform Traditional Han to Simplified Han -->
       <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
       <!-- Transform Hiragana to Katakana just as was done for Endeca -->
       <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
       <!-- NFKC, case folding, diacritics removed -->
       <filter class="solr.ICUFoldingFilterFactory"/>
       <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
               katakana="true" hangul="true" outputUnigrams="true"/>
     </analyzer>

     <analyzer type="query">
       <!-- remove spaces between CJK characters -->
       <charFilter class="solr.PatternReplaceCharFilterFactory"
                   pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])"
                   replacement="$1"/>
       <tokenizer class="solr.ICUTokenizerFactory"/>
       <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
               ignoreCase="true" expand="true" tokenizerFactory="solr.ICUTokenizerFactory"/>
       <filter class="solr.CJKWidthFilterFactory"/>
       <!-- Transform Traditional Han to Simplified Han -->
       <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
       <!-- Transform Hiragana to Katakana just as was done for Endeca -->
       <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
       <!-- NFKC, case folding, diacritics removed -->
       <filter class="solr.ICUFoldingFilterFactory"/>
       <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
               katakana="true" hangul="true" outputUnigrams="true"/>
     </analyzer>
   </fieldtype>
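To see what this chain actually emits for a problem term without running a 
search, you can point the field analysis endpoint at the field type directly 
(collection name below is a placeholder; the admin UI's Analysis screen does 
the same thing). Note that with outputUnigrams="true" the CJKBigramFilter emits 
both unigrams and bigrams, so an n-character Katakana run produces roughly 
2n-1 tokens; the analysis output will show that expansion.

```shell
SOLR="http://localhost:8983/solr/my-collection"   # placeholder

# The analysis handler runs the index and query analyzers over a sample
# value and returns the tokens produced at each stage of the chain,
# without executing a search.
URL="$SOLR/analysis/field"
echo "$URL"
# Run against a live node; --data-urlencode handles the Japanese term:
# curl -G "$URL" \
#      --data-urlencode "analysis.fieldtype=text_deep_cjk" \
#      --data-urlencode "analysis.fieldvalue=ジエチルアミノヒドロキシベンゾイル安息香酸ヘキシル"
```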

Why is searching even one field of this type so expensive?
I suspect that this is data related, as other queries return in far less than a 
second. What are good strategies for determining which documents are causing 
the problem? I'm new to debugging Solr, so I could use some help. I'd like to 
reduce the number of records to a minimum to create a small dataset that 
reproduces the problem.
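One way to narrow it down would be to bisect the matching documents with fq 
filters on the unique key, shrinking the range until the slow behaviour 
isolates to a handful of ids. A rough sketch (host, collection, field names, 
and the id range are all placeholders):

```shell
SOLR="http://localhost:8983/solr/my-collection"   # placeholder
Q="your_ja_field:TERM"                            # the slow query

# Keep the slow query constant and vary only the fq. Since fq restricts
# the result set without changing the main query, a large QTime
# difference between the two halves points at the half containing the
# problem documents; recurse into that half.
URL="$SOLR/select?q=$Q&fq=id:[a%20TO%20m]&rows=0"
echo "$URL"
# curl -g "$URL"    # -g keeps curl from globbing the [ ] in the range
```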
Right now our only option is to stop using this field type, but it does improve 
the relevancy of the searches that don't cause Solr to crash.

It would be a great help if the Solr admin console did not time out on these 
queries. Is there a way to turn off the timeout?
We are running Solr 7.2.
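As far as I know there is no admin-UI setting for that timeout, but the 
timeAllowed request parameter caps how long the search itself may run: queries 
that exceed it return whatever partial results were collected, with 
partialResults=true in the responseHeader, instead of grinding on. A sketch 
(placeholders as before):

```shell
SOLR="http://localhost:8983/solr/my-collection"   # placeholder

# timeAllowed is in milliseconds; exceeding it yields partial results
# and sets partialResults=true in the responseHeader.
URL="$SOLR/select?q=your_ja_field:TERM&timeAllowed=5000&rows=10"
echo "$URL"
# curl "$URL"    # run against a live node
```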
