Re: Solr takes time to warm up core with huge data

Erick Erickson Tue, 09 Jun 2020 05:11:10 -0700

I’d ignore the form of the query for the present; I think that’s a red herring.


Start by taking all your sort clauses off. Then add them back one by one
(you have to restart Solr between these experiments). My bet: your
problem is “uninverting”, and you’ll see your startup speed get worse the
more clauses you add. I don’t expect every field to add equally; ones
with more unique values will probably be worse.
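
To make those experiments repeatable, something like the following could
generate the query strings. This is only a sketch: the sort clauses are
copied from the query quoted below, the base parameters are trimmed to a
couple of the original ones, and the function name is made up for
illustration.

```python
from urllib.parse import urlencode

# Sort clauses copied from the query quoted further down in this thread.
SORT_CLAUSES = (
    ["MODIFY_TS desc", "LOGICAL_SECT_NAME asc", "TRACK_ID desc",
     "TRACK_INTER_ID asc"]
    + [f"PHY_KEY{i} asc" for i in range(1, 11)]
    + ["FIELD_NAME asc"]
)


def experiment_queries(clauses):
    """Yield (n, query_string) using only the first n sort clauses."""
    # Trimmed-down base parameters; substitute the real fq clauses as needed.
    base = [("q", "*:*"), ("fq", "PARENT_DOC_ID:100"), ("rows", "1000")]
    for n in range(len(clauses) + 1):
        params = list(base)
        if n:
            params.append(("sort", ",".join(clauses[:n])))
        yield n, urlencode(params)


if __name__ == "__main__":
    for n, qs in experiment_queries(SORT_CLAUSES):
        # Restart Solr before each run, then time a request to the core's
        # select handler with this query string.
        print(n, qs)
```

Run 0 has no sort at all, which gives the baseline to compare the other
runs against.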

Or, if you have the ability, recreate your index and add docValues=true
to _all_ fields that you use to sort.

indexed=true is great for searches, i.e. for answering “for term X in
field Y, what docs contain it?”.

It’s rotten for sorting, though, where the question is “for doc X, what
term is in field Y?”

So to sort when indexed=true is all you have, the entire field has to be
read into memory and “uninverted”. Basically this is a table scan that
builds the doc-to-term structure on the heap, which is a very expensive
operation.
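
A toy illustration of the two structures, in case it helps. This is not
how Lucene stores anything internally, just a sketch of the idea with
made-up field values and function names:

```python
from collections import defaultdict


def build_inverted(values):
    """term -> list of doc ids: the question indexed=true answers cheaply."""
    inverted = defaultdict(list)
    for doc_id, term in enumerate(values):
        inverted[term].append(doc_id)
    return dict(inverted)


def uninvert(inverted, max_doc):
    """Scan every term's postings to rebuild doc id -> term in memory."""
    forward = [None] * max_doc
    for term, postings in inverted.items():
        for doc_id in postings:
            forward[doc_id] = term
    return forward


# Hypothetical single-valued field; the list position is the doc id.
values = ["HQ3", "HQ1", "HQ3", "HQ2"]
inverted = build_inverted(values)

# Sorting needs "for doc X, what term is in field Y?", so the whole
# inverted structure has to be walked once before any doc can be compared.
forward = uninvert(inverted, len(values))
order = sorted(range(len(values)), key=lambda doc_id: forward[doc_id])
print(order)
```

The cost of uninvert() grows with the number of docs and terms in the
field, and the result lives entirely in memory.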

Setting docValues=true means that this uninverted structure is built at
index time and serialized to disk. So rather than uninvert the indexed
data for a field that’s being used for sorting (or faceting, or grouping,
or function queries) on the heap, the uninverted structure is just read
in off disk, which is much, much, much faster.

That also reduces the pressure on heap memory because Lucene keeps most of the 
index in MMapDirectory space, see:
https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
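
If you manage the schema through the Schema API, the change could look
roughly like the sketch below. It only builds the JSON bodies for the
"replace-field" command and does not send anything. The field type names
("string", "pdate") are assumptions based on the default configset, so
check them against your own schema, and the helper name is made up.

```python
import json


def replace_field_command(name, field_type):
    """Build a Schema API 'replace-field' body that turns docValues on."""
    return {
        "replace-field": {
            "name": name,
            "type": field_type,  # assumed type name; must exist in the schema
            "indexed": True,
            "stored": True,
            "docValues": True,
        }
    }


# A few of the sort fields from the query in this thread, with assumed types.
SORT_FIELDS = {
    "MODIFY_TS": "pdate",
    "LOGICAL_SECT_NAME": "string",
    "TRACK_ID": "string",
}

if __name__ == "__main__":
    for name, field_type in SORT_FIELDS.items():
        # Each body would be POSTed to the core's /schema endpoint.
        print(json.dumps(replace_field_command(name, field_type)))
```

Changing the definition is not enough on its own. The index has to be
recreated afterwards so the docValues data actually gets written.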

Best,
Erick

> On Jun 8, 2020, at 10:10 PM, Srinivas Kashyap 
> <srini...@bamboorose.com.INVALID> wrote:
> 
> Hi Shawn,
> 
> It's a vague question and I haven't tried it out yet.
> 
> Can I instead mention query as below:
> 
> Basically instead of
> 
> 
> 
> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"BAMBOOROSE"&rows=1000&sort=MODIFY_TS 
> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 
> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 
> asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
> 
> 
> 
> pass
> 
> 
> 
> q=PHY_KEY2:" HQ012206"+AND+PHY_KEY1:" BAMBOOROSE 
> "&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
> *]&rows=1000&sort=MODIFY_TS desc,LOGICAL_SECT_NAME asc,TRACK_ID 
> desc,TRACK_INTER_ID asc,PHY_KEY1 asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 
> asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 
> asc,PHY_KEY10 asc,FIELD_NAME asc
> 
> 
> Instead of q=*:* I pass only those fields which I want to retrieve. Will this 
> be faster?
> 
> Related to earlier question:
> We are using Solr 8.4.1.
> All the fields I'm sorting on are string type (MODIFY_TS is a date) with
> indexed=true stored=true.
> 
> 
> Thanks,
> Srinivas
> 
> 
> On 05-Jun-2020 9:50 pm, Shawn Heisey <apa...@elyograg.org> wrote:
> On 6/5/2020 12:17 AM, Srinivas Kashyap wrote:
>> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
>> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS 
>> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 
>> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 
>> asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
>> 
>> This was the original query. Since there were a lot of sort fields, we
>> decided not to sort on the Solr side and instead fetch the query response
>> and do the sorting outside Solr. This eliminated the need for the extra JVM
>> memory that had been allocated. Every time we ran this query, Solr would
>> crash after exceeding the JVM memory. Now we are only running filter queries.
> 
> What Solr version, and what is the definition of each of the fields
> you're sorting on? If the definition doesn't include docValues, then a
> large on-heap memory structure will be created for sorting (VERY large
> with 500 million docs), and I wouldn't be surprised if it's created even
> if it is never used. The definition for any field you use for sorting
> should definitely include docValues. In recent Solr versions, docValues
> defaults to true for most field types. Some field classes, TextField in
> particular, cannot have docValues.
> 
> There's something else to discuss about sort params -- each sort field
> will only be used if ALL of the previous sort fields are identical for
> two documents in the full numFound result set. Having more than two or
> three sort fields is usually pointless. My guess (which I know could be
> wrong) is that most queries with this HUGE sort parameter will never use
> anything beyond TRACK_ID.
> 
>> And regarding the filter cache, it is in default setup: (we are using 
>> default solrconfig.xml, and we have only added the request handler for DIH)
>> 
>> <filterCache class="solr.FastLRUCache"
>> size="512"
>> initialSize="512"
>> autowarmCount="0"/>
> 
> This is way too big for your index, and a prime candidate for why your
> heap requirements are so high. Like I said before, if the filterCache
> on your system actually reaches this max size, it will require 30GB of
> memory JUST for the filterCache on this core. Can you check the admin
> UI to determine what the size is and what hit ratio it's getting? (1.0
> is 100% on the hit ratio). I'd probably start with a size of 32 or 64
> on this cache. With a size of 64, a little less than 4GB would be the
> max heap allocated for the cache. You can experiment... but with 500
> million docs, the filterCache size should be pretty small.
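
The numbers above can be roughly reproduced with the arithmetic below. It
is a worst-case sketch that assumes every cache entry is a full bitset
with one bit per document; small result sets may be stored more compactly,
so real usage can be lower. The function name is made up.

```python
def filter_cache_max_bytes(max_doc, cache_size):
    """Worst-case filterCache heap use: one bit per doc for every entry."""
    bytes_per_entry = max_doc // 8
    return bytes_per_entry * cache_size


MAX_DOC = 500_000_000  # document count mentioned in this thread

if __name__ == "__main__":
    for size in (512, 64, 32):
        gib = filter_cache_max_bytes(MAX_DOC, size) / 2**30
        print(f"size={size}: about {gib:.1f} GiB worst case")
```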
> 
> You're going to want to carefully digest this part of that wiki page
> that I linked earlier. Hopefully email will preserve this link completely:
> 
> https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-Reducingheaprequirements
> 
> Thanks,
> Shawn
> 
