Salman,

You say that you optimized your index from Admin. You should not do that, 
however strange it sounds.
70M docs on 2 shards means 35M docs per shard. When you call optimize, you
force Lucene to merge all those 35M docs into ONE SINGLE index segment. You
get better hardware utilization if you let Lucene/Solr handle merging
automatically, meaning you’ll have around 10 smaller segments that are
faster to search across than one huge segment.
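If you want to steer how many segments Lucene keeps, the merge policy can
be tuned in solrconfig.xml. A sketch only - the values below are
illustrative, not tuned recommendations for your index:

```xml
<!-- solrconfig.xml (illustrative values, not a recommendation):
     let Lucene merge segments in the background instead of optimizing.
     segmentsPerTier around 10 matches the "around 10 segments" above. -->
<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
</indexConfig>
```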

Your cache settings are way too high. Remember “size” here is the number of
*entries*, not the number of bytes. Start with, say, 100, let the system run
for a while under realistic query load, and then use the cache statistics to
decide: a high hit rate means the cache is useful, and a high eviction rate
alongside it could indicate that you would benefit from an increase.
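As a starting point, something like the fragment below in solrconfig.xml,
then grow only if the statistics justify it (the numbers are illustrative,
not a recommendation for your workload):

```xml
<!-- Illustrative starting point: small caches first, then adjust
     based on observed hit rate and evictions in the admin UI. -->
<filterCache class="solr.FastLRUCache"
             size="100"
             initialSize="100"
             autowarmCount="10"/>
<documentCache class="solr.LRUCache"
               size="100"
               initialSize="100"
               autowarmCount="0"/>
```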

I would not concern myself with high paging offsets unless there is
something very special about your use case that justifies focusing much
energy on it. People just don’t page beyond page 10 :) and if they do, you
should focus on improving relevancy first.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 11 Oct 2015, at 06:54, Shawn Heisey <apa...@elyograg.org> wrote:
> 
> On 10/10/2015 2:55 AM, Salman Ansari wrote:
>> Thanks Shawn for your response. Based on that
>> 1) Can you please direct me where I can get more information about cold
>> shard vs hot shard?
> 
> I don't know of any information out there about hot/cold shards.  I can
> describe it, though:
> 
> A split point is determined.  Everything older than the split point gets
> divided by some method (usually hashing) between multiple cold shards.
> Everything newer than the split point goes into the hot shard.  For my
> index, there is only one hot shard, but it is possible to have multiple
> hot shards.
> 
> On some interval (nightly in my index), the split point is adjusted and
> documents are moved from the hot shard to the cold shards according to
> that split point.  The hot shard is typically a lot smaller than the
> cold shards, which helps increase indexing speed for new documents.
> 
> I am not using SolrCloud. I manage all my own sharding. There is no
> capability included in SolrCloud that can do hot/cold sharding.
> 
>> 2)  That 10GB number assumes there's no other software on the machine, like
>> a database server or a webserver.
>> Yes the machine is dedicated for Solr
>> 
>> 3) How much index data is on the machine?
>> I have 3 collections: 2 for testing (the aggregate of both does not
>> exceed 1M documents) and the main collection that I am querying now,
>> which contains around 69M. I have distributed all my collections into 2
>> shards, each with 2 replicas. The consumption on the hard disk is about 40GB.
> 
> That sounds like a recipe for a performance problem, although I am not
> certain why the problem persisted after increasing the memory.  Perhaps
> it has something to do with the filterCache, which I will get to further
> down.
> 
>> 4) A memory size of 14GB would be unusual for a physical machine, and makes 
>> me
>> wonder if you're using virtual machines
>> Yes I am using virtual machine as using a bare metal will be difficult in
>> my case as all of our data center is on the cloud. I can increase its
>> capacity though. While testing some edge cases on Solr, I realized on Solr
>> admin that the memory sometimes reaches to its limit (14GB RAM, and 4GB JVM)
> 
> This is how operating systems and Java are designed to work.  When
> things are running well, all of physical memory might be allocated, and
> the heap will become full on a semi-regular basis.  If it *stays* full,
> that usually means it needs to be larger.  The admin UI is a poor tool
> for watching JVM memory usage.
> 
>> 5) Just to confirm, I have combined the lessons from
>> 
>> http://www.slideshare.net/lucidworks/high-performance-solr-and-jvm-tuning-strategies-used-for-map-quests-search-ahead-darren-spehr
>> AND
>> https://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
>> 
>> to come up with the following settings
>> 
>> FilterCache
>> 
>>    <filterCache class="solr.FastLRUCache"
>>                 size="16384"
>>                 initialSize="4096"
>>                 autowarmCount="4096"/>
> 
> That's a very very large cache size.  It is likely to use a VERY large
> amount of heap, and autowarming up to 4096 entries at commit time might
> take many *minutes*.  Each filterCache entry is maxDoc/8 bytes.  On an
> index core with 70 million documents, each filterCache entry is at least
> 8.75 million bytes.  Multiply that by 16384, and a completely full cache
> would need about 140GB of heap memory.  4096 entries will require 35GB.
> I don't think this cache is actually storing that many entries, or you
> would most certainly be running into OutOfMemoryError exceptions.
> 
>>    <documentCache class="solr.LRUCache"
>>                   size="16384"
>>                   initialSize="16384"
>>                   autowarmCount="0"/>
>> 
>> newSearcher and firstSearcher
>> 
>> <listener event="newSearcher" class="solr.QuerySenderListener">
>>   <arr name="queries">
>>     <lst><str name="q">*</str><str name="sort">score desc, id desc</str></lst>
>>   </arr>
>> </listener>
>> <listener event="firstSearcher" class="solr.QuerySenderListener">
>>   <arr name="queries">
>>     <lst><str name="q">*</str><str name="sort">score desc, id desc</str></lst>
>>     <!-- seed common facets and filter queries -->
>>     <lst><str name="q">*</str>
>>          <str name="facet.field">category</str></lst>
>>   </arr>
>> </listener>
>> 
>> Will this use more cache in Solr and prepopulate it?
> 
> The newSearcher entry will result in one entry in the queryResultCache,
> and an unknown number of entries in the documentCache -- that depends on
> the "rows" parameter on the /select handler (defaults to 10) and the
> queryResultMaxDocsCached parameter.
> 
> The firstSearcher entry does two queries, but because the "q" parameter
> is identical on them, it will only result in one entry in the
> queryResultCache.  One of them has facet.field, but you did not include
> facet=true, so the facet query will not actually be run.  Without the
> facet query, the filterCache will not be populated.
> 
> I think the design intent for newSearcher and firstSearcher is to load
> critical index data into the OS disk cache.  It's not so much about
> warming the Solr caches as it is about priming the system as a whole.
> 
> Note that the wildcard query you are running (q=*) is relatively slow,
> but is an excellent choice for a warming query, because it actually
> reads every single term from the default field.  Because of how slow
> this query can run, setting useColdSearcher to true is recommended.
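> A minimal sketch of that setting in solrconfig.xml (illustrative, not
> from the original message):
>
> ```xml
> <!-- solrconfig.xml: serve requests from a cold (unwarmed) searcher
>      rather than blocking until slow warming queries like q=* finish. -->
> <query>
>   <useColdSearcher>true</useColdSearcher>
> </query>
> ```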
> 
> Thanks,
> Shawn
> 
