Hi,

Hard to tell, but here are some tips:
* Those are massive caches. Rethink their size. More specifically, plug in
  some monitoring tool and see what you are actually getting out of them.
  Just today I looked at one Sematext client's caches - 200K entries,
  0 evictions ==> a needless waste of JVM heap. So lower those numbers and
  increase them only if you are getting evictions.
* &debugQuery=true output will tell you something about timings, etc.
* Consider edismax and the qf param instead of that copyField stuff - info
  on zee Wiki.
* Back to monitoring - what is your bottleneck? The query looks simple.
  Is it IO? Memory? CPU? Share some graphs and let's look.

(I've appended a few config/query sketches below your quoted message to
make the cache, debugQuery, and edismax points concrete.)

Otis
--
Solr & ElasticSearch Support - http://sematext.com/
Performance Monitoring - http://sematext.com/spm/index.html


On Thu, Jun 13, 2013 at 7:53 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
> Hello,
>
> I am evaluating Solr for indexing about 45M product catalog entries. The
> catalog mainly contains title and description, which take most of the
> space (other attributes are brand, category, price, etc.).
>
> The data is stored in Cassandra and I am using DataStax's Solr (DSE 3.0.2),
> which handles incremental updates. The column family I am indexing is
> about 50GB in size and solr.data's size is about 15GB for now.
>
> *Points of interest in solr config/schema:*
> 1. schema.xml has a copyField called allText which merges title and
> description.
> 2. solrconfig has the following config:
>
> <directoryFactory name="DirectoryFactory"
>     class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/>
> <indexConfig>
>
> <filterCache class="solr.FastLRUCache"
>              size="512"
>              initialSize="512"
>              autowarmCount="0"/>
> <queryResultCache class="solr.LRUCache"
>                   size="1000000"
>                   initialSize="1000000"
>                   autowarmCount="100000"/>
> <documentCache class="solr.LRUCache"
>                size="50000000"
>                initialSize="5000000"
>                autowarmCount="0"/>
>
> *Relevancy:*
> Now, the default "text matching" does not suit our search needs, so I have
> implemented a wrapper around the Solr API which adds boost queries to the
> default Solr query. For example:
>
> Original query: ipod
> Final query: allText:ipod^1000, allText:apple^1000, allText:music^950, etc.
>
> So as you can see, I construct a new query based on related keywords and
> assign a score to those keywords based on relevance. This approach looks
> good and the results look relevant.
>
> But I am having issues with *Solr performance*.
>
> *Problems:*
> The initial training pulls 2000 documents from Solr to find the most
> probable matches and calculates a score (PMI/NPMI). This query is
> extremely slow. A regular query also takes 3-4 seconds.
> I am currently running Solr on just one VM with 12GB RAM, 8GB of heap
> allocated to Solr, and block storage on an SSD.
>
> What is the suggested setup for this use case?
> My guess is that setting up 4 Solr nodes will help, but what is the
> suggested RAM/heap for this kind of data?
> And what is the recommended configuration (solrconfig.xml), given that I
> *need to speed up reads*?
>
> Also, is there a way I can debug what is going on with Solr internally?
> As you can see, my queries are not that complex, so I don't need to debug
> my queries but just debug Solr and see the troubled pieces in it.
>
> Also, I am new to Solr, so is there anything else I missed sharing that
> would help debug the problem?
>
> --
> Thanks,
> -Utkarsh
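
P.S. The promised sketches. First, the caches. These numbers are just a
starting point, not a recommendation for your data - grow a cache only when
monitoring shows evictions. Also note that in solrconfig.xml the caches
belong inside the <query> section, not under <indexConfig>:

<query>
  <!-- one entry per unique fq; your 512 was already sane -->
  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>
  <!-- orders of magnitude smaller than your 1M entries -->
  <queryResultCache class="solr.LRUCache"
                    size="512"
                    initialSize="512"
                    autowarmCount="128"/>
  <!-- vs. your 50M: entries here are whole stored docs, so this eats heap
       fast. documentCache can't be usefully autowarmed anyway (internal
       doc IDs change between searchers), so keep autowarmCount at 0 -->
  <documentCache class="solr.LRUCache"
                 size="16384"
                 initialSize="4096"
                 autowarmCount="0"/>
</query>

Remember that MMapDirectory relies on the OS page cache, so RAM left
outside the JVM heap is not wasted - 8GB of heap on a 12GB box may
actually be hurting you.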
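
Second, debugging a single query. Hit Solr directly with debugQuery (host
and core name below are placeholders - use your own):

http://localhost:8983/solr/yourcore/select?q=allText:ipod&debugQuery=true

The "timing" section of the response breaks prepare/process time down per
search component, and the "explain" section shows how each hit was scored.
For the slow training query, also check whether pulling 2000 docs per
request is the real cost - run the same query with rows=0 and compare.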
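
Third, edismax instead of the copyField + hand-built query strings.
Something along these lines in solrconfig.xml (handler name, field names,
and boosts are made up - adjust to your schema):

<requestHandler name="/products" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- search the original fields directly, with per-field boosts -->
    <str name="qf">title^2 description</str>
  </lst>
</requestHandler>

Your wrapper can then pass its related-keyword boosts as separate bq
params instead of concatenating them into q:

q=ipod&bq=allText:apple^1000&bq=allText:music^950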