Hi,

Hard to tell, but here are some tips:
* Those are massive caches. Rethink their size. More specifically, plug in
  some monitoring tool and see what you are actually getting out of them.
  Just today I looked at one Sematext client's caches - 200K entries,
  0 evictions ==> a needless waste of JVM heap. So lower those numbers and
  increase them only if you are getting evictions.
* &debugQuery=true output will tell you something about timings, etc.
* Consider edismax and the qf param instead of that copyField stuff - info
  on zee Wiki.
* Back to monitoring - what is your bottleneck? The query looks simple.
  Is it IO? Memory? CPU? Share some graphs and let's look.

(I've appended a few config/query sketches below your quoted message to
make the cache, debugQuery, and edismax points concrete.)

Otis
--
Solr & ElasticSearch Support - http://sematext.com/
Performance Monitoring - http://sematext.com/spm/index.html


On Thu, Jun 13, 2013 at 7:53 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
> Hello,
>
> I am evaluating Solr for indexing about 45M product catalog entries. The
> catalog mainly contains title and description, which take most of the
> space (other attributes are brand, category, price, etc.).
>
> The data is stored in Cassandra and I am using DataStax's Solr (DSE 3.0.2),
> which handles incremental updates. The column family I am indexing is
> about 50GB in size and solr.data's size is about 15GB for now.
>
> *Points of interest in solr config/schema:*
> 1. schema.xml has a copyField called allText which merges title and
> description.
> 2. solrconfig has the following config:
>
> <directoryFactory name="DirectoryFactory"
>     class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/>
> <indexConfig>
>
> <filterCache class="solr.FastLRUCache"
>              size="512"
>              initialSize="512"
>              autowarmCount="0"/>
> <queryResultCache class="solr.LRUCache"
>                   size="1000000"
>                   initialSize="1000000"
>                   autowarmCount="100000"/>
> <documentCache class="solr.LRUCache"
>                size="50000000"
>                initialSize="5000000"
>                autowarmCount="0"/>
>
> *Relevancy:*
> Now, the default "text matching" does not suit our search needs, so I have
> implemented a wrapper around the Solr API which adds boost queries to the
> default Solr query. For example:
>
> Original query: ipod
> Final query: allText:ipod^1000, allText:apple^1000, allText:music^950, etc.
>
> So as you can see, I construct a new query based on related keywords and
> assign a score to those keywords based on relevance. This approach looks
> good and the results look relevant.
>
> But I am having issues with *Solr performance*.
>
> *Problems:*
> The initial training pulls 2000 documents from Solr to find the most
> probable matches and calculates a score (PMI/NPMI). This query is
> extremely slow. A regular query also takes 3-4 seconds.
> I am currently running Solr on just one VM with 12GB RAM, 8GB of heap
> allocated to Solr, and block storage on an SSD.
>
> What is the suggested setup for this use case?
> My guess is that setting up 4 Solr nodes will help, but what is the
> suggested RAM/heap for this kind of data?
> And what is the recommended configuration (solrconfig.xml), given that I
> *need to speed up reads*?
>
> Also, is there a way I can debug what is going on with Solr internally?
> As you can see, my queries are not that complex, so I don't need to debug
> my queries but just debug Solr and see the troubled pieces in it.
>
> Also, I am new to Solr, so is there anything else I missed sharing that
> would help debug the problem?
>
> --
> Thanks,
> -Utkarsh
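
P.S. The promised sketches. First, the caches. These numbers are just a
starting point, not a recommendation for your data - grow a cache only when
monitoring shows evictions. Also note that in solrconfig.xml the caches
belong inside the <query> section, not under <indexConfig>:

<query>
  <!-- one entry per unique fq; your 512 was already sane -->
  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>
  <!-- orders of magnitude smaller than your 1M entries -->
  <queryResultCache class="solr.LRUCache"
                    size="512"
                    initialSize="512"
                    autowarmCount="128"/>
  <!-- vs. your 50M: entries here are whole stored docs, so this eats heap
       fast. documentCache can't be usefully autowarmed anyway (internal
       doc IDs change between searchers), so keep autowarmCount at 0 -->
  <documentCache class="solr.LRUCache"
                 size="16384"
                 initialSize="4096"
                 autowarmCount="0"/>
</query>

Remember that MMapDirectory relies on the OS page cache, so RAM left
outside the JVM heap is not wasted - 8GB of heap on a 12GB box may
actually be hurting you.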
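
Second, debugging a single query. Hit Solr directly with debugQuery (host
and core name below are placeholders - use your own):

http://localhost:8983/solr/yourcore/select?q=allText:ipod&debugQuery=true

The "timing" section of the response breaks prepare/process time down per
search component, and the "explain" section shows how each hit was scored.
For the slow training query, also check whether pulling 2000 docs per
request is the real cost - run the same query with rows=0 and compare.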
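
Third, edismax instead of the copyField + hand-built query strings.
Something along these lines in solrconfig.xml (handler name, field names,
and boosts are made up - adjust to your schema):

<requestHandler name="/products" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- search the original fields directly, with per-field boosts -->
    <str name="qf">title^2 description</str>
  </lst>
</requestHandler>

Your wrapper can then pass its related-keyword boosts as separate bq
params instead of concatenating them into q:

q=ipod&bq=allText:apple^1000&bq=allText:music^950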