Hello,

I am evaluating Solr for indexing a product catalog of about 45M entries.
Each entry mainly consists of a title and a description, which take up most
of the space (other attributes are brand, category, price, etc.).

The data is stored in Cassandra and I am using DataStax's Solr (DSE 3.0.2),
which handles incremental updates. The column family I am indexing is about
50GB in size, and solr.data is about 15GB for now.

*Points of interest in solr config/schema:*
1. schema.xml has a copyField called allText which merges title and
description.
2. solrconfig.xml has the following config:

<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/>
  <indexConfig>

    <filterCache class="solr.FastLRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache"
                     size="1000000"
                     initialSize="1000000"
                     autowarmCount="100000"/>
    <documentCache class="solr.LRUCache"
                   size="50000000"
                   initialSize="5000000"
                   autowarmCount="0"/>




*Relevancy:*
The default "text matching" does not suit our search needs, so I have
implemented a wrapper around the Solr API which adds boost queries to the
default Solr query. For example:

Original query: ipod
Final Query: allText:ipod^1000, allText:apple^1000, allText:music^950 etc.

As you can see, I construct a new query from related keywords and assign
each keyword a boost based on its relevance. This approach works well and
the results look relevant.
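
To make the expansion step concrete, here is a minimal sketch in Python of
how such a boosted query string can be assembled. It is a simplified
illustration, not the actual wrapper code: the function name is
hypothetical, the clauses are joined with spaces (relying on OR as the
default operator, which is an assumption), and terms are assumed to be
single tokens with no special characters that would need escaping.

```python
def build_boosted_query(field, weighted_terms):
    """Turn (term, boost) pairs into a boosted query string on one field.

    Hypothetical helper for illustration only. Terms are assumed to be
    single tokens without query-syntax characters that need escaping.
    """
    clauses = []
    for term, boost in weighted_terms:
        # Each clause has the form field:term^boost
        clauses.append(f"{field}:{term}^{boost}")
    # Assumption: clauses are space-separated and combined with the
    # default operator (OR).
    return " ".join(clauses)


related = [("ipod", 1000), ("apple", 1000), ("music", 950)]
print(build_boosted_query("allText", related))
```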


But I am having issues with *Solr performance*.

*Problems:*
The initial training pulls 2000 documents from Solr to find the most
probable matches and calculates a score (PMI/NPMI) for them. This query is
extremely slow. A regular query also takes 3-4 seconds.
I am currently running Solr on just one VM with 12GB of RAM, of which 8GB
is allocated to the Solr heap; the block storage is an SSD.
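
For reference, the scoring mentioned above can be sketched as follows,
assuming the standard definitions of PMI and NPMI computed from
co-occurrence counts. The function and parameter names are hypothetical,
and the counts in the example are made up; this only illustrates the kind
of calculation done on the 2000 fetched documents, not the exact code.

```python
import math


def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information of terms x and y from document counts.

    n_xy: documents containing both terms, n_x / n_y: documents containing
    each term, n_total: documents examined. Names are illustrative.
    """
    if n_xy == 0:
        return float("-inf")  # the terms never co-occur
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log(p_xy / (p_x * p_y))


def npmi(n_xy, n_x, n_y, n_total):
    """Normalized PMI in [-1, 1]: PMI divided by -log p(x, y)."""
    if n_xy == 0:
        return -1.0  # limit value when the terms never co-occur
    if n_xy == n_total:
        return 1.0  # both terms appear in every document
    p_xy = n_xy / n_total
    return pmi(n_xy, n_x, n_y, n_total) / -math.log(p_xy)


# Made-up counts: out of 2000 documents, 500 contain both terms.
print(npmi(500, 500, 500, 2000))
```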

What is the suggested setup for this use case?
My guess is that setting up 4 Solr nodes will help, but what is the
suggested RAM/heap for this amount of data?
And what is the recommended configuration (solrconfig.xml) when I *need
to speed up reads*?

Also, is there a way I can debug what is going on inside Solr? As you can
see, my queries are not that complex, so I don't think the queries
themselves need debugging; I would rather inspect Solr and see which parts
of it are slow.

Also, I am new to Solr, so is there anything else I forgot to share that
would help debug the problem?

-- 
Thanks,
-Utkarsh
