Hello,

I am evaluating Solr for indexing a product catalog of about 45M items. The catalog mainly contains title and description, which take up most of the space (other attributes are brand, category, price, etc.).
The data is stored in Cassandra and I am using DataStax's Solr (DSE 3.0.2), which handles incremental updates. The column family I am indexing is about 50GB in size, and solr.data is about 15GB for now.

*Points of interest in solr config/schema:*

1. schema.xml has a copyField called allText which merges title and description.
2. solrconfig.xml has the following config:

<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/>
<indexConfig>
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="1000000" initialSize="1000000" autowarmCount="100000"/>
<documentCache class="solr.LRUCache" size="50000000" initialSize="5000000" autowarmCount="0"/>

*Relevancy:*

The default "text matching" does not suit our search needs, so I have implemented a wrapper around the Solr API which adds boost queries to the default Solr query. For example:

Original query: ipod
Final query: allText:ipod^1000, allText:apple^1000, allText:music^950 etc.

So, as you can see, I construct a new query from related keywords and assign a boost to each keyword based on its relevance. This approach looks good and the results look relevant, but I am having issues with *Solr performance*.

*Problems:*

The initial training pulls 2000 documents from Solr to find the most probable matches and calculates a score (PMI/NPMI) for them. This query is extremely slow. A regular query also takes 3-4 seconds.

I am currently running Solr on just one VM with 12GB of RAM, of which 8GB is allocated to the Solr heap. The block storage is an SSD.

What is the suggested setup for this use case? My guess is that setting up 4 Solr nodes will help, but what is the suggested RAM/heap for this amount of data? And what is the recommended configuration (solrconfig.xml) when I *need to speed up reads*?

Also, is there a way I can debug what is going on inside Solr?
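A rough sketch of the query rewriting described above, in Python. It is not the actual wrapper: the function names, the count-based NPMI inputs, the OR joining of clauses, and the linear mapping from score to boost are all assumptions made for illustration.

```python
import math


def npmi(n_xy, n_x, n_y, n_total):
    """Normalized PMI in [-1, 1], computed from document counts.

    n_xy: documents containing both terms (hypothetical input)
    n_x, n_y: documents containing each term
    n_total: number of documents sampled (e.g. the 2000 pulled from Solr)
    """
    if n_xy == 0:
        return -1.0  # the terms never co-occur
    p_xy = n_xy / n_total
    if p_xy == 1.0:
        return 1.0  # both terms occur in every document
    p_x = n_x / n_total
    p_y = n_y / n_total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


def build_boosted_query(term, related, field="allText", max_boost=1000):
    """Build a boosted query string from a term and its related keywords.

    related maps keyword -> relevance score in (0, 1]; scaling the score
    linearly to a boost is an assumption, not the real scoring rule.
    Terms are not escaped here, so special characters need handling first.
    """
    clauses = [f"{field}:{term}^{max_boost}"]
    for keyword, score in sorted(related.items(), key=lambda kv: -kv[1]):
        boost = round(max_boost * score)
        if boost > 0:  # drop keywords that would get no boost
            clauses.append(f"{field}:{keyword}^{boost}")
    return " OR ".join(clauses)


query = build_boosted_query("ipod", {"apple": 1.0, "music": 0.95})
```

Here `query` is the string that would be sent as the `q` parameter; the same structure applies whatever scoring actually produces the boosts.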
As you can see, my queries are not that complex, so I don't need to debug the queries themselves; I want to debug Solr and find the trouble spots in it. Also, I am new to Solr, so is there anything else I have left out that would help debug the problem?

--
Thanks,
-Utkarsh