Hello All,

We have been building a Solr-based solution to hold a large amount of data (approx. 4 TB/day, or more than 24 billion documents per day). We are developing a prototype on a small scale to evaluate Solr performance gradually. Here is our setup configuration.

Solr cloud:
node1: 16 GB RAM, 8-core CPU, 1 TB disk
node2: 16 GB RAM, 8-core CPU, 1 TB disk

ZooKeeper is also installed on the above 2 machines in cluster mode.

Solr commit intervals: soft commit 3 minutes, hard commit 15 seconds
Schema: basic configuration. 5 fields indexed (one of which is text_general), 6 fields stored.
Collection: 12 shards (6 per node)
Heap memory: 4 GB per node
Disk cache: 12 GB per node

Each document is a syslog message. Documents are ingested into Solr from different nodes: 12 SolrJ clients ingest data into the Solr cloud.

We experience issues when we keep the setup running for a long time, after the index reaches around 100 GB (i.e. around 600 million documents). Note that we are only indexing the data and not querying it, so there should not be any query overhead. From the VM analysis we found that over time the disk operations start declining, and so do the CPU, RAM and network usage of the Solr nodes. We concluded that Solr is unable to handle one big collection due to index read/write overhead, and that most of the time it ends up doing only commits (evident in the Solr logs). Because of that, indexing is getting hampered (?)

So we tried creating several small collections instead of one big collection, anticipating that commit performance might improve. But the performance eventually degrades even with that, and we observe more or less similar charts for CPU, memory, disk and network.

To put forth some stats, here is the number of documents processed every hour:

1st hour: 250 million
2nd hour: 250 million
3rd hour: 240 million
4th hour: 200 million
.
.
11th hour: 80 million

Could you please help us identify the root cause of the performance degradation? Are we doing something wrong with the Solr configuration, or with the collections/sharding etc.? Due to this performance degradation we are currently stuck with Solr.

Thank you very much in advance.

Prasad Tendulkar
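
P.S. For anyone who prefers the rates in documents per second, here is a rough Python sketch of the arithmetic behind the figures above. It assumes decimal terabytes for the 4 TB/day figure and uses only the hours we actually reported (hours 5 to 10 are left out). The variable and function names are purely illustrative.

```python
# Figures quoted in the post; hours 5-10 were not reported and are left out.
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

TARGET_DOCS_PER_DAY = 24_000_000_000  # "more than 24 billion documents per day"
TARGET_BYTES_PER_DAY = 4 * 10**12     # "approx. 4 TB/day" (assumption: decimal TB)

# Documents indexed in each reported hour of the prototype run.
docs_per_hour = {
    1: 250_000_000,
    2: 250_000_000,
    3: 240_000_000,
    4: 200_000_000,
    11: 80_000_000,
}


def docs_per_second(docs, seconds):
    """Average indexing rate over the given interval."""
    return docs / seconds


target_rate = docs_per_second(TARGET_DOCS_PER_DAY, SECONDS_PER_DAY)
avg_doc_bytes = TARGET_BYTES_PER_DAY / TARGET_DOCS_PER_DAY

print(f"target: {target_rate:,.0f} docs/s, ~{avg_doc_bytes:.0f} bytes/doc")
for hour, docs in sorted(docs_per_hour.items()):
    rate = docs_per_second(docs, SECONDS_PER_HOUR)
    print(f"hour {hour:>2}: {rate:>9,.0f} docs/s ({rate / target_rate:.0%} of target)")
```

On these numbers the two-node prototype starts at about a quarter of the eventual target rate and falls to about 8% of it by the 11th hour. The prototype is deliberately small, so the absolute gap matters less to us than the downward trend.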