In our case, we have ~350M documents stored on r3.xlarge nodes with an 8GB heap and about 31GB of RAM each.
We are using Solr 5.3.1 in a SolrCloud setup (3 collections, each with 3 shards and 3 replicas). For us, lots of RAM is not as important as CPU (the EBS disks we run on top of are quite fast and our memory hit rate is quite low). Some things that helped:

1) Turned off the filter cache; it required too much heap (there's a solrconfig.xml sketch at the bottom of this mail).

2) Set a limit on replication bandwidth, in particular maxWriteMBPerSec=100, since recovering nodes can tie up a lot of CPU.

3) Set the query timeout to 2 seconds to help kill "heavy" queries.

4) Set preferLocalShards=true to help mitigate when any EC2 node has a "noisy neighbor".

5) Implemented our own CloudWatch-based monitoring so that when Solr VM CPU is high (> 90%) we queue up indexing traffic rather than send it to be indexed (a rough sketch of that check is also at the bottom). We found that if you peg Solr CPU for too long, replicas can't keep up and go into recovery, which drives CPU even higher, and eventually the cluster marks the nodes "down" when they repeatedly fail at recovery. So we really try to manage Solr CPU load. (We'll probably look at switching to compute-optimized nodes in the future.)

Best,
-Frank

On 4/3/18, 9:12 PM, "Steven White" <swhite4...@gmail.com> wrote:

>Hi everyone,
>
>I'm about to start a project that requires indexing 36 million records
>using Solr 7.2.1. Each record ranges from 500 KB to 0.25 MB, where the
>average is 0.1 MB.
>
>Has anyone indexed this number of records? What are the things I should
>worry about? And out of curiosity, what is the largest number of records
>that Solr has indexed which is published out there?
>
>Thanks
>
>Steven
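
P.S. For anyone who wants the concrete knobs behind items 1-4, here is roughly what the relevant pieces of solrconfig.xml look like. Treat this as a minimal sketch, not our exact config; the commented-out cache sizes are illustrative:

<!-- 1) Filter cache disabled: we simply remove/comment out the entry,
     since it was eating too much heap at our document counts -->
<!--
<filterCache class="solr.FastLRUCache" size="512" autowarmCount="0"/>
-->

<!-- 2) Throttle replication so recovering replicas don't saturate CPU -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="defaults">
    <str name="maxWriteMBPerSec">100</str>
  </lst>
</requestHandler>

<!-- 3) and 4) Query timeout (in ms) and shard preference,
     set as defaults on the search handler -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <int name="timeAllowed">2000</int>
    <bool name="preferLocalShards">true</bool>
  </lst>
</requestHandler>

Both timeAllowed and preferLocalShards can also be passed per request as query parameters if you don't want them as handler defaults.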
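And the CPU gate from item 5 boils down to something like the Python sketch below. This is illustrative only: we read CPUUtilization from CloudWatch via boto3, and the instance IDs, queue, and index_fn are hypothetical stand-ins for whatever your indexing pipeline provides.

from datetime import datetime, timedelta

import boto3  # assumes AWS credentials are configured in the environment

CPU_THRESHOLD = 90.0  # queue indexing traffic above this utilization
SOLR_INSTANCE_IDS = ["i-0123456789abcdef0"]  # hypothetical Solr node IDs

cloudwatch = boto3.client("cloudwatch")

def max_solr_cpu(period_minutes=5):
    """Return the highest average CPUUtilization across the Solr nodes
    over the last `period_minutes` minutes."""
    now = datetime.utcnow()
    worst = 0.0
    for instance_id in SOLR_INSTANCE_IDS:
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=now - timedelta(minutes=period_minutes),
            EndTime=now,
            Period=period_minutes * 60,
            Statistics=["Average"],
        )
        for point in stats["Datapoints"]:
            worst = max(worst, point["Average"])
    return worst

def submit_or_queue(doc, index_fn, queue):
    """Send `doc` to Solr unless the cluster is too hot, in which case
    park it on a queue to be replayed later."""
    if max_solr_cpu() > CPU_THRESHOLD:
        queue.append(doc)  # back off and let the replicas catch up
    else:
        index_fn(doc)      # normal path: index immediately

The point is simply to back off before replicas fall behind and trigger the recovery spiral described above.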