My company has several SolrCloud environments. In our most active cloud we are seeing outages related to GC pauses. We have about 10 collections, of which 4 get a lot of traffic. The SolrCloud cluster consists of 4 nodes, each with 6 processors and an 11 GB heap (25 GB physical memory).
I notice that the 4 nodes seem to do their garbage collection at almost the same time, which seems strange to me; I would expect them to be more staggered. This morning we had a GC pause that caused problems. During that time our application service was reporting "No live SolrServers available to handle this request". Between 3:55 and 3:56 AM all 4 nodes were having some amount of garbage collection pauses: for 2 of the nodes it was minor, for one it was 50%. For 3 of the nodes it lasted until 3:57, but the node with the worst impact didn't recover until 4:00 AM. How is it that all 4 nodes were doing GC in lock step? If they all do GC at the same time, it defeats the purpose of having redundant cloud servers. We switched from CMS to G1GC just this past weekend.

During the same period we also saw that traffic to Solr was not well distributed. The application calls Solr using CloudSolrClient, which I thought did its own load balancing (a sketch of how we build the client is at the end of this mail). We saw 10x more traffic going to one Solr node than to all the others, and then we saw it start hitting another node. All Solr queries come from our application.

During this period I saw only one error message in the Solr log:

ERROR (zkConnectionManagerCallback-8-thread-1) [ ] o.a.s.c.ZkController There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props

We are currently using Solr 7.7.2.

GC tuning:

GC_TUNE="-XX:NewRatio=3 \
  -XX:SurvivorRatio=4 \
  -XX:TargetSurvivorRatio=90 \
  -XX:MaxTenuringThreshold=8 \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=250 \
  -XX:+ParallelRefProcEnabled"
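For reference, here is roughly how the application builds and queries the client with SolrJ 7.x. This is a minimal sketch; the ZooKeeper hosts, chroot and collection name are placeholders, and our real code differs in those details, but the shape is the same:

import java.util.Arrays;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SolrClientExample {
    public static void main(String[] args) throws Exception {
        // Client is pointed at ZooKeeper (placeholder hosts/chroot), not at
        // individual Solr nodes, so it reads cluster state from ZK and is
        // supposed to load-balance requests across live replicas itself.
        CloudSolrClient client = new CloudSolrClient.Builder(
                Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"),
                Optional.of("/solr"))
            .build();
        client.setDefaultCollection("products"); // placeholder collection name

        // Typical query path; this is why the 10x skew toward a single node
        // surprised us.
        QueryResponse rsp = client.query(new SolrQuery("*:*"));
        System.out.println("numFound=" + rsp.getResults().getNumFound());

        client.close();
    }
}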