I'm still struggling to find the root cause of this behavior. One thing is consistent: whenever a node restarts and sends a recovery command, the recipient shard/replica goes down due to a sudden surge in old gen heap usage. Within minutes it hits the ceiling and stalls the server, and this keeps going in circles. After moving to 7.5, we decided to switch from CMS to G1. We are using the recommended settings from Shawn's blog.

GC_TUNE="-XX:+UseG1GC \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=250 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
-XX:OnOutOfMemoryError=/mnt/ebs2/solrhome/bin/oom_solr.sh"

Can this be tuned better to avoid the problem? Also, I'm curious to know whether any 7.5 user has experienced a similar scenario. Could there be a major change related to recovery that I might be missing after moving from 6.6?

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html