Greetings all,

After upgrading some of our Solr clouds to Solr 9.8, we've seen increased recovery times and occasional recovery failures on 9.8 clouds with large indexes. Several different exceptions are thrown during these recovery failures, but they all seem to have a shared root cause:
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
        at java.base/java.nio.Bits.reserveMemory(Bits.java:175) ~[?:?]
        at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118) ~[?:?]
        at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317) ~[?:?]
        at org.eclipse.jetty.util.BufferUtil.allocateDirect(BufferUtil.java:133) ~[jetty-util-10.0.22.jar:10.0.22]
        at org.eclipse.jetty.io.ByteBufferPool.newByteBuffer(ByteBufferPool.java:71) ~[jetty-io-10.0.22.jar:10.0.22]
        at org.eclipse.jetty.io.MappedByteBufferPool.acquire(MappedByteBufferPool.java:159) ~[jetty-io-10.0.22.jar:10.0.22]

We've observed that, for certain indexes, when SolrCloud follower nodes are streaming large index files (~5GB) from a shard leader, the underlying Jetty HTTP client can allocate enough direct buffer memory to cause an OOM. In our experience, once a direct buffer OOM happens on a Solr node, the node's recovery process gets stuck in a failure loop until the JVM's direct buffer limit is increased. Moreover, Solr is unable to use any Jetty HTTP clients once the direct buffer memory limit is reached. In one case, we had to raise a node's direct memory limit to over 20GB (using -XX:MaxDirectMemorySize) in order to fully recover a ~200GB index from a shard leader.

Through Solr's JVM metrics, we've observed that hundreds of thousands of direct byte buffers (all around 16KB) are allocated before an OOM occurs. Those same metrics show that Solr nodes use only a few hundred direct buffers during normal operation.

We're still investigating the root cause of this issue, but it has recurred often enough for us to be certain that it's a real problem. The issue seems to pop up most frequently on NRT Solr nodes that are copying 100GB+ indexes, and it almost always happens when those nodes are processing updates while in recovery. However, even in extreme cases these circumstances don't always cause an OOM, and we're not exactly sure why. We think it might have something to do with the structure of the specific index (i.e. a larger number of big segments seems more likely to trigger the issue), but that's mostly just a guess at this point.

It's worth noting that we haven't seen any issues with Jetty during normal operation, just in high-throughput situations where Solr is streaming index files while also reading and writing the TLOG. However, Jetty's direct buffer allocation has caused problems for Solr code in the past. There's already an open Jira for direct buffer memory leaks (https://issues.apache.org/jira/browse/SOLR-17376) and a corresponding Jetty issue (https://github.com/jetty/jetty.project/issues/12084). The OOM problem that we're seeing now is likely a side effect of SOLR-16505 (https://issues.apache.org/jira/browse/SOLR-16505), with the root cause being SOLR-17376.

Within the Solr codebase, we've been able to use JFR to determine that when our OOMs happen, most of Jetty's direct buffer allocations are triggered by IndexFetcher$FileFetcher.fetchPackets. This is the method that actually reads index files from the Jetty HTTP/2 input stream and writes them to a file, which lines up with the observations we've made so far.

We'd appreciate any community feedback on this issue and how it should be handled moving forward. Also, even though our problem statement is still somewhat vague, we'd appreciate any help with independently reproducing our IndexFetcher OOM bug, or any similar direct buffer bugs that stem from using Http2SolrClient. Thanks for your time!
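
P.S. In case it helps anyone trying to reproduce this, here's a rough sketch of one way to watch a Solr node's direct buffer pool from the outside. This is not how we collected our numbers (we used Solr's JVM metrics and JFR); it just polls the standard java.nio:type=BufferPool,name=direct platform MXBean over remote JMX, which should report the same counts. The class name, host/port, and polling interval below are placeholders, and it assumes remote JMX is enabled on the node (e.g. via ENABLE_REMOTE_JMX_OPTS / RMI_PORT in solr.in.sh):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class DirectBufferWatcher {
        public static void main(String[] args) throws Exception {
            // Placeholder endpoint: replace with the Solr node's JMX host and RMI port.
            String url = "service:jmx:rmi:///jndi/rmi://localhost:18983/jmxrmi";
            // Standard platform MXBean for direct ByteBuffers, i.e. the pool
            // that counts against -XX:MaxDirectMemorySize.
            ObjectName direct = new ObjectName("java.nio:type=BufferPool,name=direct");
            try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                while (true) {
                    long count = (Long) conn.getAttribute(direct, "Count");
                    long used = (Long) conn.getAttribute(direct, "MemoryUsed");
                    long capacity = (Long) conn.getAttribute(direct, "TotalCapacity");
                    System.out.printf("direct buffers: count=%d used=%dMB capacity=%dMB%n",
                            count, used / (1024 * 1024), capacity / (1024 * 1024));
                    Thread.sleep(5000);
                }
            }
        }
    }

If the failure mode described above is in play, we'd expect the buffer count to climb into the hundreds of thousands (at roughly 16KB per buffer) while the node is fetching index files, versus a few hundred during normal operation.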