From: dev@solr.apache.org
At: 05/22/25 18:44:38 UTC-4:00
To: dev@solr.apache.org
Subject: Jetty HTTP2 causing Solr Direct Buffer Memory OOMs

Greetings all,
After upgrading some of our Solr clouds to Solr 9.8, we’ve seen increased 
recovery times and occasional recovery failures for 9.8 clouds with large 
indexes. Several different exceptions are thrown during recovery failures, but 
they all seem to have a shared root cause:

Caused by: java.lang.OutOfMemoryError: Direct buffer memory
        at java.base/java.nio.Bits.reserveMemory(Bits.java:175) ~[?:?]
        at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118) ~[?:?]
        at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317) ~[?:?]
        at org.eclipse.jetty.util.BufferUtil.allocateDirect(BufferUtil.java:133) ~[jetty-util-10.0.22.jar:10.0.22]
        at org.eclipse.jetty.io.ByteBufferPool.newByteBuffer(ByteBufferPool.java:71) ~[jetty-io-10.0.22.jar:10.0.22]
        at org.eclipse.jetty.io.MappedByteBufferPool.acquire(MappedByteBufferPool.java:159) ~[jetty-io-10.0.22.jar:10.0.22]

We’ve observed that for certain indexes, when SolrCloud follower nodes are
streaming large index files (~5GB) from a shard leader, the underlying HTTP 
client, Jetty, can allocate enough direct buffer memory to cause an OOM. In our 
experience, once a direct buffer OOM happens on a Solr node, the node’s 
recovery process will be stuck in a failure loop until the JVM’s direct buffer 
limit is increased. Moreover, Solr is unable to use any Jetty HTTP clients once 
the direct buffer memory limit is reached. In one case, we had to raise a 
node’s direct memory limit to over 20GB (using -XX:MaxDirectMemorySize) in 
order to fully recover an index of size ~200GB from a shard leader. Through 
Solr’s JVM metrics, we've observed that hundreds of thousands of direct byte 
buffers (each around 16KB) are allocated before an OOM occurs, whereas during 
normal operation the same nodes use only a few hundred such buffers.
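
For anyone who wants to watch this from the JVM side, the platform BufferPoolMXBean exposes the same underlying "direct" pool that those metrics report on. Below is a minimal, illustrative sketch (not Solr code, and not the metrics path Solr itself uses) for printing the counts we've been watching:

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;

    // Minimal sketch: print stats for the JVM's "direct" buffer pool, the pool that
    // grows into hundreds of thousands of ~16KB buffers before the OOM described above.
    public class DirectBufferStats {
        public static void main(String[] args) {
            for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                if ("direct".equals(pool.getName())) {
                    System.out.printf("direct buffers: count=%d, used=%d bytes, capacity=%d bytes%n",
                            pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
                }
            }
        }
    }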

We’re still investigating the root cause of this issue, but it has recurred 
enough for us to be certain that it’s a problem. The issue seems to pop up most 
frequently with NRT Solr nodes that are copying 100GB+ indexes, and it almost 
always happens when these nodes are processing updates while in recovery. 
However, even in extreme cases, these circumstances don’t always cause an OOM, 
and we’re not exactly sure why. We think that it might have something to do 
with the structure of the specific index (i.e. a higher number of larger 
segments is more likely to trigger the issue), but that’s mostly just a guess 
at this point.

It’s worth noting that we haven’t seen any issues with Jetty during normal 
operation, just in high-throughput situations where Solr is streaming index 
files while reading and writing to the TLOG. However, Jetty’s direct buffer 
allocation issues have caused problems with Solr code in the past. There’s 
already an open Jira for direct buffer memory leaks 
(https://issues.apache.org/jira/browse/SOLR-17376) and a corresponding Jetty 
issue (https://github.com/jetty/jetty.project/issues/12084). The OOM problem 
that we’re seeing now is likely a side effect of SOLR-16505 
(https://issues.apache.org/jira/browse/SOLR-16505), with the root cause being 
SOLR-17376.

Within the Solr codebase, we’ve used JFR to determine that when our OOMs happen, 
most of Jetty’s direct buffer allocations are triggered by 
IndexFetcher$FileFetcher.fetchPackets. This is the method that actually reads 
index files from the Jetty HTTP2 input stream and writes them to a local file, 
which lines up with the observations we’ve made so far.
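
For anyone unfamiliar with that code path, conceptually it is a tight loop that reads packets off the replication response stream and appends them to the local index file. The sketch below is purely illustrative (a simplified stand-in, not the actual IndexFetcher code; the packet size and framing are hypothetical), but it shows the shape of the loop where every read is ultimately served by Jetty's HTTP/2 client and its direct buffer pool:

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Simplified, hypothetical stand-in for a fetchPackets-style loop -- not the real code.
    public class FileFetchSketch {
        private static final int MAX_PACKET = 1024 * 1024; // hypothetical packet size

        static void fetchPackets(InputStream responseStream, OutputStream indexFileOut) throws IOException {
            DataInputStream in = new DataInputStream(responseStream);
            byte[] packet = new byte[MAX_PACKET];
            while (true) {
                int len;
                try {
                    len = in.readInt(); // length of the next packet; <= 0 or EOF ends the transfer
                } catch (EOFException e) {
                    break;
                }
                if (len <= 0 || len > MAX_PACKET) {
                    break;
                }
                in.readFully(packet, 0, len);       // reads are served from the HTTP/2 response,
                indexFileOut.write(packet, 0, len); // which draws on Jetty's direct buffer pool
            }
            indexFileOut.flush();
        }
    }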

We’d appreciate any community feedback on this issue and how it should be 
handled moving forward. Also, even though our problem statement is still 
somewhat vague, we’d appreciate any help with independently reproducing our 
IndexFetcher OOM bug, or any similar direct buffer bugs that stem from using 
Http2SolrClient.
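
As one possible starting point for an independent reproduction (a rough sketch under assumptions, not a confirmed recipe: the base URL, collection name, row count, and iteration count are placeholders, and our actual failures come from replication rather than queries), something along these lines run with a deliberately small -XX:MaxDirectMemorySize might at least show direct buffer growth from Http2SolrClient under load:

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.Http2SolrClient;

    // Rough, unverified sketch: stream large query responses through Http2SolrClient
    // and periodically print the JVM's direct buffer pool stats.
    // Run with e.g. -XX:MaxDirectMemorySize=64m to make growth easier to spot.
    public class DirectBufferRepro {
        public static void main(String[] args) throws Exception {
            try (Http2SolrClient client =
                     new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {
                SolrQuery query = new SolrQuery("*:*");
                query.setRows(100_000); // deliberately large responses (placeholder value)
                for (int i = 0; i < 1_000; i++) {
                    client.query("test_collection", query); // placeholder collection name
                    if (i % 100 == 0) {
                        for (BufferPoolMXBean pool :
                                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                            if ("direct".equals(pool.getName())) {
                                System.out.printf("iteration=%d direct buffers: count=%d, used=%d bytes%n",
                                        i, pool.getCount(), pool.getMemoryUsed());
                            }
                        }
                    }
                }
            }
        }
    }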

Thanks for your time!
