Re: "solr start -e techproducts" failing on MacOS

2025-05-22 Thread Rahul Goswami
Great find, Kevin! That makes sense. I can confirm that starting with
--jvm-opts "-Djava.io.tmpdir=$(cd $TMPDIR; pwd -P)" works too. Also, thanks
for the PR!

I'd say that fixing this by passing "pwd -P" in the start opts in bin/solr
_seems_ like the right way to go. But I am also conflicted on whether
disabling the security manager altogether could be considered as the
solution here (?), especially since it's going away in Java 24 anyway.

Rahul
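
For anyone who wants to see the effect Kevin describes in the quoted message
below, here is a minimal Java sketch. It is not taken from Solr or from the PR.
It uses Path.toRealPath() to resolve symlinks, which has the same effect as
`pwd -P`, and reports whether java.io.tmpdir goes through a symlink. The names
TmpDirSymlinkDemo, physicalPath and involvesSymlink are made up for
illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;

public class TmpDirSymlinkDemo {

    /** Resolves every symlink in the path, similar to `cd "$dir" && pwd -P` in a shell. */
    static Path physicalPath(Path dir) {
        try {
            return dir.toRealPath();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True when the path as written differs from its physical location. */
    static boolean involvesSymlink(Path dir) {
        return !dir.toAbsolutePath().normalize().equals(physicalPath(dir));
    }

    public static void main(String[] args) {
        Path tmp = Path.of(System.getProperty("java.io.tmpdir"));
        System.out.println("java.io.tmpdir  = " + tmp);
        System.out.println("physical path   = " + physicalPath(tmp));
        System.out.println("symlink in path = " + involvesSymlink(tmp));
    }
}
```

If the last line prints true, a policy entry written against
${java.io.tmpdir} names the symlinked path rather than the physical one, which
is the situation the -Djava.io.tmpdir=$(cd $TMPDIR; pwd -P) workaround avoids.
On a case-insensitive file system the comparison may also flag differences in
letter case, so treat the result as a hint only.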

On Wed, May 21, 2025 at 10:07 PM Kevin Risden  wrote:

> The underlying issue is that /tmp is a symlink on Mac. Java security
> manager permissions need the ability to read the symlink AND the underlying
> directory. Since we only have
>
> permission java.io.FilePermission "${java.io.tmpdir}", "read,write";
> permission java.io.FilePermission "${java.io.tmpdir}${/}-", "read,write,delete";
>
> in security.policy, the permission by default covers just the symlink.
>
> We actually do similar fixes in our bin/solr script already using `pwd -P`
> to ensure that we don't have symlinks in the path. See SOLR-16457 /
> https://github.com/apache/solr/pull/1282
>
> an example that works:
>
> ./bin/solr start -f --jvm-opts "-Djava.io.tmpdir=$(cd $TMPDIR; pwd -P)"
>
> This takes the existing TMPDIR environment variable and forces
> `java.io.tmpdir` to not be a symlink anymore using the `pwd -P` expansion.
>
> This can be done in bin/solr as well if we want and set tmpdir in say
> SOLR_START_OPTS. Here is a PR to show how this could be done
> https://github.com/apache/solr/pull/3359
>
> As you already found out you can also just disable the security manager -
> here is a one liner that doesn't require changing any files either.
>
> SOLR_SECURITY_MANAGER_ENABLED=false ./bin/solr start -f
>
> As a final note, this is not new to Jetty 12 but has been an issue in the
> past - see https://issues.apache.org/jira/browse/SOLR-17542. How it's
> popping up now might be new, with Jetty 12 doing something with the temp
> directory, but there are other ways to hit it.
>
> Kevin Risden
>
>
> On Wed, May 21, 2025 at 5:52 PM Rahul Goswami 
> wrote:
>
> > That worked. Thanks Christos!
> >
> > On Wed, May 21, 2025 at 5:30 PM Christos Malliaridis <
> > malliari...@apache.org>
> > wrote:
> >
> > > I have faced the same issue recently.
> > >
> > > > There is a configuration option in bin/solr.in.sh for the security
> > > > manager:
> > >
> > > #SOLR_SECURITY_MANAGER_ENABLED=true
> > >
> > > Removing the comment and setting it to false worked for me.
> > >
> > > > On Wed, 21 May 2025, 23:19 Rahul Goswami wrote:
> > > >
> > > > > Sanjay,
> > > > > Thanks for looking into this. I also tried disabling the security
> > > > > manager on MacOS by running "solr start -e techproducts
> > > > > -Djdk.security.manager=disallow" and still see the same behavior with
> > > > > the same stacktrace in log.
> > > >
> > > > -Rahul
> > > >
> > > > On Tue, May 20, 2025 at 11:55 PM sanjay dutt <
> > > > sanjaydutt.unoffic...@gmail.com> wrote:
> > > >
> > > > > > Recently I merged changes related to jetty upgrade. I will look
> > > > > > into it.
> > > > > > https://github.com/apache/solr/pull/2876
> > > > > >
> > > > > > On Wed, May 21, 2025 at 5:44 AM Rahul Goswami <rahul196...@gmail.com>
> > > > > > wrote:
> > > > >
> > > > > > For additional context, this is working fine on Windows. Failing
> > > > > > consistently on MacOS.
> > > > > >
> > > > > > Thanks,
> > > > > > Rahul
> > > > > >
> > > > > >
> > > > > > > On Tue, May 20, 2025 at 5:32 PM Rahul Goswami <rahul196...@gmail.com>
> > > > > > > wrote:
> > > > > >
> > > > > > > > I checked out main, and then ran "gradlew dev". Tried running the
> > > > > > > > techproducts example, but it seems to get into some exception with
> > > > > > > > initializing the context (hitting 503 error).
> > > > > > > >
> > > > > > > >
> > > > > > > > Logs are complaining about AccessControlException while trying to
> > > > > > > > access a temp location. I have tried the example flow in the past and
> > > > > > > > never hit this permission issue. Needless to say, bats tests are
> > > > > > > > failing too.
> > > > > > > >
> > > > > > > >
> > > > > > > > Is anybody else seeing this too? Bad commit or an environmental
> > > > > > > > issue? Thanks for any insights.
> > > > > > >
> > > > > > >
> > > > > > > rahulgoswami@MacBookPro bin % ./solr start -e techproducts
> > > > > > >
> > > > > > > *** [WARN] ***  Your Max Processes Limit is currently 2784.
> > > > > > >
> > > > > > >  It should be set to 65000 to avoid operational disruption.
> > > > > > >
> > > > > > > >  If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to
> > > > > > > > false in your profile or solr.in.sh
> > > > > > > >
> > > > > > > >
> > > > > > > > Starting up Solr on port 8983 using command:
> > > > > > > >
> > > > > > > >
> > > > > > > > "/Users/rahulgoswami/Desktop/OpenSource_Repos/Solr-RG/solr/solr/packaging/build/dev/bin/solr"

Re: "solr start -e techproducts" failing on MacOS

2025-05-22 Thread Kevin Risden
It's not just a main / Jetty 12 issue. This should be backported to 9.x as
well, where the security manager isn't going away.

Kevin Risden



Re: SolrJ timeouts

2025-05-22 Thread Chris Hostetter



: Then I noticed something insidious:  If an Http2SolrClient (Jetty) is
: created using the builder's withHttpClient(...) method, then the idle (aka
: socket) timeout & connection timeout (and basically any other setting
: that's inside the Jetty HttpClient) cannot *actually* be customized
: henceforth.  This should be pretty obvious, when you think about it (duh).

Deja Vu...

https://issues.apache.org/jira/browse/SOLR-13605

...except now we have the reverse problem?



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
For additional commands, e-mail: dev-h...@solr.apache.org



Jetty HTTP2 causing Solr Direct Buffer Memory OOMs

2025-05-22 Thread Jude Muriithi (BLOOMBERG/ 919 3RD A)
Greetings all,

After upgrading some of our Solr clouds to Solr 9.8, we’ve seen increased
recovery times and occasional recovery failures for 9.8 clouds with large
indexes. Several different exceptions are thrown during recovery failures, but
they all seem to have a shared root cause:

Caused by: java.lang.OutOfMemoryError: Direct buffer memory
    at java.base/java.nio.Bits.reserveMemory(Bits.java:175) ~[?:?]
    at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118) ~[?:?]
    at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317) ~[?:?]
    at org.eclipse.jetty.util.BufferUtil.allocateDirect(BufferUtil.java:133) ~[jetty-util-10.0.22.jar:10.0.22]
    at org.eclipse.jetty.io.ByteBufferPool.newByteBuffer(ByteBufferPool.java:71) ~[jetty-io-10.0.22.jar:10.0.22]
    at org.eclipse.jetty.io.MappedByteBufferPool.acquire(MappedByteBufferPool.java:159) ~[jetty-io-10.0.22.jar:10.0.22]

We’ve observed that for certain indexes, when SolrCloud follower nodes are
streaming large index files (~5GB) from a shard leader, the underlying HTTP
client, Jetty, can allocate enough direct buffer memory to cause an OOM. In our
experience, once a direct buffer OOM happens on a Solr node, the node’s
recovery process will be stuck in a failure loop until the JVM’s direct buffer
limit is increased. Moreover, Solr is unable to use any Jetty HTTP clients once
the direct buffer memory limit is reached. In one case, we had to raise a
node’s direct memory limit to over 20GB (using -XX:MaxDirectMemorySize) in
order to fully recover an index of size ~200GB from a shard leader. Through
Solr’s JVM metrics, we've observed that hundreds of thousands of direct byte
buffers (all around 16KB) are being allocated before an OOM occurs. Through
those same metrics, we've observed that Solr nodes only use hundreds of buffers
during normal operation.

We’re still investigating the root cause of this issue, but it has recurred
enough for us to be certain that it’s a problem. The issue seems to pop up most
frequently with NRT Solr nodes that are copying 100GB+ indexes, and it almost
always happens when these nodes are processing updates while in recovery.
However, even in extreme cases, these circumstances don’t always cause an OOM,
and we’re not exactly sure why. We think that it might have something to do
with the structure of the specific index (i.e. a higher number of larger
segments is more likely to trigger the issue), but that’s mostly just a guess
at this point.

It’s worth noting that we haven’t seen any issues with Jetty during normal
operation, just in high-throughput situations where Solr is streaming index
files while reading and writing to the TLOG. However, Jetty’s direct buffer
allocation issues have caused problems with Solr code in the past. There’s
already an open Jira for direct buffer memory leaks
(https://issues.apache.org/jira/browse/SOLR-17376) and a corresponding Jetty
issue (https://github.com/jetty/jetty.project/issues/12084). The OOM problem
that we’re seeing now is likely a side effect of SOLR-16505
(https://issues.apache.org/jira/browse/SOLR-16505), with the root cause being
SOLR-17376.

Within the Solr codebase, we’ve been able to use JFR to determine that when our
OOMs happen, most of Jetty’s direct buffer allocations are being triggered by
IndexFetcher$FileFetcher.fetchPackets. This is the method that actually reads
index files from the Jetty HTTP2 input stream and writes them to a file, which
lines up with the observations we’ve made so far.

We’d appreciate any community feedback on this issue and how it should be
handled moving forward. Also, even though our problem statement is currently
kinda vague, we’d appreciate any help with independently reproducing our
IndexFetcher OOM bug, or any similar direct buffer bugs which stem from using
Http2SolrClient.

Thanks for your time!
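
For anyone trying to reproduce this, a minimal sketch of watching the two
numbers mentioned above (direct buffer count and bytes in use) from inside a
JVM. It assumes the standard java.lang.management.BufferPoolMXBean named
"direct" reflects the allocations in question, which the stack trace suggests,
since Jetty ends up in ByteBuffer.allocateDirect. The class name
DirectBufferStats and the loop of 100 buffers are only for illustration and
come from neither Solr nor Jetty.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectBufferStats {

    /** Returns the JVM's "direct" buffer pool bean, which tracks ByteBuffer.allocateDirect usage. */
    static BufferPoolMXBean directPool() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool;
            }
        }
        throw new IllegalStateException("no 'direct' buffer pool exposed by this JVM");
    }

    public static void main(String[] args) {
        BufferPoolMXBean pool = directPool();
        long countBefore = pool.getCount();

        // Hold references so the buffers stay allocated while the counters are read.
        List<ByteBuffer> held = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            held.add(ByteBuffer.allocateDirect(16 * 1024)); // 16KB, the size reported above
        }

        System.out.printf("direct buffers: %d -> %d, bytes in use: %d (holding %d)%n",
            countBefore, pool.getCount(), pool.getMemoryUsed(), held.size());
    }
}
```

Polling directPool().getCount() during a recovery would show whether the count
climbs toward the hundreds of thousands described above or stays in the
hundreds.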

Re: Jetty HTTP2 causing Solr Direct Buffer Memory OOMs

2025-05-22 Thread Jude Muriithi (BLOOMBERG/ 919 3RD A)
For some reason the paragraphs of this post got smashed together after I 
submitted it to the list. My apologies!

From: dev@solr.apache.org At: 05/22/25 18:44:38 UTC-4:00 To: dev@solr.apache.org
Subject: Jetty HTTP2 causing Solr Direct Buffer Memory OOMs




Re: SolrJ timeouts

2025-05-22 Thread Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A)
Hi David,

I added the test you suggested here: https://github.com/apache/solr/pull/3362

Surprisingly, they pass. I think it is because, although the idleTimeout set on
the Jetty HttpClient passed into the Http2SolrClient remains unchanged, the
Http2SolrClient-level timeout is still enforced via Jetty's
InputStreamResponseListener::get, which effectively works like an idle timeout:

https://github.com/apache/solr/blob/b91905a653ead168cea9de2a1f1791c76fb17882/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L517

I got somewhat paranoid and added the SlowStream test servlet to make sure that 
both timeouts are respected when they are overridden. As far as I can tell, 
they are.

I will admit that it still feels weird to pass an idleTimeout into a 
Http2SolrClient and then *not* propagate it to the underlying Jetty HttpClient. 
Also, there could be other properties that don't work as nicely as these two 
when overridden on an Http2SolrClient that takes in a Jetty HttpClient. But I 
haven't given that a deeper look.

Thanks,
Luke


From: dev@solr.apache.org At: 05/19/25 16:32:09 UTC-4:00 To: dev@solr.apache.org
Cc: Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A), stilla...@apache.org,
gus.h...@gmail.com
Subject: Re: SolrJ timeouts

My annotated draft PR:
https://github.com/apache/solr/pull/3357
Doesn't solve the issue but its exception throwing means many tests fail
since those places in Solr were trying to both use an existing HttpClient
_and_ set timeouts.  We can't do that and we *weren't* doing that!  'Course,
if I'm missing something obvious, let me know!

> I guess my attempt to fix it (https://github.com/apache/solr/pull/3356)
> is doomed to fail as it currently stands?

Agreed.  But let's see:  Maybe a test would be: configure a super low idle
timeout on an Http2SolrClient that's constructed manually.  Then create a
new Http2SolrClient from the first using withHttpClient and set a
reasonable timeout.  We should be able to do a request that takes time
(like a Solr request referencing "sleep(5000)" in a function query; see
features-slow.json for such an excerpt).  I believe we'll sadly have a
timeout because the idle timeout customization isn't honored when
withHttpClient is used.  Such a test would go in Http2SolrClientTest.

On Mon, May 19, 2025 at 9:45 AM David Smiley  wrote:

> This weekend I was harmonizing some of the timeouts between JDK & Jetty
> based HTTP SolrClient implementations.  I saw some inconsistencies,
> duplication, and the need for javadocs to explain what these timeouts
> fundamentally mean.  I have some nice WIP code I was looking forward to
> sharing by now.
>
> Then I noticed something insidious:  If an Http2SolrClient (Jetty) is
> created using the builder's withHttpClient(...) method, then the idle (aka
> socket) timeout & connection timeout (and basically any other setting
> that's inside the Jetty HttpClient) cannot *actually* be customized
> henceforth.  This should be pretty obvious, when you think about it (duh).
> But we don't do any checks in construction to alert you, plus we
> redundantly put some of these settings on the Http2SolrClient (incl. base
> class) which fools any onlooker into thinking it might do something.  My
> WIP addresses this by throwing an IllegalArgumentException on build() if
> you try to set something that won't take effect.  This revealed a number of
> prominent places in Solr that both call withHttpClient and also customize
> the timeouts (to no avail):  Overseer, SolrClientCache (for streaming
> expressions), IndexFetcher, and some recovery stuff; maybe I missed
> something.  This solution won't work.
>
> Perhaps withHttpClient should not reuse the underlying HttpClient?  It's a
> shame to not reuse it but it's at least the safest choice.  In my opinion,
> withHttpClient doesn't have an ideal name either; it should be
> withHttpSolrClient.  Its current name is ambiguous; I hate confusing
> HttpSolrClient with HttpClient.  And if we copy configuration from the
> client but don't actually use it, then maybe we should name this method
> copyConfigurationFrom or withConfigurationFrom.
>
> I'll throw up a draft PR today that does the cleanup/harmonization but
> doesn't yet solve the above conundrum.
>
> (CC'ed some folks that worked on timeouts)
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
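
The build()-time check described in the quoted message above (throwing an
IllegalArgumentException when a setting would not take effect) can be sketched
in isolation as follows. This is a toy model, not SolrJ code: Transport, Client,
Builder, withSharedTransport and withIdleTimeout are hypothetical stand-ins for
the Jetty HttpClient, Http2SolrClient and its builder, and the 120 second
default is an arbitrary placeholder.

```java
import java.time.Duration;

public class SharedTransportBuilderSketch {

    /** Hypothetical stand-in for a pre-configured, shared low-level HTTP client. */
    static final class Transport {
        final Duration idleTimeout;
        Transport(Duration idleTimeout) { this.idleTimeout = idleTimeout; }
    }

    /** Hypothetical stand-in for the higher-level client that build() produces. */
    static final class Client {
        final Transport transport;
        Client(Transport transport) { this.transport = transport; }
    }

    static final class Builder {
        private Transport shared;     // set when an existing transport is reused
        private Duration idleTimeout; // set when the caller asks for a custom timeout

        Builder withSharedTransport(Transport t) { this.shared = t; return this; }
        Builder withIdleTimeout(Duration d) { this.idleTimeout = d; return this; }

        Client build() {
            if (shared != null && idleTimeout != null) {
                // The shared transport already carries its own timeout, so the
                // requested value would be silently ignored. Fail fast instead.
                throw new IllegalArgumentException(
                    "idleTimeout cannot be customized when reusing an existing transport");
            }
            if (shared != null) {
                return new Client(shared);
            }
            // Placeholder default, chosen only for this sketch.
            Duration effective = idleTimeout != null ? idleTimeout : Duration.ofSeconds(120);
            return new Client(new Transport(effective));
        }
    }

    public static void main(String[] args) {
        Transport shared = new Transport(Duration.ofSeconds(1));
        try {
            new Builder().withSharedTransport(shared).withIdleTimeout(Duration.ofSeconds(30)).build();
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A check like this makes the conflict visible at construction time, which is
also why it surfaces every existing caller that sets both, as the thread notes.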