herley-shaori commented on PR #15824:
URL: https://github.com/apache/lucene/pull/15824#issuecomment-4068066208

   > The other place this may come from: Are you seeing this with the 
shared arenas enabled or disabled? The check can only be elided if you do NOT 
reopen segments all the time. In OpenSearch there were serious problems with 
your "index liveness checks"; remember the issues where you ran out of file 
handles/mappings, see #15054.
   > 
   > The problem is that whenever you close a file with a shared arena, it 
causes a safepoint in the JVM, killing the optimizations in the top frame of 
all threads. So please make sure to use READ_ONCE for accessing metadata files 
in OpenSearch.
   > 
   > In our benchmarks we do not see those issues, because the checks are 
correctly elided, unless you close files and are not using the default arena 
grouping behaviour of Lucene. Elasticsearch does not see the problem, and 
neither do our own benchmarks.
   > 
   > Can you possibly rerun a benchmark of your index with refresh of index 
disabled temporarily (so disabling NRT)?
   
   Thanks for the detailed analysis. These are excellent questions, but I 
should be transparent — I'm not the original issue reporter, and I don't run 
the OpenSearch environment where the regression was observed. I explored the 
code path and attempted a fix based on the analysis in #15820, but I can't 
answer questions about OpenSearch's arena configuration, segment lifecycle, or 
liveness check behavior.
   
   These questions about shared arenas, READ_ONCE for metadata files, and the 
NRT refresh interaction would be best directed at @navneet1v, @msfroh, and 
@jainankitk, who are closer to the OpenSearch side and can reproduce with the 
specific configurations you're asking about (shared vs non-shared arenas, 
refresh disabled, etc.).
   
   The safepoint theory is very compelling, though. If frequent segment closes 
on shared arenas are causing safepoint storms that prevent HotSpot from eliding 
checkValidStateRaw(), that would explain both why the nightly benchmarks 
(stable segments, no NRT churn) don't see the issue and why the OpenSearch 
workload (400 segments, refresh_interval=1s, constant segment turnover) does.
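   
   For anyone following along, here is a minimal sketch of the two arena 
lifecycles being contrasted, using only the JDK's java.lang.foreign API 
(Java 22+). It is not Lucene's code: the class ArenaSketch and both method 
names are hypothetical, and the idea that a read-once access maps onto a 
confined arena is my assumption about the intent behind READ_ONCE. The sketch 
only shows the API-level difference; the safepoint cost of closing a shared 
arena is the claim from the quoted comment and is not measured here.

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical demo class (not Lucene code). Requires Java 22 or newer. */
public class ArenaSketch {

    /**
     * Read-once pattern: map the file with a confined arena, consume it on the
     * calling thread, and close the arena right away. A confined arena is owned
     * by a single thread, so closing it needs no coordination with other threads.
     */
    static long readOnceSum(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ);
             Arena arena = Arena.ofConfined()) {
            MemorySegment seg =
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            long sum = 0;
            for (long i = 0; i < seg.byteSize(); i++) {
                sum += seg.get(ValueLayout.JAVA_BYTE, i);
            }
            return sum;
        }
    }

    /**
     * Long-lived pattern: map the file with a shared arena so any thread may
     * read it. Closing a shared arena has to invalidate the segment for all
     * threads; this is the close operation the quoted comment warns about.
     * Returns whether the segment is still alive after the close.
     */
    static boolean aliveAfterSharedClose(Path file) throws IOException {
        Arena shared = Arena.ofShared();
        MemorySegment seg;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), shared);
        } finally {
            // In a real reader this close would happen much later, when the
            // segment is dropped; it is done immediately here to keep the
            // sketch short.
            shared.close();
        }
        return seg.scope().isAlive();
    }
}
```

   If the theory holds, a workload that mostly goes through the first pattern 
for short-lived metadata reads would close far fewer shared arenas than one 
that goes through the second pattern on every refresh.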


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

