herley-shaori commented on PR #15824: URL: https://github.com/apache/lucene/pull/15824#issuecomment-4068066208
> The other problem where this may come from: Are you seeing this with the shared arenas enabled or disabled? The check can only be elided if you do NOT reopen segments all the time. In OpenSearch there were serious problems with your "index liveness checks", remember the issues where you run out of file handles/mappings, see #15054.
>
> The problem is whenever you close a file with a shared arena it causes a safepoint in the JVM, killing the optimizations in the top frame of all threads. So please make sure to use READ_ONCE for accessing metadata files in OpenSearch.
>
> In our benchmarks we do not see those issues, because the checks are correctly elided, unless you close files and are not using the default arena grouping behaviour of Lucene. Elasticsearch does not see the problem and also not our own benchmarks.
>
> Can you possibly rerun a benchmark of your index with refresh of index disabled temporarily (so disabling NRT)?

Thanks for the detailed analysis. These are excellent questions, but I should be transparent: I'm not the original issue reporter, and I don't run the OpenSearch environment where the regression was observed. I explored the code path and attempted a fix based on the analysis in #15820, but I can't answer questions about OpenSearch's arena configuration, segment lifecycle, or liveness check behavior.

The questions about shared arenas, READ_ONCE for metadata files, and the NRT refresh interaction would be best directed at @navneet1v, @msfroh, and @jainankitk, who are closer to the OpenSearch side and can reproduce with the specific configurations you're asking about (shared vs. non-shared arenas, refresh disabled, etc.).

The safepoint theory is very compelling, though: if frequent segment closes on shared arenas are causing safepoint storms that prevent HotSpot from eliding checkValidStateRaw(), that would explain both why the nightly benchmarks (stable segments, no NRT churn) don't see the issue and why the OpenSearch workload (400 segments, refresh_interval=1s, constant segment turnover) does.
