[ https://issues.apache.org/jira/browse/LUCENE-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lucifer Boice updated LUCENE-10519: ----------------------------------- Summary: Improvement for CloseableThreadLocal (was: ThreadLocal.remove under G1GC takes 100% CPU) > Improvement for CloseableThreadLocal > ------------------------------------ > > Key: LUCENE-10519 > URL: https://issues.apache.org/jira/browse/LUCENE-10519 > Project: Lucene - Core > Issue Type: Bug > Components: core/other > Affects Versions: 8.9, 8.10.1, 8.11.1 > Environment: Elasticsearch v7.16.0 > OpenJDK v11 > Reporter: Lucifer Boice > Priority: Critical > Labels: CloseableThreadLocal > Time Spent: 2h 40m > Remaining Estimate: 0h > > h2. Problem > ---- > {*}org.apache.lucene.util.CloseableThreadLocal{*}(which is using > {*}ThreadLocal<WeakReference<T>>{*}) may still have a flaw under G1GC. There > is a single ThreadLocalMap stored for each thread, which all ThreadLocals > share, and that master map only periodically purges stale entries. When we > close a CloseableThreadLocal, we only take care of the current thread right > now, others will be taken care of via the WeakReferences. Under G1GC, the > WeakReferences of other threads may not be recycled even after several rounds > of mix-GC. The ThreadLocalMap may grow very large, it can take an arbitrarily > long amount of CPU and time to iterate the things you had stored in it. > Hot thread of elasticsearch: > {code:java} > ::: > {xxxxxxxxx}{lCj7LcVnT328KHcJRd57yg}{WPiNCbk0R0SIKxg4-w3wew}{xxxxxxxx}{xxxxxxxx} > Hot threads at 2020-04-12T05:25:10.224Z, interval=500ms, busiestThreads=3, > ignoreIdleThreads=true: > > 105.3% (526.5ms out of 500ms) cpu usage by thread > 'elasticsearch[xxxxxxxx][bulk][T#31]' > 10/10 snapshots sharing following 34 elements > > java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:627) > java.lang.ThreadLocal$ThreadLocalMap.remove(ThreadLocal.java:509) > java.lang.ThreadLocal$ThreadLocalMap.access$200(ThreadLocal.java:308) > java.lang.ThreadLocal.remove(ThreadLocal.java:224) > > java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(ReentrantReadWriteLock.java:426) > > java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(AbstractQueuedSynchronizer.java:1349) > > java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(ReentrantReadWriteLock.java:881) > > org.elasticsearch.common.util.concurrent.ReleasableLock.close(ReleasableLock.java:49) > > org.elasticsearch.index.engine.InternalEngine.$closeResource(InternalEngine.java:356) > > org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:1272) > org.elasticsearch.index.shard.IndexShard.delete(IndexShard.java:812) > > org.elasticsearch.index.shard.IndexShard.applyDeleteOperation(IndexShard.java:779) > > org.elasticsearch.index.shard.IndexShard.applyDeleteOperationOnReplica(IndexShard.java:750) > > org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:623) > > org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:577) > {code} > h2. Solution > ---- > This bug does not reproduce under CMS. It can be reproduced under G1GC always. > In fact, *CloseableThreadLocal* doesn't need to store entry twice in the > hardRefs And ThreadLocals. Remove ThreadLocal from CloseableThreadLocal so > that we would not be affected by the serious flaw of Java's built-in > ThreadLocal. > h2. See also > ---- > [https://github.com/elastic/elasticsearch/issues/56766] > [https://bugs.openjdk.java.net/browse/JDK-8182982] > [https://discuss.elastic.co/t/indexing-performance-degrading-over-time/40229/44] > -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org