Boicehuang created LUCENE-10519:
-----------------------------------

             Summary: ThreadLocal.remove under G1GC takes 100% CPU
                 Key: LUCENE-10519
                 URL: https://issues.apache.org/jira/browse/LUCENE-10519
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/other
    Affects Versions: 8.11.1, 8.10.1, 8.9
         Environment: Elasticsearch v7.16.0

OpenJDK v11
            Reporter: Boicehuang


h2. Problem

 
----
There is a single ThreadLocalMap stored for each thread, which all ThreadLocals 
share, and that master map only periodically purges stale entries. When we 
close a CloseableThreadLocal, we only take care of the current thread right 
now, others will be
taken care of via the WeakReferences. Under G1GC, the WeakReferences of other 
threads may not be reclaimed even after several rounds of mix-GC. The 
ThreadLocalMap may grow very large, it can take an arbitrarily long amount of 
CPU and time to iterate the things you had stored in it.

Hot thread of elasticsearch:

 
{code:java}
::: 
{xxxxxxxxx}{lCj7LcVnT328KHcJRd57yg}{WPiNCbk0R0SIKxg4-w3wew}{xxxxxxxx}{xxxxxxxx}
   Hot threads at 2020-04-12T05:25:10.224Z, interval=500ms, busiestThreads=3, 
ignoreIdleThreads=true:
   
   105.3% (526.5ms out of 500ms) cpu usage by thread 
'elasticsearch[xxxxxxxx][bulk][T#31]'
     10/10 snapshots sharing following 34 elements
       
java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:627)
       java.lang.ThreadLocal$ThreadLocalMap.remove(ThreadLocal.java:509)
       java.lang.ThreadLocal$ThreadLocalMap.access$200(ThreadLocal.java:308)
       java.lang.ThreadLocal.remove(ThreadLocal.java:224)
       
java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(ReentrantReadWriteLock.java:426)
       
java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(AbstractQueuedSynchronizer.java:1349)
       
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(ReentrantReadWriteLock.java:881)
       
org.elasticsearch.common.util.concurrent.ReleasableLock.close(ReleasableLock.java:49)
       
org.elasticsearch.index.engine.InternalEngine.$closeResource(InternalEngine.java:356)
       
org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:1272)
       org.elasticsearch.index.shard.IndexShard.delete(IndexShard.java:812)
       
org.elasticsearch.index.shard.IndexShard.applyDeleteOperation(IndexShard.java:779)
       
org.elasticsearch.index.shard.IndexShard.applyDeleteOperationOnReplica(IndexShard.java:750)
       
org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:623)
       
org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:577)
 {code}
 
h2. Solution
----
This bug does not reproduce under CMS. It can be reproduced always.

In fact we don't need to store entry twice in the hardRefs And ThreadLocals. 
Remove ThreadLocal from CloseableThreadLocal so that we would not be affected 
by the serious flaw of Java's built-in ThreadLocal. I would like to provide a 
patch to fix it if needed.
h2. Related issues
----
[https://github.com/elastic/elasticsearch/issues/56766]

[https://discuss.elastic.co/t/indexing-performance-degrading-over-time/40229/44]

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to