[ 
https://issues.apache.org/jira/browse/LUCENE-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528621#comment-17528621
 ] 

Jaison.Bi commented on LUCENE-10519:
------------------------------------

Thanks for the comment, [~sokolov] . I try to explain the issue further. 

Each read thread should has own StoredFieldsReader, I think that's why 
CloseableThreadLocal/ThreadLocal is used.

When one CloseableThreadLocal instance is created, it can be used by multiple 
threads. 

Suppose threads \{Thread-1, Thread-2, Thread-3, Thread-4} are using same 
CloseableThreadLocal instance, each thread set own value.

So each thread will store the ThreadLocal instance(t) into it's ThreadLocalMap:

    Thread-1: \{ThreadLocalMap: [t]}
    Thread-2: \{ThreadLocalMap: [t]}
    Thread-3: \{ThreadLocalMap: [t]}
    Thread-4: \{ThreadLocalMap: [t]}

Suppose CloseableThreadLocal instance got closed by Thread-1. Only Thread-1 
removed the ThreadLocal instance from ThreadLocalMap.  
    
    Thread-1: \{ThreadLocalMap: []}
    Thread-2: \{ThreadLocalMap: [t]}
    Thread-3: \{ThreadLocalMap: [t]}
    Thread-4: \{ThreadLocalMap: [t]}
    
The other 3 threads only rely on GC reclaim them. Under G1 GC, the problem gets 
worse, since each mixed GC only collect partial old regions.
So ThreadLocalMap entries table may get very huge. That's why  
"ReentrantReadWriteLock$ReadLock.unlock" took long time(It need to take long 
time to expunge stale entries from a huge table)

Regarding on the patch, we keep the similar behavior to ThreadLocal, but avoid 
use ThreadLocal anymore.

> Improvement for CloseableThreadLocal
> ------------------------------------
>
>                 Key: LUCENE-10519
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10519
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/other
>    Affects Versions: 8.9, 8.10.1, 8.11.1
>         Environment: Elasticsearch v7.16.0
> OpenJDK v11
>            Reporter: Lucifer Boice
>            Priority: Critical
>              Labels: CloseableThreadLocal
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> h2. Problem
> ----
> {*}org.apache.lucene.util.CloseableThreadLocal{*}(which is using 
> {*}ThreadLocal<WeakReference<T>>{*}) may still have a flaw under G1GC. There 
> is a single ThreadLocalMap stored for each thread, which all ThreadLocals 
> share, and that master map only periodically purges stale entries. When we 
> close a CloseableThreadLocal, we only take care of the current thread right 
> now, others will be taken care of via the WeakReferences. Under G1GC, the 
> WeakReferences of other threads may not be recycled even after several rounds 
> of mix-GC. The ThreadLocalMap may grow very large, it can take an arbitrarily 
> long amount of CPU and time to iterate the things you had stored in it.
> Hot thread of elasticsearch:
> {code:java}
> ::: 
> {xxxxxxxxx}{lCj7LcVnT328KHcJRd57yg}{WPiNCbk0R0SIKxg4-w3wew}{xxxxxxxx}{xxxxxxxx}
>    Hot threads at 2020-04-12T05:25:10.224Z, interval=500ms, busiestThreads=3, 
> ignoreIdleThreads=true:
>    
>    105.3% (526.5ms out of 500ms) cpu usage by thread 
> 'elasticsearch[xxxxxxxx][bulk][T#31]'
>      10/10 snapshots sharing following 34 elements
>        
> java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:627)
>        java.lang.ThreadLocal$ThreadLocalMap.remove(ThreadLocal.java:509)
>        java.lang.ThreadLocal$ThreadLocalMap.access$200(ThreadLocal.java:308)
>        java.lang.ThreadLocal.remove(ThreadLocal.java:224)
>        
> java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(ReentrantReadWriteLock.java:426)
>        
> java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(AbstractQueuedSynchronizer.java:1349)
>        
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(ReentrantReadWriteLock.java:881)
>        
> org.elasticsearch.common.util.concurrent.ReleasableLock.close(ReleasableLock.java:49)
>        
> org.elasticsearch.index.engine.InternalEngine.$closeResource(InternalEngine.java:356)
>        
> org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:1272)
>        org.elasticsearch.index.shard.IndexShard.delete(IndexShard.java:812)
>        
> org.elasticsearch.index.shard.IndexShard.applyDeleteOperation(IndexShard.java:779)
>        
> org.elasticsearch.index.shard.IndexShard.applyDeleteOperationOnReplica(IndexShard.java:750)
>        
> org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:623)
>        
> org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:577)
>  {code}
> h2. Solution
> ----
> This bug does not reproduce under CMS. It can be reproduced under G1GC always.
> In fact, *CloseableThreadLocal* doesn't need to store entry twice in the 
> hardRefs And ThreadLocals. Remove ThreadLocal from CloseableThreadLocal so 
> that we would not be affected by the serious flaw of Java's built-in 
> ThreadLocal. 
> h2. See also
> ----
> [https://github.com/elastic/elasticsearch/issues/56766]
> [https://bugs.openjdk.java.net/browse/JDK-8182982]
> [https://discuss.elastic.co/t/indexing-performance-degrading-over-time/40229/44]
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to