luyuncheng opened a new pull request, #11987:
URL: https://github.com/apache/lucene/pull/11987

   ### Description
   We have an ES cluster (31 GB heap, 96 GB memory, 30 instance nodes) with many shards per node (about 4000 per node). When the nodes serve many bulk and search requests concurrently, JVM memory usage climbs and cannot be reclaimed, even with frequent GCs and after stopping all write/search requests. We have to restart the node to recover the heap, as the following GC metrics show:
   
![image](https://user-images.githubusercontent.com/12760367/204531778-0c8e24ce-a927-492c-a173-cb2905a43c41.png)
   
   The heap dump shows that `CompressingStoredFieldsReader` occupied 70% of the heap:
   
![image](https://user-images.githubusercontent.com/12760367/204548626-3cfe59b0-f007-4695-802e-0ed542f8f4a5.png)
   
   The paths from these readers to GC roots look as follows (possibly in search or write threads):
   
![image](https://user-images.githubusercontent.com/12760367/204550346-21a7b219-2051-4333-910d-27138def8f3b.png)
   
   ### Root cause
   I think the root cause is that these thread-locals hold the referent: `SegmentReader#getFieldsReader` calls the following code, and Elasticsearch always uses fixed thread pools and never __calls `CloseableThreadLocal#purge`__.
   
   In `lucene/core/src/java/org/apache/lucene/index/SegmentCoreReaders.java`, `fieldsReaderLocal` is defined as:

   ```java
     final CloseableThreadLocal<StoredFieldsReader> fieldsReaderLocal =
         new CloseableThreadLocal<StoredFieldsReader>() {
           @Override
           protected StoredFieldsReader initialValue() {
             return fieldsReaderOrig.clone();
           }
         };
   ```
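
   The retention pattern above can be sketched with a plain `ThreadLocal` (hypothetical names; `CloseableThreadLocal` additionally keeps a hard-reference map, but the effect with a fixed pool is the same): a per-thread value created on a pooled thread stays strongly reachable for as long as that thread lives, so nothing is freed when a task finishes.

   ```java
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.TimeUnit;

   // Minimal sketch: each fixed-pool thread materializes its own copy on first
   // access, mirroring how fieldsReaderLocal clones one StoredFieldsReader per
   // thread; the copies stay referenced until the thread (or pool) dies.
   public class ThreadLocalRetention {
     static final ThreadLocal<byte[]> perThreadBuffer =
         ThreadLocal.withInitial(() -> new byte[1024 * 1024]); // 1 MB per thread

     public static void main(String[] args) throws Exception {
       ExecutorService pool = Executors.newFixedThreadPool(4);
       for (int i = 0; i < 16; i++) {
         // 16 tasks, but at most 4 buffers exist: one per pooled thread.
         pool.submit(() -> perThreadBuffer.get().length).get();
       }
       // No task releases its buffer; the 4 copies remain reachable via the
       // live pool threads until shutdown.
       pool.shutdown();
       pool.awaitTermination(10, TimeUnit.SECONDS);
       System.out.println("done");
     }
   }
   ```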
   
   We have searched related issues such as [LUCENE-9959](https://issues.apache.org/jira/browse/LUCENE-9959) and [LUCENE-10519](https://issues.apache.org/jira/browse/LUCENE-10519), but found no answer to this problem.
   
   ---
   Comparing heap dumps across different JVM heaps and different Lucene versions, I think the root cause is that `LZ4WithPresetDictDecompressor` allocates a buffer as instance state and initializes it in the constructor:
   ```java
       LZ4WithPresetDictDecompressor() {
         compressedLengths = new int[0];
         buffer = new byte[0];
       }
   ```
   
   When the Elasticsearch instance performs `Stored-Fields-Read` operations, this buffer is regrown on the JVM heap but never released, because the ES `currentEngineReference` keeps the reader reachable:
   
![image](https://user-images.githubusercontent.com/12760367/204552928-9e8f2b5f-ce61-4cbb-93eb-bc1fee4a597a.png)
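
   A back-of-envelope calculation (all numbers below are hypothetical, not measured from this cluster) shows why per-thread, per-segment scratch buffers add up quickly on a node with thousands of shards:

   ```java
   // Rough worst-case estimate of retained scratch buffers: each pooled thread
   // can end up holding one grown decompression buffer per segment it touched.
   public class RetainedBufferEstimate {
     public static void main(String[] args) {
       int pooledThreads = 48;            // fixed search/write pool size (assumed)
       int segmentsPerNode = 4000 * 10;   // 4000 shards x ~10 segments each (assumed)
       long bufferBytes = 64 * 1024;      // grown scratch buffer per reader (assumed)
       long worstCase = (long) pooledThreads * segmentsPerNode * bufferBytes;
       System.out.println("worst-case retained: " + (worstCase >> 30) + " GiB");
       // prints "worst-case retained: 117 GiB" -- far beyond a 31 GB heap
     }
   }
   ```

   Even if only a fraction of these buffers are actually grown, the total easily dominates a 31 GB heap, which matches the dump showing `CompressingStoredFieldsReader` at 70% of the heap.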
   
   ### Proposal
   I think we can release this buffer memory when decompression is done. This lets the JVM hold more segment readers on the heap.
   When these buffers are released, the heap metrics look as follows:
   
![image](https://user-images.githubusercontent.com/12760367/204555346-fd6be181-eb8b-4014-9cd1-1e17aee4282e.png)
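
   The idea can be sketched as follows (a hypothetical stand-in, not the actual Lucene patch; the real decompressor does LZ4 decoding with a preset dictionary, which is replaced here by a plain copy):

   ```java
   import java.util.Arrays;

   // Sketch of the proposal: keep the scratch buffer only while decompressing,
   // then shrink it back to empty so idle per-thread readers no longer pin
   // large byte[] allocations on the heap.
   public class ReleasableDecompressor {
     byte[] buffer = new byte[0];

     byte[] decompress(byte[] compressed, int originalLength) {
       if (buffer.length < originalLength) {
         buffer = new byte[originalLength]; // grow on demand, as before
       }
       // ... real LZ4 decoding into `buffer` would happen here ...
       System.arraycopy(compressed, 0, buffer, 0, originalLength);
       byte[] result = Arrays.copyOf(buffer, originalLength);
       buffer = new byte[0]; // the proposed change: release the scratch space
       return result;
     }

     public static void main(String[] args) {
       ReleasableDecompressor d = new ReleasableDecompressor();
       byte[] out = d.decompress(new byte[] {1, 2, 3, 4}, 4);
       System.out.println(Arrays.toString(out)); // [1, 2, 3, 4]
       System.out.println(d.buffer.length);      // 0 -> nothing retained after use
     }
   }
   ```

   The trade-off is re-allocating the buffer on the next call; for workloads with many mostly idle readers, trading a little allocation churn for a bounded idle footprint seems worthwhile.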
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

