weizijun opened a new pull request, #14527:
URL: https://github.com/apache/lucene/pull/14527

   When BBQ is used with Lucene, a single data node can hold more data. As a result, when many shards are merged concurrently, heap memory usage can become very high.
   I found that NeighborArray objects were taking up a lot of memory, and that the number of nodes in a NeighborArray rarely reaches maxSize. Typically only about 1/3 or 1/4 of maxSize is used.
   Therefore, I replaced float[]/int[] with FloatArrayList/IntArrayList, which significantly reduces heap memory usage.
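
The idea of growing the backing arrays on demand, rather than allocating maxSize slots up front, can be sketched as follows. This is a minimal illustration only: the class `GrowableNeighbors`, its growth factor, the initial capacity of 8, and the assumed bound of 128 are all hypothetical and are not taken from the PR, which uses Lucene's FloatArrayList/IntArrayList instead.

```java
import java.util.Arrays;

/** Hypothetical illustration only; this is not the NeighborArray code from the PR. */
final class GrowableNeighbors {
  private int[] nodes;
  private float[] scores;
  private int size;

  GrowableNeighbors(int initialCapacity) {
    // Start small instead of reserving the maximum number of slots.
    nodes = new int[initialCapacity];
    scores = new float[initialCapacity];
  }

  void add(int node, float score) {
    if (size == nodes.length) {
      // Grow by roughly 1.5x (an assumed policy), copying existing entries.
      int newCapacity = Math.max(4, size + (size >> 1));
      nodes = Arrays.copyOf(nodes, newCapacity);
      scores = Arrays.copyOf(scores, newCapacity);
    }
    nodes[size] = node;
    scores[size] = score;
    size++;
  }

  int size() { return size; }

  int capacity() { return nodes.length; }

  int node(int i) { return nodes[i]; }

  float score(int i) { return scores[i]; }

  public static void main(String[] args) {
    int maxSize = 128;      // assumed upper bound, for illustration
    int used = maxSize / 3; // roughly the fill ratio reported above
    GrowableNeighbors g = new GrowableNeighbors(8);
    for (int i = 0; i < used; i++) {
      g.add(i, 1.0f / (i + 1));
    }
    System.out.println("preallocated slots: " + maxSize);
    System.out.println("grown slots: " + g.capacity() + " for " + g.size() + " entries");
  }
}
```

The trade-off in a sketch like this is extra copying when an array grows, in exchange for not holding unused slots for every node in the graph.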
   
   Here is a comparison of the jmap histo results (with the parameter m = 64):
   before:
   ```
    num     #instances         #bytes  class name (module)
   -------------------------------------------------------
      1:      11443026     6396808120  [F (java.base@21.0.2.0.2-AJDK)
      2:      11387631     6129931608  [I (java.base@21.0.2.0.2-AJDK)
      3:       3265644     1319152760  [B (java.base@21.0.2.0.2-AJDK)
      4:      11308339      361866848  org.apache.lucene.util.hnsw.NeighborArray (org.apache.lucene.core@10.0.0-ali1.0.1)
      5:      11134203      267240168  [Lorg.apache.lucene.util.hnsw.NeighborArray; (org.apache.lucene.core@10.0.0-ali1.0.1)
      6:            77       57916272  [[Lorg.apache.lucene.util.hnsw.NeighborArray; (org.apache.lucene.core@10.0.0-ali1.0.1)
      7:       2404231       57701544  java.lang.String (java.base@21.0.2.0.2-AJDK)
      8:         34911       42546120  Ljdk.internal.vm.FillerArray; (java.base@21.0.2.0.2-AJDK)
      9:        772788       30911520  org.nlpcn.commons.lang.tire.domain.SmartForest
     10:        113758       19111344  org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame (org.apache.lucene.core@10.0.0-ali1.0.1)
     11:        545656       17460992  java.util.HashMap$Node (java.base@21.0.2.0.2-AJDK)
   ```
   
   after:
   ```
   num     #instances         #bytes  class name (module)
   -------------------------------------------------------
      1:       9228299     1612257464  [F (java.base@21.0.2.0.2-AJDK)
      2:       9246406     1402537720  [I (java.base@21.0.2.0.2-AJDK)
      3:       3279264     1141869192  [B (java.base@21.0.2.0.2-AJDK)
      4:       9124020      364960800  org.apache.lucene.util.hnsw.NeighborArray (org.apache.lucene.core@10.1.0-reduce-hnsw-size)
      5:       9124036      218976864  org.apache.lucene.internal.hppc.FloatArrayList (org.apache.lucene.core@10.1.0-reduce-hnsw-size)
      6:       9124036      218976864  org.apache.lucene.internal.hppc.IntArrayList (org.apache.lucene.core@10.1.0-reduce-hnsw-size)
      7:       8983027      215608448  [Lorg.apache.lucene.util.hnsw.NeighborArray; (org.apache.lucene.core@10.1.0-reduce-hnsw-size)
      8:       2492594       59822256  java.lang.String (java.base@21.0.2.0.2-AJDK)
      9:            56       51013776  [[Lorg.apache.lucene.util.hnsw.NeighborArray; (org.apache.lucene.core@10.1.0-reduce-hnsw-size)
     10:        772788       30911520  org.nlpcn.commons.lang.tire.domain.SmartForest
     11:         68970       28703992  Ljdk.internal.vm.FillerArray; (java.base@21.0.2.0.2-AJDK)
   ```
   
   The average size of a float[] instance (#bytes divided by #instances) is about 559 bytes before and 174 bytes after.

   The average size of an int[] instance is about 538 bytes before and 151 bytes after.
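
The averages above can be reproduced from the histogram rows by dividing #bytes by #instances. The sketch below does this with integer division; the class and method names are made up for illustration. Note that the `[F` and `[I` rows count every float[] and int[] on the heap, so the averages are only an approximation for the arrays owned by NeighborArray.

```java
/** Hypothetical helper; recomputes the per-instance averages from the jmap rows above. */
final class HistoAverages {
  /** Average bytes per instance, truncated to a whole number. */
  static long avgBytes(long totalBytes, long instances) {
    return totalBytes / instances;
  }

  public static void main(String[] args) {
    // Values copied from the [F and [I rows of the two histograms.
    System.out.println("float[] before: " + avgBytes(6396808120L, 11443026L));
    System.out.println("float[] after:  " + avgBytes(1612257464L, 9228299L));
    System.out.println("int[] before:   " + avgBytes(6129931608L, 11387631L));
    System.out.println("int[] after:    " + avgBytes(1402537720L, 9246406L));
  }
}
```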
   
   I tested several datasets, such as GIST (100K vectors, 960 dimensions) and LAION (100M vectors, 768 dimensions), and they led to similar conclusions.
   I have not benchmarked performance rigorously, but this change does not appear to affect it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
