weizijun opened a new pull request, #14527: URL: https://github.com/apache/lucene/pull/14527
When BBQ is used with Lucene, a single data node can hold more data. As a result, when many shards are merged concurrently, heap usage becomes very high. I found that the NeighborArray objects were taking up most of that memory, and that the number of nodes stored in a NeighborArray never reaches maxSize in practice: only about 1/3 to 1/4 of maxSize is used. I therefore replaced float[]/int[] with FloatArrayList/IntArrayList, which significantly reduces heap usage.

Here is a comparison of the jmap histo results (with m = 64):

before:
```
 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:      11443026     6396808120  [F (java.base@21.0.2.0.2-AJDK)
   2:      11387631     6129931608  [I (java.base@21.0.2.0.2-AJDK)
   3:       3265644     1319152760  [B (java.base@21.0.2.0.2-AJDK)
   4:      11308339      361866848  org.apache.lucene.util.hnsw.NeighborArray (org.apache.lucene.core@10.0.0-ali1.0.1)
   5:      11134203      267240168  [Lorg.apache.lucene.util.hnsw.NeighborArray; (org.apache.lucene.core@10.0.0-ali1.0.1)
   6:            77       57916272  [[Lorg.apache.lucene.util.hnsw.NeighborArray; (org.apache.lucene.core@10.0.0-ali1.0.1)
   7:       2404231       57701544  java.lang.String (java.base@21.0.2.0.2-AJDK)
   8:         34911       42546120  Ljdk.internal.vm.FillerArray; (java.base@21.0.2.0.2-AJDK)
   9:        772788       30911520  org.nlpcn.commons.lang.tire.domain.SmartForest
  10:        113758       19111344  org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame (org.apache.lucene.core@10.0.0-ali1.0.1)
  11:        545656       17460992  java.util.HashMap$Node (java.base@21.0.2.0.2-AJDK)
```

after:
```
 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:       9228299     1612257464  [F (java.base@21.0.2.0.2-AJDK)
   2:       9246406     1402537720  [I (java.base@21.0.2.0.2-AJDK)
   3:       3279264     1141869192  [B (java.base@21.0.2.0.2-AJDK)
   4:       9124020      364960800  org.apache.lucene.util.hnsw.NeighborArray (org.apache.lucene.core@10.1.0-reduce-hnsw-size)
   5:       9124036      218976864  org.apache.lucene.internal.hppc.FloatArrayList (org.apache.lucene.core@10.1.0-reduce-hnsw-size)
   6:       9124036      218976864  org.apache.lucene.internal.hppc.IntArrayList (org.apache.lucene.core@10.1.0-reduce-hnsw-size)
   7:       8983027      215608448  [Lorg.apache.lucene.util.hnsw.NeighborArray; (org.apache.lucene.core@10.1.0-reduce-hnsw-size)
   8:       2492594       59822256  java.lang.String (java.base@21.0.2.0.2-AJDK)
   9:            56       51013776  [[Lorg.apache.lucene.util.hnsw.NeighborArray; (org.apache.lucene.core@10.1.0-reduce-hnsw-size)
  10:        772788       30911520  org.nlpcn.commons.lang.tire.domain.SmartForest
  11:         68970       28703992  Ljdk.internal.vm.FillerArray; (java.base@21.0.2.0.2-AJDK)
```

The average size of a float[] (#bytes divided by #instances) drops from 559 bytes before to 174 bytes after. The average size of an int[] drops from 538 bytes before to 151 bytes after.

I tested several datasets, such as GIST (100K vectors, 960 dimensions) and LAION (100M vectors, 768 dimensions), and they lead to similar conclusions.

I have not benchmarked performance rigorously, but this change does not appear to affect it.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
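The average array sizes quoted in the pull request description can be reproduced from the [F and [I rows of the two jmap histograms. The following is a minimal sketch, assuming the quoted averages are total bytes divided by instance count, truncated to a whole number; the names `HISTO` and `avg_bytes` are illustrative and not part of the pull request.

```python
# (instances, bytes) pairs copied from the [F and [I rows of the
# jmap histograms in the pull request description.
HISTO = {
    "before": {"float[]": (11443026, 6396808120), "int[]": (11387631, 6129931608)},
    "after": {"float[]": (9228299, 1612257464), "int[]": (9246406, 1402537720)},
}


def avg_bytes(instances: int, total_bytes: int) -> int:
    """Average bytes per array instance, truncated to a whole number.

    Truncation is an assumption made here so that the results line up
    with the figures quoted in the description.
    """
    return total_bytes // instances


# Average size per array kind, for the run before and after the change.
averages = {
    phase: {kind: avg_bytes(*row) for kind, row in rows.items()}
    for phase, rows in HISTO.items()
}

for phase, per_kind in averages.items():
    for kind, avg in per_kind.items():
        print(f"{phase:>6} {kind:<8} avg {avg} bytes/instance")
```

Note that the two histograms were taken at different moments with different instance counts, so the per-instance average is a fairer comparison than the raw byte totals.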