kaivalnp opened a new pull request, #14847:
URL: https://github.com/apache/lucene/pull/14847

   ### Description
   
   While indexing a large number of vectors into a single segment, I ran into an error caused by the way we [copy vectors to native memory](https://github.com/apache/lucene/blob/2b47cd3cd69b83c9798bfa109ba05b2666947b8e/lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java#L222-L224) before calling Faiss to create an index:
   
   ```
   Caused by: java.lang.IllegalStateException: Segment is too large to wrap as ByteBuffer. Size: 3276800000
           at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:314)
           at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.checkArraySize(AbstractMemorySegmentImpl.java:374)
           at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:158)
   ```
   
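   We hit this limit because we copy vectors from the heap through a `ByteBuffer` view of the native memory, and a `ByteBuffer` cannot address more than `Integer.MAX_VALUE` bytes (about 2 GB). The segment in the error above was 3,276,800,000 bytes.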
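
To make the limit concrete, here is a minimal sketch of the size arithmetic. The class and method names are hypothetical and not part of the PR. The 200,000 x 4096 breakdown is an inference: the benchmark tables report 1562.5 MB for 100,000 vectors, which is 16,384 bytes per vector if MB means 2^20 bytes, and 200,000 such vectors would total exactly the size in the error message. The PR itself does not state the vector count or dimension.

```java
public class ByteBufferLimit {
    // A java.nio.ByteBuffer's capacity is an int, so it cannot exceed this many bytes.
    static final long MAX_BYTE_BUFFER_BYTES = Integer.MAX_VALUE;

    // Total bytes needed to hold numVectors float32 vectors of the given dimension.
    static long vectorBytes(long numVectors, int dimension) {
        return numVectors * dimension * Float.BYTES;
    }

    // True if a region of this size could be wrapped as a single ByteBuffer.
    static boolean fitsInByteBuffer(long bytes) {
        return bytes <= MAX_BYTE_BUFFER_BYTES;
    }

    public static void main(String[] args) {
        // Assumed shape (inferred, see above): 200,000 vectors of 4096 dimensions.
        long bytes = vectorBytes(200_000, 4096);
        System.out.println(bytes + " bytes, fits in a ByteBuffer: " + fitsInByteBuffer(bytes));
    }
}
```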
   
   As a fix, I've switched to `MemorySegment`-specific functions for copying vectors. I've also moved away from these byte buffers in other places, in favor of more appropriate IO methods.
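
The general idea of copying heap arrays straight into a native `MemorySegment`, without an intermediate `ByteBuffer`, can be sketched as follows. This is not the code from the PR: `NativeVectorCopy` and `copyToNative` are hypothetical names, the sketch assumes the finalized `java.lang.foreign` API (Java 22 or later), and it omits the Faiss calls entirely. Offsets into the segment are `long` values, so the 2 GB ceiling of `ByteBuffer` does not apply.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class NativeVectorCopy {
    // Copies each heap vector into one contiguous native segment (hypothetical helper).
    static MemorySegment copyToNative(Arena arena, float[][] vectors, int dimension) {
        long bytesPerVector = (long) dimension * Float.BYTES;
        MemorySegment segment =
            arena.allocate(bytesPerVector * vectors.length, ValueLayout.JAVA_FLOAT.byteAlignment());
        for (int i = 0; i < vectors.length; i++) {
            // Array-to-segment copy; the destination offset is a long byte offset.
            MemorySegment.copy(
                vectors[i], 0, segment, ValueLayout.JAVA_FLOAT, i * bytesPerVector, dimension);
        }
        return segment;
    }

    public static void main(String[] args) {
        float[][] vectors = {{1f, 2f, 3f}, {4f, 5f, 6f}};
        // The confined arena frees the native memory when the try block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = copyToNative(arena, vectors, 3);
            System.out.println("Copied " + segment.byteSize() + " bytes to native memory");
        }
    }
}
```

In a real integration the segment would be handed to the native library before the arena is closed.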
   
   With these changes, we no longer see the above error and are able to build 
and search an index. Also ran benchmarks for a case where this limit was _not_ 
hit to check for performance impact:
   
   Baseline (on `main`):
   ```
       type  recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
      faiss   0.997        1.855   1.819        0.981  100000   100      50       32        200         no     31.07       3218.44           32.76             1         3152.11      1562.500     1562.500       HNSW
   ```
   
   Candidate (on this PR):
   ```
       type  recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
      faiss   0.998        1.817   1.794        0.987  100000   100      50       32        200         no     29.57       3381.46           33.20             1         3152.11      1562.500     1562.500       HNSW
   ```
   
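   Indexing and search performance are largely unchanged: recall is 0.997 vs. 0.998, search latency is 1.855 ms vs. 1.817 ms, and indexing time is 31.07 s vs. 29.57 s.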



