kaivalnp opened a new pull request, #14847: URL: https://github.com/apache/lucene/pull/14847
### Description

I was trying to index a large number of vectors in a single segment, and ran into an error because of the way we [copy vectors to native memory](https://github.com/apache/lucene/blob/2b47cd3cd69b83c9798bfa109ba05b2666947b8e/lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java#L222-L224) before calling Faiss to create an index:

```
Caused by: java.lang.IllegalStateException: Segment is too large to wrap as ByteBuffer. Size: 3276800000
    at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:314)
    at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.checkArraySize(AbstractMemorySegmentImpl.java:374)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:158)
```

This limit was hit because we use a `ByteBuffer` (backed by native memory) to copy vectors from heap, and a `ByteBuffer` has a 2 GB size limit.

As a fix, I've changed it to use `MemorySegment`-specific functions to copy vectors (also moving away from these byte buffers in other places, and using more appropriate IO methods).

With these changes, we no longer see the above error and are able to build and search an index.
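For readers unfamiliar with the two APIs, here is a minimal sketch of the idea, not the code in this PR. It assumes Java 22 or later (where `java.lang.foreign` is final), and `VectorCopySketch` and `copyToNative` are hypothetical names used only for illustration. A `ByteBuffer` is indexed by `int`, so it cannot cover the 3276800000 bytes from the stack trace, whereas `MemorySegment.copy` takes a `long` destination offset:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical helper for illustration only; not the code in LibFaissC.java.
final class VectorCopySketch {

  // Copies heap vectors into one contiguous native segment, using long offsets throughout.
  static MemorySegment copyToNative(Arena arena, float[][] vectors, int dim) {
    long bytesPerVector = (long) dim * Float.BYTES;
    long totalBytes = bytesPerVector * vectors.length; // may exceed Integer.MAX_VALUE
    MemorySegment segment =
        arena.allocate(totalBytes, ValueLayout.JAVA_FLOAT.byteAlignment());
    long offset = 0;
    for (float[] vector : vectors) {
      // Copy `dim` floats from the heap array straight into native memory.
      MemorySegment.copy(vector, 0, segment, ValueLayout.JAVA_FLOAT, offset, dim);
      offset += bytesPerVector;
    }
    return segment;
  }

  public static void main(String[] args) {
    // The size from the stack trace does not fit in an int, which ByteBuffer requires.
    long sizeFromTrace = 3_276_800_000L;
    System.out.println(sizeFromTrace > Integer.MAX_VALUE); // prints true

    try (Arena arena = Arena.ofConfined()) {
      float[][] vectors = {{1f, 2f}, {3f, 4f}};
      MemorySegment segment = copyToNative(arena, vectors, 2);
      System.out.println(segment.byteSize()); // prints 16
    }
  }
}
```

Because the copy never goes through an intermediate `ByteBuffer` view of the segment, the only size limit left is the amount of native memory available.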
I also ran benchmarks for a case where this limit was _not_ hit, to check for performance impact.

Baseline (on `main`):

```
type   recall  latency(ms)  netCPU  avgCpuCount  nDoc    topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
faiss  0.997   1.855        1.819   0.981        100000  100   50      32       200        no         31.07     3218.44       32.76           1             3152.11         1562.500      1562.500     HNSW
```

Candidate (on this PR):

```
type   recall  latency(ms)  netCPU  avgCpuCount  nDoc    topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
faiss  0.998   1.817        1.794   0.987        100000  100   50      32       200        no         29.57     3381.46       33.20           1             3152.11         1562.500      1562.500     HNSW
```

Indexing and search performance is largely unchanged.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org