mayya-sharipova commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-1607892454
I would like to revisit this issue in light of the recent [integration of the incubating Panama Vector API](https://github.com/apache/lucene/pull/12311), as indexing vectors with it is much faster.

We ran a benchmark, and indexing a dataset of 1536-dim vectors with the Panama Vector API enabled was slightly faster than indexing 1024-dim vectors without it (3141s vs. 3287s). This gives us enough confidence to extend max dims to 2048.

### Test environment
- Dataset:
  - [nq](https://huggingface.co/datasets/BeIR/nq) dataset with the `text` field embedded with the OpenAI `text-embedding-ada-002` model, 1536 dims
- [KnnGraphTester](https://github.com/apache/lucene/blob/branch_9_7/lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java)
  - maxConn: 16, beamWidthIndex: 100
- Apple M1 laptop

### Test1:
- Lucene 9.7 branch
- Panama Vector API not enabled
- vector dims=1024 (OpenAI vectors truncated to the first 1024 dims)
- Results: Indexed 2680961 documents in **3287s**

<details>
<summary>Details</summary>

```
java -cp "lib/*:classes" -Xmx16g -Xms16g org.apache.lucene.util.hnsw.KnnGraphTester -dim 1024 -ndoc 2680961 -reindex -docs vectors_dims1024.bin -maxConn 16 -beamWidthIndex 100
creating index in vectors_dims1024.bin-16-100.index
MS 0 [2023-06-26T11:10:24.765857Z; main]: initDynamicDefaults maxThreadCount=4 maxMergeCount=9
IFD 0 [2023-06-26T11:10:24.782017Z; main]: init: current segments file is "segments"; deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@646d64ab
IFD 0 [2023-06-26T11:10:24.783554Z; main]: now delete 0 files: []
IFD 0 [2023-06-26T11:10:24.784291Z; main]: now checkpoint "" [0 segments ; isCommit = false]
IFD 0 [2023-06-26T11:10:24.784338Z; main]: now delete 0 files: []
IFD 0 [2023-06-26T11:10:24.785377Z; main]: 0 ms to checkpoint
IW 0 [2023-06-26T11:10:24.785523Z; main]: init: create=true reader=null
IW 0 [2023-06-26T11:10:24.790087Z; main]: dir=MMapDirectory@/Users/mayya/Elastic/knn/open_ai_vectors/vectors_dims1024.bin-16-100.index
lockFactory=org.apache.lucene.store.NativeFSLockFactory@2c039ac6
index=
version=9.7.0
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
ramBufferSizeMB=1994.0
maxBufferedDocs=-1
mergedSegmentWarmer=null
delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
commit=null
openMode=CREATE
similarity=org.apache.lucene.search.similarities.BM25Similarity
mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=4, maxMergeCount=9, ioThrottle=true
codec=Lucene95
infoStream=org.apache.lucene.util.PrintStreamInfoStream
mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10, maxMergedSegmentMB=5120.0, floorSegmentMB=2.0, forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0, maxCFSSegmentSizeMB=8.796093022208E12, noCFSRatio=0.1, deletesPctAllowed=20.0
readerPooling=true
perThreadHardLimitMB=1945
useCompoundFile=false
commitOnClose=true
indexSort=null
checkPendingFlushOnUpdate=true
softDeletesField=null
maxFullFlushMergeWaitMillis=500
leafSorter=null
eventListener=org.apache.lucene.index.IndexWriterEventListener$1@2173f6d9
writer=org.apache.lucene.index.IndexWriter@307f6b8c
IW 0 [2023-06-26T11:10:24.790232Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
DWPT 0 [2023-06-26T11:19:47.652040Z; main]: flush postings as segment _0 numDocs=460521
IW 0 [2023-06-26T11:19:47.653761Z; main]: 1 ms to write norms
IW 0 [2023-06-26T11:19:47.653954Z; main]: 0 ms to write docValues
IW 0 [2023-06-26T11:19:47.654032Z; main]: 0 ms to write points
IW 0 [2023-06-26T11:19:49.152263Z; main]: 1498 ms to write vectors
IW 0 [2023-06-26T11:19:49.166472Z; main]: 14 ms to finish stored fields
IW 0 [2023-06-26T11:19:49.166642Z; main]: 0 ms to write postings and finish vectors
IW 0 [2023-06-26T11:19:49.167167Z; main]: 0 ms to write fieldInfos
DWPT 0 [2023-06-26T11:19:49.167954Z; main]: new segment has 0 deleted docs
DWPT 0 [2023-06-26T11:19:49.168030Z; main]: new segment has 0 soft-deleted docs
DWPT 0 [2023-06-26T11:19:49.169572Z; main]: new segment has no vectors; no norms; no docValues; no
prox; freqs
DWPT 0 [2023-06-26T11:19:49.169670Z; main]: flushedFiles=[_0_Lucene95HnswVectorsFormat_0.vem, _0.fdm, _0_Lucene95HnswVectorsFormat_0.vec, _0.fdx, _0_Lucene95HnswVectorsFormat_0.vex, _0.fdt, _0.fnm]
....
Indexed 2680961 documents in 3287s
```

</details>

### Test2
- Lucene 9.7 branch with FloatVectorValues.MAX_DIMENSIONS set to 2048
- Panama Vector API enabled
- dims=1536
- Results: Indexed 2680961 documents in **3141s**

<details>
<summary>Details</summary>

```
java --add-modules jdk.incubator.vector -cp "lib/*:classes" -Xmx16g -Xms16g org.apache.lucene.util.hnsw.KnnGraphTester -dim 1536 -ndoc 2680961 -reindex -docs vectors.bin -maxConn 16 -beamWidthIndex 100
WARNING: Using incubator modules: jdk.incubator.vector
creating index in vectors.bin-16-100.index
Jun 26, 2023 10:34:29 A.M. org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 20; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
MS 0 [2023-06-26T14:34:29.271516Z; main]: initDynamicDefaults maxThreadCount=4 maxMergeCount=9
IFD 0 [2023-06-26T14:34:29.329779Z; main]: init: current segments file is "segments"; deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@64f6106c
IFD 0 [2023-06-26T14:34:29.336415Z; main]: now delete 0 files: []
IFD 0 [2023-06-26T14:34:29.338546Z; main]: now checkpoint "" [0 segments ; isCommit = false]
IFD 0 [2023-06-26T14:34:29.338654Z; main]: now delete 0 files: []
IFD 0 [2023-06-26T14:34:29.347243Z; main]: 2 ms to checkpoint
IW 0 [2023-06-26T14:34:29.348255Z; main]: init: create=true reader=null
IW 0 [2023-06-26T14:34:29.368686Z; main]: dir=MMapDirectory@/Users/mayya/Elastic/knn/open_ai_vectors/vectors.bin-16-100.index lockFactory=org.apache.lucene.store.NativeFSLockFactory@319b92f3
index=
version=9.7.0
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
ramBufferSizeMB=1994.0
maxBufferedDocs=-1
mergedSegmentWarmer=null
delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
commit=null
openMode=CREATE
similarity=org.apache.lucene.search.similarities.BM25Similarity
mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=4, maxMergeCount=9, ioThrottle=true
codec=Lucene95
infoStream=org.apache.lucene.util.PrintStreamInfoStream
mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10, maxMergedSegmentMB=5120.0, floorSegmentMB=2.0, forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0, maxCFSSegmentSizeMB=8.796093022208E12, noCFSRatio=0.1, deletesPctAllowed=20.0
readerPooling=true
perThreadHardLimitMB=1945
useCompoundFile=false
commitOnClose=true
indexSort=null
checkPendingFlushOnUpdate=true
softDeletesField=null
maxFullFlushMergeWaitMillis=500
leafSorter=null
eventListener=org.apache.lucene.index.IndexWriterEventListener$1@10a035a0
writer=org.apache.lucene.index.IndexWriter@67b467e9
IW 0 [2023-06-26T14:34:29.369224Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
Jun 26, 2023 10:34:29 A.M. org.apache.lucene.util.VectorUtilPanamaProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=128
DWPT 0 [2023-06-26T14:40:36.945965Z; main]: flush postings as segment _0 numDocs=314897
IW 0 [2023-06-26T14:40:36.949748Z; main]: 2 ms to write norms
IW 0 [2023-06-26T14:40:36.950336Z; main]: 0 ms to write docValues
IW 0 [2023-06-26T14:40:36.950452Z; main]: 0 ms to write points
IW 0 [2023-06-26T14:40:38.639069Z; main]: 1688 ms to write vectors
IW 0 [2023-06-26T14:40:38.669749Z; main]: 29 ms to finish stored fields
IW 0 [2023-06-26T14:40:38.670044Z; main]: 0 ms to write postings and finish vectors
IW 0 [2023-06-26T14:40:38.670847Z; main]: 0 ms to write fieldInfos
DWPT 0 [2023-06-26T14:40:38.672893Z; main]: new segment has 0 deleted docs
DWPT 0 [2023-06-26T14:40:38.673016Z; main]: new segment has 0 soft-deleted docs
DWPT 0 [2023-06-26T14:40:38.675915Z; main]: new segment has no vectors; no norms; no docValues; no prox; freqs
DWPT 0 [2023-06-26T14:40:38.676120Z; main]:
flushedFiles=[_0_Lucene95HnswVectorsFormat_0.vem, _0.fdm, _0_Lucene95HnswVectorsFormat_0.vec, _0.fdx, _0_Lucene95HnswVectorsFormat_0.vex, _0.fdt, _0.fnm]
DWPT 0 [2023-06-26T14:40:38.676311Z; main]: flushed codec=Lucene95
DWPT 0 [2023-06-26T14:40:38.677609Z; main]: flushed: segment=_0 ramUsed=1,945.012 MB newFlushedSize=1,863.46 MB docs/MB=168.985
DWPT 0 [2023-06-26T14:40:38.680696Z; main]: flush time 1735.77025 ms
IW 0 [2023-06-26T14:40:38.682741Z; main]: publishFlushedSegment seg-private updates=null
IW 0 [2023-06-26T14:40:38.683738Z; main]: publishFlushedSegment _0(9.7.0):C314897:[diagnostics={source=flush, lucene.version=9.7.0, os.version=13.2.1, os.arch=x86_64, os=Mac OS X, java.vendor=Oracle Corporation, java.runtime.version=20.0.1+9-29, timestamp=1687790438678}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=717x28qrd00q2ke3d17eerf4x
BD 0 [2023-06-26T14:40:38.687864Z; main]: finished packet delGen=1 now completedDelGen=1
IW 0 [2023-06-26T14:40:38.691420Z; main]: publish sets newSegment delGen=1 seg=_0(9.7.0):C314897:[diagnostics={source=flush, lucene.version=9.7.0, os.version=13.2.1, os.arch=x86_64, os=Mac OS X, java.vendor=Oracle Corporation, java.runtime.version=20.0.1+9-29, timestamp=1687790438678}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=717x28qrd00q2ke3d17eerf4x
IFD 0 [2023-06-26T14:40:38.692639Z; main]: now checkpoint "_0(9.7.0):C314897:[diagnostics={source=flush, lucene.version=9.7.0, os.version=13.2.1, os.arch=x86_64, os=Mac OS X, java.vendor=Oracle Corporation, java.runtime.version=20.0.1+9-29, timestamp=1687790438678}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=717x28qrd00q2ke3d17eerf4y" [1 segments ; isCommit = false]
IFD 0 [2023-06-26T14:40:38.693268Z; main]: now delete 0 files: []
IFD 0 [2023-06-26T14:40:38.693464Z; main]: 1 ms to checkpoint
MP 0 [2023-06-26T14:40:38.700301Z; main]: seg=_0(9.7.0):C314897:[diagnostics={source=flush, lucene.version=9.7.0, os.version=13.2.1, os.arch=x86_64,
os=Mac OS X, java.vendor=Oracle Corporation, java.runtime.version=20.0.1+9-29, timestamp=1687790438678}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=717x28qrd00q2ke3d17eerf4y size=1863.460 MB
MP 0 [2023-06-26T14:40:38.701368Z; main]: findMerges: 1 segments
MP 0 [2023-06-26T14:40:38.701645Z; main]: allowedSegmentCount=10 vs count=1 (eligible count=1)
...
Indexed 2680961 documents in 3141s
```

</details>
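
As a rough cross-check of the two totals reported above, the following sketch converts them into throughput figures. It uses only the numbers from the two runs (2680961 documents, 3287s at 1024 dims without Panama, 3141s at 1536 dims with Panama). The class and method names are hypothetical and are not part of Lucene or KnnGraphTester. The comment reports a single run per configuration, so the resulting percentages should be read as indicative only.

```java
// Hypothetical helper, not part of Lucene: plain arithmetic on the reported totals.
public class IndexingThroughput {

    // Documents indexed per second of wall-clock time.
    static double docsPerSecond(long docs, long seconds) {
        return (double) docs / seconds;
    }

    // Vector components (floats) indexed per second of wall-clock time.
    static double componentsPerSecond(long docs, int dims, long seconds) {
        return (double) docs * dims / seconds;
    }

    // Fraction by which wall-clock time dropped from the baseline run to the candidate run.
    static double timeReduction(long baselineSeconds, long candidateSeconds) {
        return (double) (baselineSeconds - candidateSeconds) / baselineSeconds;
    }

    public static void main(String[] args) {
        long docs = 2_680_961L;      // documents indexed in both tests
        long test1Seconds = 3287;    // Test1: 1024 dims, Panama Vector API not enabled
        long test2Seconds = 3141;    // Test2: 1536 dims, Panama Vector API enabled

        System.out.printf("Test1: %.1f docs/s%n", docsPerSecond(docs, test1Seconds));
        System.out.printf("Test2: %.1f docs/s%n", docsPerSecond(docs, test2Seconds));
        System.out.printf("Wall-clock time reduction: %.1f%%%n",
                100.0 * timeReduction(test1Seconds, test2Seconds));

        // Test2 handles 1.5x as many components per vector, so compare per-component rates too.
        double ratio = componentsPerSecond(docs, 1536, test2Seconds)
                / componentsPerSecond(docs, 1024, test1Seconds);
        System.out.printf("Components/s, Test2 relative to Test1: %.2fx%n", ratio);
    }
}
```

With these inputs the sketch gives roughly 816 docs/s for Test1 and 854 docs/s for Test2, about 4.4% less wall-clock time, while each Test2 vector carries 1.5 times as many components.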