mayya-sharipova commented on issue #11507:
URL: https://github.com/apache/lucene/issues/11507#issuecomment-1607892454

   I would like to revive this issue in light of the recent [integration of the 
incubating Panama Vector API](https://github.com/apache/lucene/pull/12311), as 
indexing vectors with it is much faster.
   
   We ran a benchmark: indexing a dataset of 1536-dim vectors with the Panama 
Vector API enabled was slightly faster than indexing the same dataset at 1024 
dims without it. This gives us enough confidence to extend max dims to 2048.
   
   ### Test environment 
   - Dataset: 
     - [nq](https://huggingface.co/datasets/BeIR/nq) dataset with `text` field 
embedded with OpenAI `text-embedding-ada-002` model, 1536 dims
   
   - [KnnGraphTester](https://github.com/apache/lucene/blob/branch_9_7/lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java)
   - maxConn: 16, beamWidthIndex: 100
   - Apple M1 laptop
   
   ### Test1
   - Lucene 9.7 branch
   - Panama Vector API not enabled
   - vector dims=1024 (OpenAI vectors truncated to the first 1024 dims)
   - Results: Indexed 2680961 documents in **3287s**
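
For readers who want to reproduce the 1024-dim input, here is a minimal sketch of the truncation step. It assumes the vectors file is a flat sequence of fixed-size float32 records with no header; that layout is an assumption, not something stated in this comment. The function name `truncate_vectors` and the file names in the usage note are hypothetical.

```python
def truncate_vectors(src_path, dst_path, src_dim, dst_dim):
    """Copy the first dst_dim float32 components of every vector.

    Assumes (hypothetically) a headerless file of src_dim * 4-byte records.
    Returns the number of vectors written.
    """
    if not 0 < dst_dim <= src_dim:
        raise ValueError("dst_dim must be between 1 and src_dim")
    src_bytes = src_dim * 4  # one float32 is 4 bytes
    dst_bytes = dst_dim * 4
    count = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            record = src.read(src_bytes)
            if not record:
                break  # clean end of file
            if len(record) != src_bytes:
                raise ValueError("file size is not a multiple of the vector size")
            # Slicing raw bytes keeps the leading components untouched,
            # so the result does not depend on the byte order of the floats.
            dst.write(record[:dst_bytes])
            count += 1
    return count
```

Under those assumptions, a call such as `truncate_vectors("vectors.bin", "vectors_dims1024.bin", 1536, 1024)` would produce the smaller input file one record at a time, without loading the whole dataset into memory.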
   
   <details>
    <summary>Details</summary>
   
   ```
    java -cp  "lib/*:classes" -Xmx16g -Xms16g 
org.apache.lucene.util.hnsw.KnnGraphTester -dim 1024 -ndoc 2680961 -reindex 
-docs vectors_dims1024.bin -maxConn 16 -beamWidthIndex 100
   creating index in vectors_dims1024.bin-16-100.index
   MS 0 [2023-06-26T11:10:24.765857Z; main]: initDynamicDefaults 
maxThreadCount=4 maxMergeCount=9
   IFD 0 [2023-06-26T11:10:24.782017Z; main]: init: current segments file is 
"segments"; 
deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@646d64ab
   IFD 0 [2023-06-26T11:10:24.783554Z; main]: now delete 0 files: []
   IFD 0 [2023-06-26T11:10:24.784291Z; main]: now checkpoint "" [0 segments ; 
isCommit = false]
   IFD 0 [2023-06-26T11:10:24.784338Z; main]: now delete 0 files: []
   IFD 0 [2023-06-26T11:10:24.785377Z; main]: 0 ms to checkpoint
   IW 0 [2023-06-26T11:10:24.785523Z; main]: init: create=true reader=null
   IW 0 [2023-06-26T11:10:24.790087Z; main]:
   
dir=MMapDirectory@/Users/mayya/Elastic/knn/open_ai_vectors/vectors_dims1024.bin-16-100.index
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@2c039ac6
   index=
   version=9.7.0
   analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
   ramBufferSizeMB=1994.0
   maxBufferedDocs=-1
   mergedSegmentWarmer=null
   delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
   commit=null
   openMode=CREATE
   similarity=org.apache.lucene.search.similarities.BM25Similarity
   mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=4, maxMergeCount=9, 
ioThrottle=true
   codec=Lucene95
   infoStream=org.apache.lucene.util.PrintStreamInfoStream
   mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10, 
maxMergedSegmentMB=5120.0, floorSegmentMB=2.0, 
forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0, 
maxCFSSegmentSizeMB=8.796093022208E12, noCFSRatio=0.1, deletesPctAllowed=20.0
   readerPooling=true
   perThreadHardLimitMB=1945
   useCompoundFile=false
   commitOnClose=true
   indexSort=null
   checkPendingFlushOnUpdate=true
   softDeletesField=null
   maxFullFlushMergeWaitMillis=500
   leafSorter=null
   eventListener=org.apache.lucene.index.IndexWriterEventListener$1@2173f6d9
   writer=org.apache.lucene.index.IndexWriter@307f6b8c
   
   IW 0 [2023-06-26T11:10:24.790232Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
   DWPT 0 [2023-06-26T11:19:47.652040Z; main]: flush postings as segment _0 
numDocs=460521
   IW 0 [2023-06-26T11:19:47.653761Z; main]: 1 ms to write norms
   IW 0 [2023-06-26T11:19:47.653954Z; main]: 0 ms to write docValues
   IW 0 [2023-06-26T11:19:47.654032Z; main]: 0 ms to write points
   IW 0 [2023-06-26T11:19:49.152263Z; main]: 1498 ms to write vectors
   IW 0 [2023-06-26T11:19:49.166472Z; main]: 14 ms to finish stored fields
   IW 0 [2023-06-26T11:19:49.166642Z; main]: 0 ms to write postings and finish 
vectors
   IW 0 [2023-06-26T11:19:49.167167Z; main]: 0 ms to write fieldInfos
   DWPT 0 [2023-06-26T11:19:49.167954Z; main]: new segment has 0 deleted docs
   DWPT 0 [2023-06-26T11:19:49.168030Z; main]: new segment has 0 soft-deleted 
docs
   DWPT 0 [2023-06-26T11:19:49.169572Z; main]: new segment has no vectors; no 
norms; no docValues; no prox; freqs
   DWPT 0 [2023-06-26T11:19:49.169670Z; main]: 
flushedFiles=[_0_Lucene95HnswVectorsFormat_0.vem, _0.fdm, 
_0_Lucene95HnswVectorsFormat_0.vec, _0.fdx, _0_Lucene95HnswVectorsFormat_0.vex, 
_0.fdt, _0.fnm]
   ....
   Indexed 2680961 documents in 3287s
   ```
   </details>
   
   
   ### Test2
   - Lucene 9.7 branch with FloatVectorValues.MAX_DIMENSIONS set to 2048
   - Panama Vector API enabled
   - vector dims=1536
   - Results: Indexed 2680961 documents in **3141s** 
   
   <details>
    <summary>Details</summary>
   
   ```
   java --add-modules jdk.incubator.vector -cp  "lib/*:classes" -Xmx16g -Xms16g 
org.apache.lucene.util.hnsw.KnnGraphTester -dim 1536 -ndoc 2680961 -reindex 
-docs vectors.bin -maxConn 16 -beamWidthIndex 100
   
   WARNING: Using incubator modules: jdk.incubator.vector
   creating index in vectors.bin-16-100.index
   Jun 26, 2023 10:34:29 A.M. 
org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
   INFO: Using MemorySegmentIndexInput with Java 20; to disable start with 
-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
   MS 0 [2023-06-26T14:34:29.271516Z; main]: initDynamicDefaults 
maxThreadCount=4 maxMergeCount=9
   IFD 0 [2023-06-26T14:34:29.329779Z; main]: init: current segments file is 
"segments"; 
deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@64f6106c
   IFD 0 [2023-06-26T14:34:29.336415Z; main]: now delete 0 files: []
   IFD 0 [2023-06-26T14:34:29.338546Z; main]: now checkpoint "" [0 segments ; 
isCommit = false]
   IFD 0 [2023-06-26T14:34:29.338654Z; main]: now delete 0 files: []
   IFD 0 [2023-06-26T14:34:29.347243Z; main]: 2 ms to checkpoint
   IW 0 [2023-06-26T14:34:29.348255Z; main]: init: create=true reader=null
   IW 0 [2023-06-26T14:34:29.368686Z; main]:
   
dir=MMapDirectory@/Users/mayya/Elastic/knn/open_ai_vectors/vectors.bin-16-100.index
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@319b92f3
   index=
   version=9.7.0
   analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
   ramBufferSizeMB=1994.0
   maxBufferedDocs=-1
   mergedSegmentWarmer=null
   delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
   commit=null
   openMode=CREATE
   similarity=org.apache.lucene.search.similarities.BM25Similarity
   mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=4, maxMergeCount=9, 
ioThrottle=true
   codec=Lucene95
   infoStream=org.apache.lucene.util.PrintStreamInfoStream
   mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10, 
maxMergedSegmentMB=5120.0, floorSegmentMB=2.0, 
forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0, 
maxCFSSegmentSizeMB=8.796093022208E12, noCFSRatio=0.1, deletesPctAllowed=20.0
   readerPooling=true
   perThreadHardLimitMB=1945
   useCompoundFile=false
   commitOnClose=true
   indexSort=null
   checkPendingFlushOnUpdate=true
   softDeletesField=null
   maxFullFlushMergeWaitMillis=500
   leafSorter=null
   eventListener=org.apache.lucene.index.IndexWriterEventListener$1@10a035a0
   writer=org.apache.lucene.index.IndexWriter@67b467e9
   
   IW 0 [2023-06-26T14:34:29.369224Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
   Jun 26, 2023 10:34:29 A.M. org.apache.lucene.util.VectorUtilPanamaProvider 
<init>
   INFO: Java vector incubator API enabled; uses preferredBitSize=128
   DWPT 0 [2023-06-26T14:40:36.945965Z; main]: flush postings as segment _0 
numDocs=314897
   IW 0 [2023-06-26T14:40:36.949748Z; main]: 2 ms to write norms
   IW 0 [2023-06-26T14:40:36.950336Z; main]: 0 ms to write docValues
   IW 0 [2023-06-26T14:40:36.950452Z; main]: 0 ms to write points
   IW 0 [2023-06-26T14:40:38.639069Z; main]: 1688 ms to write vectors
   IW 0 [2023-06-26T14:40:38.669749Z; main]: 29 ms to finish stored fields
   IW 0 [2023-06-26T14:40:38.670044Z; main]: 0 ms to write postings and finish 
vectors
   IW 0 [2023-06-26T14:40:38.670847Z; main]: 0 ms to write fieldInfos
   DWPT 0 [2023-06-26T14:40:38.672893Z; main]: new segment has 0 deleted docs
   DWPT 0 [2023-06-26T14:40:38.673016Z; main]: new segment has 0 soft-deleted 
docs
   DWPT 0 [2023-06-26T14:40:38.675915Z; main]: new segment has no vectors; no 
norms; no docValues; no prox; freqs
   DWPT 0 [2023-06-26T14:40:38.676120Z; main]: 
flushedFiles=[_0_Lucene95HnswVectorsFormat_0.vem, _0.fdm, 
_0_Lucene95HnswVectorsFormat_0.vec, _0.fdx, _0_Lucene95HnswVectorsFormat_0.vex, 
_0.fdt, _0.fnm]
   DWPT 0 [2023-06-26T14:40:38.676311Z; main]: flushed codec=Lucene95
   DWPT 0 [2023-06-26T14:40:38.677609Z; main]: flushed: segment=_0 
ramUsed=1,945.012 MB newFlushedSize=1,863.46 MB docs/MB=168.985
   DWPT 0 [2023-06-26T14:40:38.680696Z; main]: flush time 1735.77025 ms
   IW 0 [2023-06-26T14:40:38.682741Z; main]: publishFlushedSegment seg-private 
updates=null
   IW 0 [2023-06-26T14:40:38.683738Z; main]: publishFlushedSegment 
_0(9.7.0):C314897:[diagnostics={source=flush, lucene.version=9.7.0, 
os.version=13.2.1, os.arch=x86_64, os=Mac OS X, java.vendor=Oracle Corporation, 
java.runtime.version=20.0.1+9-29, 
timestamp=1687790438678}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}]
 :id=717x28qrd00q2ke3d17eerf4x
   BD 0 [2023-06-26T14:40:38.687864Z; main]: finished packet delGen=1 now 
completedDelGen=1
   IW 0 [2023-06-26T14:40:38.691420Z; main]: publish sets newSegment delGen=1 
seg=_0(9.7.0):C314897:[diagnostics={source=flush, lucene.version=9.7.0, 
os.version=13.2.1, os.arch=x86_64, os=Mac OS X, java.vendor=Oracle Corporation, 
java.runtime.version=20.0.1+9-29, 
timestamp=1687790438678}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}]
 :id=717x28qrd00q2ke3d17eerf4x
   IFD 0 [2023-06-26T14:40:38.692639Z; main]: now checkpoint 
"_0(9.7.0):C314897:[diagnostics={source=flush, lucene.version=9.7.0, 
os.version=13.2.1, os.arch=x86_64, os=Mac OS X, java.vendor=Oracle Corporation, 
java.runtime.version=20.0.1+9-29, 
timestamp=1687790438678}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}]
 :id=717x28qrd00q2ke3d17eerf4y" [1 segments ; isCommit = false]
   IFD 0 [2023-06-26T14:40:38.693268Z; main]: now delete 0 files: []
   IFD 0 [2023-06-26T14:40:38.693464Z; main]: 1 ms to checkpoint
   MP 0 [2023-06-26T14:40:38.700301Z; main]:   
seg=_0(9.7.0):C314897:[diagnostics={source=flush, lucene.version=9.7.0, 
os.version=13.2.1, os.arch=x86_64, os=Mac OS X, java.vendor=Oracle Corporation, 
java.runtime.version=20.0.1+9-29, 
timestamp=1687790438678}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}]
 :id=717x28qrd00q2ke3d17eerf4y size=1863.460 MB
   MP 0 [2023-06-26T14:40:38.701368Z; main]: findMerges: 1 segments
   MP 0 [2023-06-26T14:40:38.701645Z; main]:   allowedSegmentCount=10 vs 
count=1 (eligible count=1)   
    ...
   
   Indexed 2680961 documents in 3141s
   ```
   </details>
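
To put the two wall-clock results side by side, the following sketch derives throughput and the relative time difference from the numbers reported above. Only the document count, the two durations, and the two dimension counts come from the tests; the helper name `docs_per_second` is hypothetical, and this is simple arithmetic rather than a new measurement.

```python
def docs_per_second(num_docs, seconds):
    """Average indexing throughput over the whole run."""
    return num_docs / seconds

NUM_DOCS = 2_680_961   # documents indexed in both tests
TEST1_SECONDS = 3287   # 1024 dims, Panama Vector API not enabled
TEST2_SECONDS = 3141   # 1536 dims, Panama Vector API enabled

rate1 = docs_per_second(NUM_DOCS, TEST1_SECONDS)
rate2 = docs_per_second(NUM_DOCS, TEST2_SECONDS)

# Fraction of Test1's wall time saved by Test2.
time_saved = (TEST1_SECONDS - TEST2_SECONDS) / TEST1_SECONDS

# Test2 processes this many times more floats per vector than Test1.
floats_ratio = 1536 / 1024

print(f"Test1: {rate1:.1f} docs/s")
print(f"Test2: {rate2:.1f} docs/s")
print(f"Wall time saved: {time_saved:.1%}")
print(f"Floats per vector ratio: {floats_ratio:.1f}x")
```

This works out to roughly 816 docs/s for Test1 against 854 docs/s for Test2, or about 4.4% less wall time while each vector carries 1.5x as many floats. These are single runs on one laptop, so the figures are indicative only.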


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

