mayya-sharipova commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-1629661053
@rmuir
> Can we run this test with lucene's defaults (e.g. not a 2GB rambuffer)?

I've run the test and, surprisingly, indexing time decreased substantially. Indexing with Lucene's defaults is about 1.7 times faster than with a 2 GB RAM buffer (1877 s vs. 3141 s), at the cost of ending up with more segments.

- Lucene 9.7 branch with FloatVectorValues.MAX_DIMENSIONS set to 2048
- preferredBitSize=128
- Panama Vector API enabled
- vector dims: 1536
- num of docs: 2.68M

| RAM buffer size | Indexing time | Num of segments |
|----------: |-------------:|------:|
| 16 MB | 1877 s | 19 |
| 1994 MB | 3141 s | 9 |

<details>
<summary>Details</summary>

```
WARNING: Using incubator modules: jdk.incubator.vector
Jul 10, 2023 3:35:25 P.M. org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 20; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
Jul 10, 2023 3:35:26 P.M. org.apache.lucene.util.VectorUtilPanamaProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=128
_fc.fdt _v6.fnm _vj.si _vr_Lucene95HnswVectorsFormat_0.vec
_fc.fdx _v6.si _vj_Lucene95HnswVectorsFormat_0.vec _vr_Lucene95HnswVectorsFormat_0.vem
_fc.fnm _v6_Lucene95HnswVectorsFormat_0.vec _vj_Lucene95HnswVectorsFormat_0.vem _vr_Lucene95HnswVectorsFormat_0.vex
_fc.si _v6_Lucene95HnswVectorsFormat_0.vem _vj_Lucene95HnswVectorsFormat_0.vex _vs.fdm
_fc_Lucene95HnswVectorsFormat_0.vec _v6_Lucene95HnswVectorsFormat_0.vex _vl.fdm _vs.fdt
creating index in vectors.bin-16-100.index
MS 0 [2023-07-10T14:47:25.668178Z; main]: initDynamicDefaults maxThreadCount=4 maxMergeCount=9
IFD 0 [2023-07-10T14:47:25.725823Z; main]: init: current segments file is "segments"; deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@64f6106c
IFD 0 [2023-07-10T14:47:25.735809Z; main]: now delete 0 files: []
IFD 0 [2023-07-10T14:47:25.738456Z; main]: now checkpoint "" [0 segments ; isCommit = false]
IFD 0 [2023-07-10T14:47:25.738587Z; main]: now delete 0 files: []
IFD 0 [2023-07-10T14:47:25.743719Z; main]: 2 ms to checkpoint
IW 0 [2023-07-10T14:47:25.744195Z; main]: init: create=true reader=null
IW 0 [2023-07-10T14:47:25.779752Z; main]: dir=MMapDirectory@/Users/mayya/Elastic/knn/open_ai_vectors/vectors.bin-16-100.index lockFactory=org.apache.lucene.store.NativeFSLockFactory@319b92f3
index=
version=9.7.0
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
ramBufferSizeMB=16.0
maxBufferedDocs=-1
mergedSegmentWarmer=null
delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
commit=null
openMode=CREATE
similarity=org.apache.lucene.search.similarities.BM25Similarity
mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=4, maxMergeCount=9, ioThrottle=true
codec=Lucene95
infoStream=org.apache.lucene.util.PrintStreamInfoStream
mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10, maxMergedSegmentMB=5120.0, floorSegmentMB=2.0, forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0, maxCFSSegmentSizeMB=8.796093022208E12, noCFSRatio=0.1, deletesPctAllowed=20.0
readerPooling=true
perThreadHardLimitMB=1945
useCompoundFile=false
commitOnClose=true
indexSort=null
checkPendingFlushOnUpdate=true
softDeletesField=null
maxFullFlushMergeWaitMillis=500
leafSorter=null
eventListener=org.apache.lucene.index.IndexWriterEventListener$1@10a035a0
writer=org.apache.lucene.index.IndexWriter@67b467e9
IW 0 [2023-07-10T14:47:25.780320Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
FP 0 [2023-07-10T14:47:27.042597Z; main]: trigger flush: activeBytes=16779458 deleteBytes=0 vs ramBufferMB=16.0
FP 0 [2023-07-10T14:47:27.045564Z; main]: thread state has 16779458 bytes; docInRAM=2589
FP 0 [2023-07-10T14:47:27.049109Z; main]: 1 in-use non-flushing threads states
DWPT 0 [2023-07-10T14:47:27.050859Z; main]: flush postings as segment _0 numDocs=2589
....
Indexed 2680961 documents in 1877s
```

</details>
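
As a rough cross-check of the figures in the table, the Python sketch below derives the indexing throughput, the average segment size, and the relative speedup for the two runs. It is not part of the original benchmark: the names `NUM_DOCS`, `runs`, and `summarize` are illustrative, and it assumes both runs indexed the same 2680961 documents that the log reports for the 16 MB run.

```python
# Figures copied from the table and log in the comment; names are illustrative.
# Assumption: both runs indexed the same number of documents.
NUM_DOCS = 2_680_961  # from "Indexed 2680961 documents in 1877s"

runs = {
    # RAM buffer size in MB: (indexing time in seconds, number of segments)
    16: (1877, 19),
    1994: (3141, 9),
}


def summarize(num_docs, seconds, segments):
    """Return (documents indexed per second, average documents per segment)."""
    return num_docs / seconds, num_docs / segments


summary = {mb: summarize(NUM_DOCS, secs, segs) for mb, (secs, segs) in runs.items()}

# How many times faster the 16 MB run finished compared with the 1994 MB run.
speedup = runs[1994][0] / runs[16][0]

for mb, (docs_per_sec, docs_per_segment) in summary.items():
    print(f"{mb} MB buffer: {docs_per_sec:.0f} docs/s, "
          f"{docs_per_segment:.0f} docs/segment on average")
print(f"16 MB run finished {speedup:.2f}x faster than the 1994 MB run")
```

The averages only describe the final segment counts reported in the table; they say nothing about how merges during indexing contributed to the total time.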