mayya-sharipova commented on pull request #728: URL: https://github.com/apache/lucene/pull/728#issuecomment-1059210295
@rmuir Thanks a lot for your review and for explaining IndexWriter's behavior.

> If IndexWriter shouldn't buffer vectors, then can it simply stream vectors to the codec api? This would be similar to how StoredFields and TermVectors work today (see e.g. StoredFieldsConsumer).

That's a great suggestion. I can explore how we can stream vectors directly to the codec and build HNSW graphs on the fly.

> I'm suspicious of the reported performance improvement based on looking at your benchmark output, I don't think its realistic. Looks like nothing else was indexed in any other way (docvalues/postings/etc), nobody ever called reopen() to force any flushes, so with the benchmark you ran, IW just wrote one big segment, avoiding all merging. So everything looks fantastic on paper, but this isn't realistic. ...It is easy to run into the same trap when benchmarking e.g. stored fields and other things. But it isn't really a performance improvement.

Thanks for your comment. With this patch I am addressing the scenario of initial data loading, which is common in the vector search domain: there is only bulk indexing, with no background searches.
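
To make the buffering-versus-streaming distinction discussed above concrete, here is a minimal, self-contained sketch. It is not Lucene code: `VectorSink`, `CountingSink`, `BufferingConsumer`, and `StreamingConsumer` are hypothetical names invented for illustration, and the sink only counts vectors where a real codec writer might insert each one into an HNSW graph.

```java
import java.util.ArrayList;
import java.util.List;

public class StreamingVectorsSketch {

    /** Hypothetical stand-in for a codec-level writer; not a Lucene interface. */
    interface VectorSink {
        void addVector(int docId, float[] vector);

        /** Finishes writing and returns how many vectors were received. */
        int finish();
    }

    /** Toy sink: only counts vectors. A real one might update a graph per vector. */
    static final class CountingSink implements VectorSink {
        private int count;

        @Override
        public void addVector(int docId, float[] vector) {
            count++;
        }

        @Override
        public int finish() {
            return count;
        }
    }

    /** Buffering approach: every vector is held on the heap until flush. */
    static final class BufferingConsumer {
        private final List<Integer> docIds = new ArrayList<>();
        private final List<float[]> vectors = new ArrayList<>();

        void add(int docId, float[] vector) {
            docIds.add(docId);
            vectors.add(vector.clone());
        }

        int heldInMemory() {
            return vectors.size();
        }

        int flush(VectorSink sink) {
            // Only at flush time does the sink see the vectors.
            for (int i = 0; i < vectors.size(); i++) {
                sink.addVector(docIds.get(i), vectors.get(i));
            }
            docIds.clear();
            vectors.clear();
            return sink.finish();
        }
    }

    /** Streaming approach: each vector is forwarded immediately, nothing is retained here. */
    static final class StreamingConsumer {
        private final VectorSink sink;

        StreamingConsumer(VectorSink sink) {
            this.sink = sink;
        }

        void add(int docId, float[] vector) {
            sink.addVector(docId, vector);
        }

        int heldInMemory() {
            return 0;
        }

        int flush() {
            return sink.finish();
        }
    }

    public static void main(String[] args) {
        float[][] data = {{1f, 0f}, {0f, 1f}, {1f, 1f}};

        BufferingConsumer buffering = new BufferingConsumer();
        StreamingConsumer streaming = new StreamingConsumer(new CountingSink());
        for (int doc = 0; doc < data.length; doc++) {
            buffering.add(doc, data[doc]);
            streaming.add(doc, data[doc]);
        }

        System.out.println("held by buffering consumer: " + buffering.heldInMemory());
        System.out.println("held by streaming consumer: " + streaming.heldInMemory());
        System.out.println("written via buffering: " + buffering.flush(new CountingSink()));
        System.out.println("written via streaming: " + streaming.flush());
    }
}
```

Both consumers hand the same vectors to the sink; the difference is where the memory lives between `add` and `flush`. Whether streaming actually helps in Lucene depends on what the codec writer itself has to keep in memory, which this sketch does not model.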
