msokolov commented on PR #11743: URL: https://github.com/apache/lucene/pull/11743#issuecomment-1238330802
OK, thanks for the reminder of the arguments for moving the graph creation to index time.

> May be, we can come up with some sophisticated solution, writing vector values in batches to several files, but not sure if this complexity worth it.

Right, we could buffer up to some % of the IndexWriter buffer size in RAM, then write to a (list of) temporary file(s), freeing RAM and thenceforth accumulating new writes in RAM. Kind of like a pre-flush flush? Reading would require a wrapper that presents this all as a single VectorValues. It is more complex, but it seems like it could be worthwhile, since it would help reduce the pressure on the index writer to flush "prematurely," and this HNSW stuff is sensitive to being fragmented.

The current situation is not terrible; eventually, merging should improve the index geometry. I don't think we have a blocker to release. At any rate, for the typical use cases I have in mind, the index size is still dominated by other types of fields, so this is unlikely to be a problem. Although it looks worse for a vectors-only index, I think that exaggerates the typical impact? Not sure how it looks from other perspectives, though.
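To make the "pre-flush flush" idea above a bit more concrete, here is a rough sketch in plain Java. It is not Lucene code and uses no Lucene APIs: `SpillingVectorBuffer`, `maxVectorsInRam`, and `spillFileCount` are hypothetical names, and the RAM budget is counted in vectors rather than as a percentage of the IndexWriter buffer, purely to keep the example short. It buffers vectors in RAM, spills full batches to temporary files, and exposes one ordinal-based view over both.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not a Lucene class: RAM buffer that spills to temp files. */
final class SpillingVectorBuffer implements AutoCloseable {
  private final int dim;
  private final int maxVectorsInRam; // assumed stand-in for "some % of the writer buffer"
  private final List<float[]> ram = new ArrayList<>();
  private final List<Path> spillFiles = new ArrayList<>();
  private int size; // total number of vectors added so far

  SpillingVectorBuffer(int dim, int maxVectorsInRam) {
    if (dim < 1 || maxVectorsInRam < 1) {
      throw new IllegalArgumentException("dim and maxVectorsInRam must be >= 1");
    }
    this.dim = dim;
    this.maxVectorsInRam = maxVectorsInRam;
  }

  /** Adds a vector; once the RAM batch is full it is written out and RAM is freed. */
  void add(float[] vector) {
    if (vector.length != dim) {
      throw new IllegalArgumentException("expected dimension " + dim);
    }
    ram.add(vector.clone());
    size++;
    if (ram.size() >= maxVectorsInRam) {
      spill();
    }
  }

  /** Writes the current RAM batch to a new temp file. Every spill holds exactly maxVectorsInRam vectors. */
  private void spill() {
    ByteBuffer bytes = ByteBuffer.allocate(ram.size() * dim * Float.BYTES);
    for (float[] v : ram) {
      for (float f : v) {
        bytes.putFloat(f);
      }
    }
    bytes.flip();
    try {
      Path file = Files.createTempFile("vectors", ".spill");
      try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
        while (bytes.hasRemaining()) {
          ch.write(bytes);
        }
      }
      spillFiles.add(file);
      ram.clear();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Single view over spilled and in-RAM vectors, addressed by insertion ordinal. */
  float[] vectorValue(int ord) {
    if (ord < 0 || ord >= size) {
      throw new IndexOutOfBoundsException("ord=" + ord);
    }
    int ramStart = spillFiles.size() * maxVectorsInRam;
    if (ord >= ramStart) {
      return ram.get(ord - ramStart).clone();
    }
    Path file = spillFiles.get(ord / maxVectorsInRam);
    long offset = (long) (ord % maxVectorsInRam) * dim * Float.BYTES;
    ByteBuffer bytes = ByteBuffer.allocate(dim * Float.BYTES);
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      while (bytes.hasRemaining()) {
        if (ch.read(bytes, offset + bytes.position()) < 0) {
          throw new IOException("unexpected end of spill file " + file);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    bytes.flip();
    float[] v = new float[dim];
    bytes.asFloatBuffer().get(v);
    return v;
  }

  int size() {
    return size;
  }

  int spillFileCount() {
    return spillFiles.size();
  }

  /** Deletes the temporary spill files. */
  @Override
  public void close() {
    try {
      for (Path file : spillFiles) {
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A real version would presumably track bytes against the writer's RAM accounting and keep the spill files open for reading; this sketch reopens a file on every lookup, which is simple but slow, and is only meant to show the shape of the wrapper.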