msokolov commented on PR #11743:
URL: https://github.com/apache/lucene/pull/11743#issuecomment-1238330802

   OK, thanks for the reminder of the arguments for moving the graph creation 
to index time.
   
   > May be, we can come up with some sophisticated solution, writing vector 
values in batches to several files, but not sure if this complexity worth it.
   
   Right, we could buffer vectors in RAM up to some percentage of the 
IndexWriter buffer size, then write them to a (list of) temporary file(s), 
freeing that RAM, and from then on accumulate new writes in RAM again. Kind of 
like a pre-flush flush? Reading would require a wrapper that presents all of 
this as a single VectorValues. It is more complex, but it seems like it could 
be worthwhile, since it would reduce the pressure on the index writer to flush 
"prematurely," and HNSW is sensitive to fragmentation.
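   
   To make the idea concrete, here is a rough, standalone sketch of the 
buffer-then-spill scheme with a single read view over everything. It is not 
Lucene code: `SpillingVectorBuffer`, `maxVectorsInRam`, `vectorValue`, and 
`spillFileCount` are names made up for illustration, the RAM budget is counted 
in vectors rather than bytes, and plain `java.nio` temp files stand in for 
whatever the codec would really use.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not a Lucene class): buffers vectors in RAM, spills to temp files. */
class SpillingVectorBuffer implements AutoCloseable {
  private final int dim;
  private final int maxVectorsInRam; // assumed stand-in for "some % of the IndexWriter buffer"
  private final List<float[]> ram = new ArrayList<>();
  private final List<FileChannel> spills = new ArrayList<>();
  private final List<Integer> spillStart = new ArrayList<>(); // first ordinal in each spill file
  private int spilledCount = 0;

  SpillingVectorBuffer(int dim, int maxVectorsInRam) {
    this.dim = dim;
    this.maxVectorsInRam = maxVectorsInRam;
  }

  /** Accumulates a vector in RAM; spills the whole RAM batch once the budget is reached. */
  void add(float[] v) {
    if (v.length != dim) throw new IllegalArgumentException("expected dimension " + dim);
    ram.add(v.clone());
    if (ram.size() >= maxVectorsInRam) spill();
  }

  /** The "pre-flush flush": write the RAM batch to a new temp file and free the RAM. */
  private void spill() {
    try {
      Path p = Files.createTempFile("vectors", ".spill");
      FileChannel ch = FileChannel.open(p, StandardOpenOption.READ,
          StandardOpenOption.WRITE, StandardOpenOption.DELETE_ON_CLOSE);
      ByteBuffer buf = ByteBuffer.allocate(ram.size() * dim * Float.BYTES);
      for (float[] v : ram) for (float f : v) buf.putFloat(f);
      buf.flip();
      while (buf.hasRemaining()) ch.write(buf);
      spills.add(ch);
      spillStart.add(spilledCount);
      spilledCount += ram.size();
      ram.clear();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  int size() { return spilledCount + ram.size(); }

  int spillFileCount() { return spills.size(); }

  /** Single view over all spill files plus the RAM tail, addressed by insertion ordinal. */
  float[] vectorValue(int ord) {
    if (ord < 0 || ord >= size()) throw new IndexOutOfBoundsException("ord=" + ord);
    if (ord >= spilledCount) return ram.get(ord - spilledCount);
    int i = spills.size() - 1;
    while (spillStart.get(i) > ord) i--; // last spill file starting at or before ord
    long pos = (long) (ord - spillStart.get(i)) * dim * Float.BYTES;
    ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES);
    try {
      while (buf.hasRemaining()) {
        if (spills.get(i).read(buf, pos + buf.position()) < 0) {
          throw new IOException("unexpected end of spill file");
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    buf.flip();
    float[] out = new float[dim];
    buf.asFloatBuffer().get(out);
    return out;
  }

  @Override
  public void close() {
    try {
      for (FileChannel ch : spills) ch.close(); // DELETE_ON_CLOSE removes the temp files
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

   A caller would just `add` vectors as documents arrive and later read them 
back by ordinal through `vectorValue`, without caring whether a given vector 
lives in a spill file or in RAM. A real version would have to account for bytes 
against the writer's actual buffer and handle deletes and doc IDs, which this 
sketch ignores.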
   
   The current situation is not terrible; eventually, merging should improve 
the index geometry. I don't think we have a blocker to release. At any rate, 
for the typical use cases I have in mind, the index size is still dominated by 
other types of fields, so this is unlikely to be a problem. Although it looks 
worse for a vectors-only index, I think that exaggerates the typical impact? 
I'm not sure how it looks from other perspectives, though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

