LuXugang commented on PR #728:
URL: https://github.com/apache/lucene/pull/728#issuecomment-1136706160

   It seems the core part is how to avoid loading all vector values of all fields into memory during indexing. IIUC, as @rmuir said, we could stream vectors to the codec API directly. A rough draft codec of `.vec` might look like this:
   <img width="839" alt="image" src="https://user-images.githubusercontent.com/6985548/170176157-76bf2506-6c4b-480f-8191-919443077b15.png">
   
   
   This is similar to how `.fdx` writes stored values on the fly. After the `.vec` file is closed, we read it back and build the HNSW graph.
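   The write-then-read idea above can be sketched without Lucene internals: append each vector to a temp file as it arrives (so no vector is held on the heap longer than needed), then memory-map the file when building the graph. All names here are illustrative assumptions, not Lucene's actual codec API:

   ```java
   import java.io.BufferedOutputStream;
   import java.io.DataOutputStream;
   import java.io.IOException;
   import java.nio.MappedByteBuffer;
   import java.nio.channels.FileChannel;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.StandardOpenOption;

   // Hypothetical sketch: stream vectors to a temp file, then read them back
   // by ord for graph construction. Not the real Lucene92 codec.
   public class StreamingVecSketch {
       public static void main(String[] args) throws IOException {
           int dim = 4;
           Path tmp = Files.createTempFile("vec", ".tmp");

           // Write phase: vectors are appended one at a time, never all in memory.
           try (DataOutputStream out = new DataOutputStream(
                   new BufferedOutputStream(Files.newOutputStream(tmp)))) {
               for (float[] v : new float[][] {{1, 2, 3, 4}, {5, 6, 7, 8}}) {
                   for (float f : v) out.writeFloat(f);
               }
           }

           // Read phase: random access by ord while building the HNSW graph.
           try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
               MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
               int ord = 1;
               buf.position(ord * dim * Float.BYTES);
               System.out.println(buf.getFloat()); // first component of vector #1
           }
           Files.delete(tmp);
       }
   }
   ```

   `DataOutputStream` and `ByteBuffer` both default to big-endian, so the read side decodes the floats that the write side produced.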
   
   We could locate a field's partial vector values within a `chunk` by node and doc, but that is certainly a bit slower than storing all of one field's vector values in one contiguous interval, where a vector value can be randomly accessed by ord(node) and dimension.
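   The random-access case mentioned above reduces to simple offset arithmetic when one field's vectors sit in a contiguous interval of fixed-size float vectors; a minimal sketch (assuming such a layout, not any specific Lucene file format):

   ```java
   // Hypothetical sketch: byte offset of a vector in a contiguous float-vector
   // region, addressed by ord and dimension. Each float occupies 4 bytes.
   public class VectorOffsets {
       static long vectorOffset(int ord, int dimension) {
           // vectors are laid out back-to-back: ord 0, ord 1, ord 2, ...
           return (long) ord * dimension * Float.BYTES;
       }

       public static void main(String[] args) {
           // vector #3 in a 128-dimensional float index starts at byte 1536
           System.out.println(vectorOffset(3, 128));
       }
   }
   ```

   With the chunked layout, by contrast, a lookup first has to find the right chunk for the (field, doc) pair before it can do this arithmetic, which is where the extra cost comes from.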
   
   > If a user had 100 vector fields, then now we might have 100+ files being 
written concurrently, multiplied by the number of segments we're writing at the 
same time. It seems like this could cause problems 
   
   @jtibshirani, alternatively, should we still write all fields' values to a single temp file as in the picture above, and then, when a flush is triggered, read that temp file and produce the Lucene92 codec format?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
