kaivalnp commented on PR #14178:
URL: https://github.com/apache/lucene/pull/14178#issuecomment-2621943260

   > Maybe it can be just as fast by not reading the floating point vectors on 
to heap and doing memory segment stuff
   
   Interesting, do we have a Lucene PR that explores it?
   
   > Does FAISS index needs the "flat" vector storage at all? I thought FAISS 
gave direct access to the vector values based on ordinals?
   
   Faiss does not need (and does not use) the flat vectors at search time, and it does provide access to the underlying vectors by ordinal -- but these are "reconstructed" vectors and [may be lossy](https://github.com/facebookresearch/faiss/blob/9e03ef0bda4320d05d03570deb0ab14feec1054d/faiss/Index.h#L188) in some indexes (for example, PQ)
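
   A minimal sketch with the Faiss Python bindings on synthetic data (all sizes and parameters here are illustrative): a flat index reconstructs its vectors exactly, while a PQ index returns lossy approximations

   ```python
   import faiss
   import numpy as np

   d = 64
   xb = np.random.random((10_000, d)).astype("float32")

   flat = faiss.IndexFlatL2(d)
   flat.add(xb)
   assert np.allclose(flat.reconstruct(0), xb[0])  # exact: flat indexes store raw vectors

   pq = faiss.IndexPQ(d, 8, 8)  # 8 sub-quantizers, 8 bits each
   pq.train(xb)
   pq.add(xb)
   err = np.linalg.norm(pq.reconstruct(0) - xb[0])
   print(f"PQ reconstruction error: {err:.4f}")  # nonzero: quantization is lossy
   ```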
   
   Because of this loss of information, vectors would become increasingly approximate with each merge (where we read back all vectors and create a fresh index) -- which is not desirable. A rough simulation of this compounding is sketched below
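
   A hedged simulation of that compounding (again synthetic data; the reconstruct-and-reindex loop stands in for a merge):

   ```python
   import faiss
   import numpy as np

   d, n = 64, 10_000
   xb = np.random.random((n, d)).astype("float32")

   vectors = xb
   for generation in range(3):
       pq = faiss.IndexPQ(d, 8, 8)
       pq.train(vectors)
       pq.add(vectors)
       # simulate a merge: read back all vectors and build a fresh index from them
       vectors = pq.reconstruct_n(0, n)
       err = np.linalg.norm(vectors - xb, axis=1).mean()
       print(f"after merge {generation + 1}: mean error vs originals = {err:.4f}")
   ```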
   
   We could technically store the raw vectors within Faiss too (say, in another flat index) -- but exposing them via a `FloatVectorValues` would require multiple native "reconstruct" calls. Storage would be similar either way, so I just went with one of Lucene's flat formats (which also gives finer control over memory-mapping)
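
   For concreteness, a sketch of that alternative (names are illustrative; the loop emulates a `FloatVectorValues`-style read path, costing one native call per ordinal):

   ```python
   import faiss
   import numpy as np

   d = 64
   xb = np.random.random((10_000, d)).astype("float32")

   search_index = faiss.IndexPQ(d, 8, 8)  # lossy index used for search
   raw_store = faiss.IndexFlatL2(d)       # flat index holding the raw vectors
   search_index.train(xb)
   search_index.add(xb)
   raw_store.add(xb)

   # one native "reconstruct" call per ordinal, even though storage is flat
   for ordinal in range(raw_store.ntotal):
       vector = raw_store.reconstruct(ordinal)  # exact, since the index is flat
   ```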
   
   > all vectors are buffered onto heap, which is pretty dang expensive
   
   +1 -- I'd like to reduce memory pressure in the future. One thing @msokolov pointed out offline is that we're writing the flat format anyway -- we could flush it first and then read the vectors back (this time disk-backed, avoiding the second in-memory copy). I'm not sure if the current APIs allow it, but I'll try to address this
   
   The lowest memory usage would come from adding vectors to the Faiss index one by one and not storing them on heap at all, but I suspect this would hurt indexing performance due to multiple native calls (one per document). We could possibly index vectors in "batches" as a middle ground, along the lines of the sketch below
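
   A rough sketch of such batching, assuming a per-document callback (`on_document`, `BATCH_SIZE`, and the flat index are illustrative, not APIs from this PR):

   ```python
   import faiss
   import numpy as np

   d, BATCH_SIZE = 64, 1024
   index = faiss.IndexFlatL2(d)
   buffer = []

   def on_document(vector):
       """Called once per indexed document; flushes a full batch in one native call."""
       buffer.append(vector)
       if len(buffer) >= BATCH_SIZE:
           index.add(np.vstack(buffer))  # one native call for the whole batch
           buffer.clear()

   for _ in range(5000):
       on_document(np.random.random(d).astype("float32"))
   if buffer:
       index.add(np.vstack(buffer))  # flush the tail
   ```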
   
   Also, the "train" call requires all training vectors to be passed at once -- 
so this is another bottleneck (i.e. we need to keep all training vectors in 
memory)
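
   To illustrate the single-call contract (the subsampling here is just a common mitigation, not something this PR does):

   ```python
   import faiss
   import numpy as np

   d, n = 64, 100_000
   xb = np.random.random((n, d)).astype("float32")

   index = faiss.index_factory(d, "IVF256,PQ8")
   sample = xb[np.random.choice(n, 25_600, replace=False)]  # ~100 vectors per centroid
   index.train(sample)  # a single call: the whole sample must be resident in memory
   index.add(xb)
   ```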
   
   > I can try to replicate the performance numbers
   
   Thanks, this would be super helpful!

