kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2621943260
> Maybe it can be just as fast by not reading the floating point vectors on to heap and doing memory segment stuff

Interesting, do we have a Lucene PR that explores it?

> Does FAISS index needs the "flat" vector storage at all? I thought FAISS gave direct access to the vector values based on ordinals?

Faiss does not need (and does not use) the flat vectors at search time, and it does provide access to the underlying vectors based on ordinals -- but these are "reconstructed" vectors and [may be lossy](https://github.com/facebookresearch/faiss/blob/9e03ef0bda4320d05d03570deb0ab14feec1054d/faiss/Index.h#L188) in some indexes (for example PQ). Because of this loss of information, vectors would keep getting more approximate with each merge (where we read back all vectors and create a fresh index) -- which is not desirable.

We could technically store the raw vectors within Faiss (say in another flat index) -- but exposing them via a `FloatVectorValues` would require multiple native "reconstruct" calls. It would be similar storage-wise, so I just went with one of Lucene's flat formats (which also provides finer control over memory-mapping).

> all vectors are buffered onto heap, which is pretty dang expensive

+1 -- I'd like to reduce memory pressure in the future. One thing @msokolov pointed out offline is that we're using the flat format anyway -- we could flush that first and read the vectors back (but this time disk-backed, so we avoid the double copy in memory). I'm not sure if the current APIs allow it, but I'll try to address this.

The lowest memory usage would come from adding vectors one-by-one to the Faiss index and not storing them on heap at all, but I suspect this would hurt indexing performance due to multiple native calls (one per document). We could possibly index vectors in "batches" as a middle ground.

Also, the "train" call requires all training vectors to be passed at once -- so this is another bottleneck (i.e.
we need to keep all training vectors in memory).

> I can try to replicate the performance numbers

Thanks, this would be super helpful!
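To make the lossiness concern concrete, here is a toy sketch in plain NumPy (not Faiss): a crude 8-bit scalar quantizer stands in for a lossy index like PQ, and the `encode`/`reconstruct` helpers are invented for illustration. The first encode/reconstruct round-trip introduces error that can never be recovered, which is why rebuilding an index from reconstructed vectors at every merge is undesirable. (This simple quantizer happens to be idempotent, so the error plateaus after one round; a quantizer like PQ that is re-trained on reconstructed vectors at each merge can drift further.)

```python
import numpy as np

# Toy stand-in for a lossy codec: uniform 8-bit scalar quantization
# over a fixed range. Faiss's real lossy indexes (e.g. PQ) are more
# sophisticated, but share the property that reconstruct(encode(v)) != v.
def encode(v, lo=-1.0, hi=1.0, bits=8):
    levels = (1 << bits) - 1
    return np.round((np.clip(v, lo, hi) - lo) / (hi - lo) * levels).astype(np.uint8)

def reconstruct(codes, lo=-1.0, hi=1.0, bits=8):
    levels = (1 << bits) - 1
    return (codes.astype(np.float32) / levels) * (hi - lo) + lo

rng = np.random.default_rng(0)
original = rng.uniform(-1.0, 1.0, size=(1000, 64)).astype(np.float32)

# Simulate successive merges that read back reconstructed vectors and
# re-index them; error vs. the original vectors never decreases.
vectors = original
errors = []
for merge_round in range(3):
    vectors = reconstruct(encode(vectors))
    errors.append(float(np.max(np.abs(vectors - original))))
```

Storing the raw vectors in a Lucene flat format sidesteps this entirely: merges always re-index from the exact originals.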
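A minimal sketch of the "batch" middle ground mentioned above. Everything here is hypothetical, not the PR's actual API: `native_add` stands in for a JNI/Panama call into Faiss's `Index::add`. The point is that 250 vectors reach the native layer in 3 calls instead of 250, at the cost of only one batch-sized buffer on heap.

```python
import numpy as np

class BatchingIndexer:
    """Accumulates vectors in a fixed-size buffer and flushes each full
    batch with a single (expensive) native call, instead of one call per
    document or buffering every vector in memory."""

    def __init__(self, native_add, dim, batch_size=1024):
        self.native_add = native_add      # hypothetical per-batch native call
        self.buf = np.empty((batch_size, dim), dtype=np.float32)
        self.count = 0                    # vectors currently buffered
        self.calls = 0                    # native calls made so far

    def add(self, vec):
        self.buf[self.count] = vec
        self.count += 1
        if self.count == len(self.buf):
            self.flush()

    def flush(self):
        # Flush any buffered vectors; also called once at the end.
        if self.count > 0:
            self.native_add(self.buf[:self.count])
            self.calls += 1
            self.count = 0

# Toy "native" sink that just records the size of each batch it receives.
received = []
indexer = BatchingIndexer(lambda batch: received.append(len(batch)),
                          dim=8, batch_size=100)
for _ in range(250):
    indexer.add(np.zeros(8, dtype=np.float32))
indexer.flush()
# 250 vectors arrive in 3 native calls (100 + 100 + 50).
```

The batch size would trade heap usage against native-call overhead; it does not help the "train" call, which still needs all training vectors at once.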