kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2621481909
### Description

1. Separate Faiss indexes are maintained per-segment per-field, in line with Lucene's architecture (and the current vector format)
2. Vectors are buffered in memory until flush, copied over to the native process, and added to the Faiss index using a single bulk add
3. The Faiss indexes (one for each field) are concatenated and stored in a single data file `.faissd` at flush time, and the corresponding metadata is stored in a separate file `.faissm`
4. At read time, a temp file is created for each Faiss index (one for each field) based on the offsets stored in the meta file; the index is read into memory from it, and the temp file is deleted afterwards
5. At search time, the query vector is copied over to the native process, a native search is performed, and the results are copied back to Java
6. Currently the ordinal -> doc id mapping is stored in Lucene and looked up at the end (this can be done in Faiss using an `IDMap`, but needs some investigation)

### TODOs

1. RAM and disk usage
   - Faiss is RAM heavy and explicitly loads most indexes into memory (as opposed to the current Lucene implementation, which keeps vectors on disk and reads them via MMAP)
   - The current state of the PR biases towards performance over memory and disk usage (e.g. indexing all docs together instead of in batches, creating temp files on disk instead of using [IO wrappers](https://github.com/facebookresearch/faiss/blob/main/faiss/impl/io.h)) and can be tweaked for a better balance between performance and memory / disk usage
   - It also lacks accurate RAM usage tracking for the Faiss library
2. Live docs as a search-time filter
   - The current state of the PR removes deleted docs as a post-filter instead of considering them during graph search
   - These live docs are generally present as a `BitSet` (with an underlying `long[]`) and could be copied over to Faiss, which supports filtered search (an [`IDSelectorBitmap`](https://github.com/facebookresearch/faiss/blob/9e03ef0bda4320d05d03570deb0ab14feec1054d/faiss/impl/IDSelector.h#L101) may be ideal here, but is not currently exposed via the C API)
   - This would also require storing doc ids directly in Faiss (using an `IDMap`) instead of the ordinals
3. More control over training
   - Some indexes in Faiss (like PQ) require [training](https://github.com/facebookresearch/faiss/blob/9e03ef0bda4320d05d03570deb0ab14feec1054d/faiss/Index.h#L108) before they can be used (to understand the document space and create internal data structures)
   - The current state of the PR simply uses _all_ vectors for training, which favors search-time performance over indexing-time performance, and we may need to expose more configurability here
4. Use more specialised native functions
   - For example, a [native Faiss index merge](https://github.com/facebookresearch/faiss/blob/9e03ef0bda4320d05d03570deb0ab14feec1054d/faiss/Index.h#L304) during Lucene segment merges, but this has its own considerations (like deleted docs, changing doc ids, etc.)
5. Double storage of vectors
   - Some Faiss indexes are unable to [reconstruct](https://github.com/facebookresearch/faiss/blob/9e03ef0bda4320d05d03570deb0ab14feec1054d/faiss/Index.h#L188) full-precision vectors once added
   - This would mean a loss of information with each merge, which is undesirable, so we store the original vectors in Lucene as well
   - These vectors increase disk usage, but not necessarily RAM as long as they are not accessed

### Long-term considerations

- Using a C/C++ shared library makes it difficult to debug and profile native code
- Handled exceptions in Faiss are gracefully rethrown in Lucene, but unhandled signals or bugs (like segmentation faults) cannot be recovered from, and they kill the Java process as well!
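
To illustrate items 3 and 4 of the description (one concatenated data file plus a metadata file of offsets), here is a minimal Java sketch. It is not the PR's code: `FieldBlobPacker`, `Slice`, and the in-memory byte arrays standing in for `.faissd` / `.faissm` are all hypothetical, and the real on-disk encoding is not specified in this comment.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: packs per-field serialized indexes into one blob plus an offset table. */
final class FieldBlobPacker {

  /** Offset and length of one field's serialized index inside the data blob. */
  static final class Slice {
    final long offset;
    final long length;

    Slice(long offset, long length) {
      this.offset = offset;
      this.length = length;
    }
  }

  // Stand-in for the data file: all serialized indexes, back to back.
  private final ByteArrayOutputStream data = new ByteArrayOutputStream();
  // Stand-in for the meta file: field name -> where its index lives in the data blob.
  private final Map<String, Slice> meta = new LinkedHashMap<>();

  /** Write side: appends one field's serialized index and records where it landed. */
  void add(String field, byte[] serializedIndex) {
    meta.put(field, new Slice(data.size(), serializedIndex.length));
    data.write(serializedIndex, 0, serializedIndex.length);
  }

  byte[] data() {
    return data.toByteArray();
  }

  Map<String, Slice> meta() {
    return meta;
  }

  /** Read side: recovers one field's bytes from the data blob using its recorded slice. */
  static byte[] slice(byte[] data, Slice s) {
    int from = Math.toIntExact(s.offset);
    return Arrays.copyOfRange(data, from, from + Math.toIntExact(s.length));
  }
}
```

In the flow described above, the bytes returned by `slice` would be what gets written to a per-field temp file before Faiss loads it; replacing that step with Faiss's IO wrappers is the alternative mentioned under the first TODO.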
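
The post-filter from item 6 of the description and the bitmap idea from the "Live docs as a search-time filter" TODO can be sketched in plain Java as follows. Everything here is illustrative: `LiveDocsSketch` and its methods are hypothetical names, the live docs are modelled as a raw `long[]` rather than a Lucene class, and the byte layout (bit `i` at `bytes[i >> 3] & (1 << (i & 7))`) is my reading of the `IDSelectorBitmap` header comment, so treat it as an assumption to verify.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of live-docs handling; not taken from the PR. */
final class LiveDocsSketch {

  /** Post-filter: maps result ordinals to doc ids and drops hits whose doc is deleted. */
  static List<Integer> postFilter(long[] ordinals, int[] ordToDoc, long[] liveBits) {
    List<Integer> docs = new ArrayList<>();
    for (long ord : ordinals) {
      if (ord < 0) {
        continue; // assumed: a negative label means "no result in this slot"
      }
      int doc = ordToDoc[(int) ord];
      // Bit `doc` lives in word doc / 64, at position doc % 64.
      if ((liveBits[doc >> 6] & (1L << (doc & 63))) != 0) {
        docs.add(doc);
      }
    }
    return docs;
  }

  /** Re-encodes a long[] bitset as a byte bitmap with bit i at bytes[i >> 3] & (1 << (i & 7)). */
  static byte[] toByteBitmap(long[] bits) {
    // Writing each word little-endian keeps bit i of the bitset at bit (i % 8) of byte (i / 8).
    ByteBuffer buf = ByteBuffer.allocate(bits.length * Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    for (long word : bits) {
      buf.putLong(word);
    }
    return buf.array();
  }
}
```

The downside of `postFilter` is that a search for `k` neighbours can return fewer than `k` live docs when some hits are deleted; passing a bitmap into the native search would avoid that, but, as noted in the TODO, it needs doc ids (not ordinals) stored in Faiss and a selector exposed through the C API.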