kaivalnp commented on PR #14178:
URL: https://github.com/apache/lucene/pull/14178#issuecomment-2621481909

   ### Description
   1. Separate Faiss indexes are maintained per-segment per-field, in line with 
Lucene's architecture (and the current vector format)
   2. Vectors are buffered in memory until flush, copied over to the native 
process, and added to the Faiss index using a single bulk add
   3. Different Faiss indexes (one for each field) are concatenated and stored 
in a single data file `.faissd` at flush time, and corresponding metadata 
information is stored in a separate file `.faissm`
   4. At read time, temp files are created for separate Faiss indexes (one for 
each field) based on offsets stored in the meta file, read into memory, and 
then deleted
   5. At search time, the query vector is copied over to the native process, a 
native search is performed, and results are copied back to Java
   6. Currently the ordinal -> doc id mapping is stored in Lucene and looked up 
at the end (can be done in Faiss using an `IDMap`, needs some investigation)
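
As a rough illustration of points 3 and 4, here is a minimal sketch of packing per-field index bytes into one data blob with offsets kept in separate metadata. This is not the PR's actual `.faissd` / `.faissm` layout: `FaissBlobLayout`, `Slice` and `Packed` are hypothetical names, everything stays in memory, and offsets are `int` only to keep the example short (a real file would need `long` offsets).

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: one concatenated data blob plus per-field offset metadata. */
final class FaissBlobLayout {
  /** Where one field's serialized index lives inside the data blob. */
  static final class Slice {
    final int offset;
    final int length;

    Slice(int offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /** Result of a flush: the concatenated bytes and the metadata needed to slice them. */
  static final class Packed {
    final byte[] data;
    final Map<String, Slice> meta;

    Packed(byte[] data, Map<String, Slice> meta) {
      this.data = data;
      this.meta = meta;
    }
  }

  /** Concatenates the blobs in iteration order and records where each one starts. */
  static Packed pack(Map<String, byte[]> blobsByField) {
    ByteArrayOutputStream data = new ByteArrayOutputStream();
    Map<String, Slice> meta = new LinkedHashMap<>();
    for (Map.Entry<String, byte[]> e : blobsByField.entrySet()) {
      byte[] blob = e.getValue();
      meta.put(e.getKey(), new Slice(data.size(), blob.length));
      data.write(blob, 0, blob.length);
    }
    return new Packed(data.toByteArray(), meta);
  }

  /** Recovers one field's blob using only that field's metadata entry. */
  static byte[] unpack(Packed packed, String field) {
    Slice s = packed.meta.get(field);
    if (s == null) {
      throw new IllegalArgumentException("unknown field: " + field);
    }
    return Arrays.copyOfRange(packed.data, s.offset, s.offset + s.length);
  }
}
```

In the PR the slice for each field is written to a temp file so that Faiss can read it; the sketch only shows the offset bookkeeping that makes that slicing possible.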
   
   ### TODOs
   1. RAM and disk usage
        - Faiss is RAM heavy and explicitly loads most indexes into memory (as 
opposed to the current Lucene implementation which keeps vectors on disk, and 
reads them via mmap)
        - The current state of the PR favors performance over memory 
and disk usage (e.g. indexing all docs together instead of in batches, creating temp 
files on disk instead of using [IO 
wrappers](https://github.com/facebookresearch/faiss/blob/main/faiss/impl/io.h)) 
and can be tweaked to better balance performance against memory and disk usage
        - The PR also lacks accurate tracking of the RAM used by the Faiss library
   2. Live docs as a search-time filter
        - The current state of the PR removes deleted docs as a post-filter 
instead of considering them during graph search
        - These live docs are generally present as a `BitSet` (with an 
underlying `long[]`) and could be copied over to Faiss, which supports a 
filtered search (an 
[`IDSelectorBitmap`](https://github.com/facebookresearch/faiss/blob/9e03ef0bda4320d05d03570deb0ab14feec1054d/faiss/impl/IDSelector.h#L101)
 may be ideal here, but is not currently exposed via the C API)
        - This would also need storing doc ids directly in Faiss (using an 
`IDMap`) as opposed to the ordinals
   3. More control over training
        - Some indexes in Faiss (like PQ) require 
[training](https://github.com/facebookresearch/faiss/blob/9e03ef0bda4320d05d03570deb0ab14feec1054d/faiss/Index.h#L108)
 before they can be used (to understand the document space, and create internal 
data structures)
        - The current state of the PR simply uses _all_ vectors for training, 
favoring search-time performance over indexing time, and we may need 
to expose more configurability here
   4. Use more specialised native functions
        - For example a [native Faiss index 
merge](https://github.com/facebookresearch/faiss/blob/9e03ef0bda4320d05d03570deb0ab14feec1054d/faiss/Index.h#L304)
 during Lucene segment merges, but this has its own considerations (like 
deleted docs, changing doc ids, etc)
   5. Double storage of vectors
        - Some Faiss indexes are unable to 
[reconstruct](https://github.com/facebookresearch/faiss/blob/9e03ef0bda4320d05d03570deb0ab14feec1054d/faiss/Index.h#L188)
 full-precision vectors once added
        - This would mean a loss of information with each merge, which is 
undesirable, so we store the original vectors in Lucene as well
        - These vectors would increase disk usage, but not necessarily RAM as 
long as they are not accessed
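
To make the post-filter in TODO 2 concrete, here is a minimal sketch of mapping native result ordinals back to doc ids and dropping deleted docs afterwards. It is an illustration under assumptions, not the PR's code: `LiveDocsPostFilter`, `ordToDoc` and `liveBits` are hypothetical names, live docs are modelled as a raw `long[]` with one bit per doc id, and a negative ordinal is assumed to mean "no result" in a padded result list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: ordinal -> doc id lookup followed by a live-docs post-filter. */
final class LiveDocsPostFilter {
  /** Returns true if the bit for docId is set in a long[]-backed bitset. */
  static boolean isLive(long[] liveBits, int docId) {
    return (liveBits[docId >> 6] & (1L << (docId & 63))) != 0;
  }

  /**
   * Walks ordinals in the order returned by the native search (best first), maps each
   * one to its doc id, skips deleted docs, and keeps at most k hits.
   */
  static List<Integer> postFilter(int[] ordinals, int[] ordToDoc, long[] liveBits, int k) {
    List<Integer> docs = new ArrayList<>();
    for (int ord : ordinals) {
      if (ord < 0) {
        continue; // assumption: negative ordinal marks an empty result slot
      }
      int doc = ordToDoc[ord];
      if (isLive(liveBits, doc)) {
        docs.add(doc);
        if (docs.size() == k) {
          break;
        }
      }
    }
    return docs;
  }
}
```

Because deleted docs are only dropped after the search, the native search has to return more than `k` candidates to still yield `k` live hits, which is the cost a search-time filter inside Faiss would avoid.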
   
   ### Long-term considerations
   Using a C/C++ shared library makes it difficult to debug and profile native 
code.
   Handled exceptions in Faiss are gracefully rethrown in Lucene, but unhandled 
signals or bugs (like segmentation faults) cannot be recovered from, and kill 
the Java process along with them!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

