mikemccand commented on issue #13158: URL: https://github.com/apache/lucene/issues/13158#issuecomment-3648791542
> This saved us 80% of storage space on searchers and also reduced the downloading time from S3.

Just to clarify -- this is 80% smaller storage for just the vectors portion of the index. We (Amazon customer facing product search team -- I work with @Pulkitg64 and @msokolov) still have lots of other things in the Lucene index! Overall top-line reduction I think was ~10%, but that equates to PB (petabytes) of savings each day across the whole fleet! And as more and more vectors, with higher and higher dimensionality, are added to our indices, the vector portion of the index is a larger part, and these savings get bigger over time.

We can only do this because we fully rely on scalar quantized vectors for searching ... e.g. we never do 2nd phase reranking with full precision vectors. If a query wants to retrieve a vector as a return field, we re-hydrate the quantized form back to (lossy due to quantization round trip) full precision. And also because we have physical isolation of indexing and searching, using NRT segment replication (via S3 so we also get full, incremental backups on every commit point) to copy new segments on each commit from indexers to searchers.

As @Pulkitg64 described, our current solution is kinda hackity/messy because the Codec (and therefore `IndexWriter`, `SegmentInfos`, etc.) don't know about these files.

> For example, we could add support to write empty vector files directly from our codec and give users the flexibility to choose whether they want to use full-precision files or not.

+1, I like this approach. Indexing would always write two sets of files (one with all the full precision vectors indexed, another with no vectors which will be tiny files -- just header and footer). Codec would own the inventory of these empty-full-precision files (adding them to `.files()`). And then they would be deleted at the right time since `IndexFileDeleter` would include them in ref counting (when that segment is merged away).
And nothing would open them for reading, by default... they just "lurk"!

> A few questions:

Yeah this is the tricky part! The one thing that needs to know it will open only for reading is the `KnnVectorsReader`, but we have no clean way to pass index-open-time parameters to Codecs I think?

But you're right, we would also need the index to somehow store that it is a read-only index because something along the way dropped the full precision files? Not sure how to do that ... tricky part! It makes the empty-file-writing part seem easy lol.
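
For readers unfamiliar with the "lossy due to quantization round trip" point in the comment, here is a minimal sketch of what re-hydrating a scalar quantized vector looks like. It assumes a simple 8-bit min/max linear quantizer; the class name `ScalarQuantRoundTrip`, its methods, and the fixed `[-1, 1]` range are hypothetical and chosen only for illustration. This is not Lucene's actual quantizer, which may pick its bounds and bit widths differently.

```java
// Illustrative only: a simple min/max linear quantizer, NOT Lucene's implementation.
final class ScalarQuantRoundTrip {

    /** Maps each float, clamped to [min, max], to one of 256 levels stored in a byte. */
    static byte[] quantize(float[] v, float min, float max) {
        float scale = 255f / (max - min);
        byte[] out = new byte[v.length];
        for (int i = 0; i < v.length; i++) {
            float clamped = Math.max(min, Math.min(max, v[i]));
            int level = Math.min(255, Math.round((clamped - min) * scale)); // 0..255
            out[i] = (byte) level; // keep the low 8 bits; read back with & 0xFF
        }
        return out;
    }

    /** "Re-hydrates" the bytes to floats; the result is only an approximation. */
    static float[] dequantize(byte[] q, float min, float max) {
        float step = (max - min) / 255f;
        float[] out = new float[q.length];
        for (int i = 0; i < q.length; i++) {
            out[i] = min + (q[i] & 0xFF) * step;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] original = {-0.73f, 0.0f, 0.118f, 0.999f};
        byte[] q = quantize(original, -1f, 1f);
        float[] restored = dequantize(q, -1f, 1f);
        // Each restored value differs from the original by at most about half a step.
        for (int i = 0; i < original.length; i++) {
            System.out.printf("%.4f -> %.4f%n", original[i], restored[i]);
        }
    }
}
```

In this sketch each dimension is stored in one byte instead of a four-byte float, and the restored value can be off by up to roughly half a quantization step. That is why a vector returned as a field after the full precision files are dropped is an approximation of what was originally indexed.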
