Re: [I] Create a read-only index that drops index files not needed for searching [lucene]

via GitHub Fri, 12 Dec 2025 18:48:40 -0800


mikemccand commented on issue #13158:
URL: https://github.com/apache/lucene/issues/13158#issuecomment-3648791542


   > This saved us 80% of storage space on searchers and also reduced the 
downloading time from S3.
   
   Just to clarify -- this is 80% smaller storage for just the vectors portion 
of the index.  We (Amazon's customer-facing product search team -- I work with 
@Pulkitg64 and @msokolov) still have lots of other things in the Lucene index!  
Overall top-line reduction I think was ~10%, but that equates to PB (petabytes) 
of savings each day across the whole fleet!  And as more and more vectors, with 
higher and higher dimensionality, are added to our indices, the vector portion 
of the index is a larger part, and these savings get bigger over time.
   
   We can only do this because we fully rely on scalar quantized vectors for 
searching ... e.g. we never do 2nd phase reranking with full precision vectors.  
If a query wants to retrieve a vector as a return field, we re-hydrate the 
quantized form back to full precision (lossy, due to the quantization round 
trip).  It also works because we physically isolate indexing from searching, 
using NRT segment replication (via S3, so we also get full, incremental backups 
on every commit point) to copy new segments on each commit from indexers to 
searchers.
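
   To make the "lossy round trip" concrete, here is a minimal sketch of 
quantizing floats to bytes and re-hydrating them.  It is a generic linear 
min/max quantizer written for illustration only: the class name, the fixed 
`[min, max]` bounds and the 7-bit range are assumptions, and Lucene's real 
scalar quantization chooses its bounds and bit width differently.

```java
public class ScalarQuantizationSketch {

    // Linearly map each float from [min, max] onto the integer range [0, 127].
    // Values outside the bounds are clamped first.
    public static byte[] quantize(float[] v, float min, float max) {
        byte[] out = new byte[v.length];
        float scale = 127f / (max - min);
        for (int i = 0; i < v.length; i++) {
            float clamped = Math.max(min, Math.min(max, v[i]));
            out[i] = (byte) Math.round((clamped - min) * scale);
        }
        return out;
    }

    // Re-hydrate: map each quantized value back to a float. The result is only
    // accurate to about half a quantization step, so the round trip is lossy.
    public static float[] dequantize(byte[] q, float min, float max) {
        float[] out = new float[q.length];
        float step = (max - min) / 127f;
        for (int i = 0; i < q.length; i++) {
            out[i] = min + q[i] * step;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] original = {-1.0f, -0.25f, 0.0f, 0.4f, 1.0f};
        byte[] q = quantize(original, -1f, 1f);
        float[] restored = dequantize(q, -1f, 1f);
        for (int i = 0; i < original.length; i++) {
            System.out.printf("%.4f -> %d -> %.4f%n", original[i], q[i], restored[i]);
        }
    }
}
```

   The restored values differ slightly from the originals, which is the 
precision a caller gives up when the full precision files are dropped.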
   
   As @Pulkitg64 described, our current solution is kinda hackity/messy because 
the Codec (and therefore `IndexWriter`, `SegmentInfos`, etc.) doesn't know about 
these files.
   
   > For example, we could add support to write empty vector files directly 
from our codec and give users the flexibility to choose whether they want to 
use full-precision files or not.
   
   +1, I like this approach.
   
   Indexing would always write two sets of files (one with all the full 
precision vectors indexed, another with no vectors which will be tiny files -- 
just header and footer).  Codec would own the inventory of these 
empty-full-precision files (adding them to `.files()`).  And then they would be 
deleted at the right time since `IndexFileDeleter` would include them in ref 
counting (when that segment is merged away).  And nothing would open them for 
reading, by default... they just "lurk"!
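
   The ref-counting idea can be sketched with a toy model.  This is not 
`IndexFileDeleter` itself: the class, its methods and the file names 
(`_0.vec_empty`, `_0.untracked`) are hypothetical, and it only shows why files 
listed in a segment's inventory get cleaned up when the segment is merged away, 
while files the inventory does not know about are left behind.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RefCountSketch {
    private final Map<String, Integer> refCounts = new HashMap<>();
    private final Set<String> onDisk = new HashSet<>();

    // Simulate writing a file to the index directory.
    public void create(String file) {
        onDisk.add(file);
    }

    public boolean exists(String file) {
        return onDisk.contains(file);
    }

    // A commit point starts referencing every file in a segment's inventory.
    public void incRef(Collection<String> files) {
        for (String f : files) {
            refCounts.merge(f, 1, Integer::sum);
        }
    }

    // A commit point stops referencing the files; unreferenced files are deleted.
    public void decRef(Collection<String> files) {
        for (String f : files) {
            int remaining = refCounts.merge(f, -1, Integer::sum);
            if (remaining <= 0) {
                refCounts.remove(f);
                onDisk.remove(f);
            }
        }
    }

    public static void main(String[] args) {
        RefCountSketch deleter = new RefCountSketch();

        // Segment _0: full precision file plus its tiny placeholder, both in the inventory.
        List<String> seg0 = Arrays.asList("_0.vec", "_0.vec_empty");
        for (String f : seg0) deleter.create(f);
        deleter.create("_0.untracked"); // written outside the inventory
        deleter.incRef(seg0);

        // Segment _0 is merged into _1, then the old commit is released.
        List<String> seg1 = Arrays.asList("_1.vec", "_1.vec_empty");
        for (String f : seg1) deleter.create(f);
        deleter.incRef(seg1);
        deleter.decRef(seg0);

        System.out.println("_0.vec_empty exists: " + deleter.exists("_0.vec_empty"));
        System.out.println("_0.untracked exists: " + deleter.exists("_0.untracked"));
        System.out.println("_1.vec_empty exists: " + deleter.exists("_1.vec_empty"));
    }
}
```

   In this model the placeholder for `_0` disappears together with the rest of 
the segment, while the untracked file lingers, which is roughly the cleanup 
problem that owning the inventory in the Codec is meant to avoid.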
   
   > A few questions:
   
   Yeah this is the tricky part!
   
   The one thing that needs to know it will open only for reading is the 
`KnnVectorsReader`, but we have no clean way to pass index-open-time parameters 
to Codecs I think?  But you're right, we would also need the index to somehow 
store that it is a read-only index because something along the way dropped the 
full precision files?  Not sure how to do that ... tricky part!  It makes the 
empty-file-writing part seem easy lol.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

