Re: [I] Create a read-only index that drops index files not needed for searching [lucene]

via GitHub Tue, 21 Oct 2025 13:19:39 -0700


Pulkitg64 commented on issue #13158:
URL: https://github.com/apache/lucene/issues/13158#issuecomment-3429417165


   Coming back to this issue:
   
   ### Summary
   
   We implemented this idea in our closed-source implementation and got very 
good results: the vector index shrank to one-fifth of its original size (the 
80% reduction mentioned in this issue).
   
   ### Background:
   
   At Amazon, we have a decoupled architecture where Lucene writers and searchers 
run on separate machines. Writers create the index and upload it to an S3 
bucket, and searchers use the index after downloading it from the same S3 
bucket.
   
   Writers need the full-precision float vectors for HNSW graph merging, so we 
didn't modify anything on the write side. On searchers, however, only quantized 
vectors take part in vector scoring for HNSW searches. Assuming searchers don't 
need the full-precision vectors for anything else, those vectors just sit idle 
on disk. So, in our closed-source implementation, we dropped the full-precision 
vectors while downloading the checkpoint from S3 (explained in detail below).
   
   ### Implementation:
    
   In our first, naive attempt, we simply removed the full-precision vector 
files (the `vec` and `vemf` files), but this caused the codec to throw a 
`CorruptIndexException`.
   
   Instead, here's what we did: While writing the index to the S3 bucket, 
Lucene writers uploaded additional empty full-precision vector files to S3.
   
   Normally, this is what these files look like:
   
   * vec:
   ```
   * HEADER (Codec Magic, CodecName, Version, Segment ID, Segment-Suffix 
Length, Segment Suffix)
   * Offset: To align the start of the vector data position (a multiple of 
Float.BYTES, i.e. 4 bytes)
   * Data : 
      * Vectors: *Actual Float Vectors*
      * -1 *(To mark end of the data)* 
   * FOOTER (FooterMagic, Checksum)
   ```
   
   * vemf
   
   ```
   * HEADER (Codec Magic, CodecName, Version, Segment ID, Segment-Suffix 
Length, Segment Suffix)
   * Data
      * Field Number
      * Vector Encoding ordinal (Bytes/Float)
      * Similarity Function ordinal (Cosine/Dot-Product etc)
      * Start position of vectors in the .vec file
      * Total length of vectors
      * Vector Dimension
      * Total Vectors
      * Ord to Doc Information
      * -1 to mark end of field infos.
   * FOOTER (FooterMagic, Checksum)
   ```
   
   For our optimization, we created empty files like these:
   
   * vec:
   ```
   * HEADER (Codec Magic, CodecName, Version, Segment ID, Segment-Suffix 
Length, Segment Suffix)
   * Offset: To align the start of the vector data position (a multiple of 
Float.BYTES, i.e. 4 bytes)
   * Data : <<No vector data>>
      * -1 *(To mark end of the data)* 
   * FOOTER (FooterMagic, Checksum)
   ```
   
   * vemf
   
   ```
   * HEADER (Codec Magic, CodecName, Version, Segment ID, Segment-Suffix 
Length, Segment Suffix)
   * Data
      * Field Number
      * Vector Encoding ordinal (Bytes/Float) <<Same as original>>
      * Similarity Function ordinal (Cosine/Dot-Product etc) <<Same as 
original>>
      * Start position of vectors in the .vec file <<Same as original>>
      * Total length of vectors <<Zero in this case>>
      * Vector Dimension <<Same as original>>
      * Total Vectors <<Zero in this case>>
      * Ord to Doc Information <<No document information>>
      * -1 to mark end of field infos.
   * FOOTER (FooterMagic, Checksum)
   ```
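   
   As a rough illustration of the difference between the two `vemf` layouts 
above, the sketch below models one per-field metadata entry as a plain Java 
record and derives the empty variant from it. The record name, its field names, 
and the sample values are hypothetical and only mirror the lists above; this is 
not Lucene's reader or writer code, and it does not produce Lucene's byte format.
   
   ```java
   import java.util.List;

   // Hypothetical model of one per-field entry in the vemf metadata file.
   // Field names follow the list above; they are not Lucene identifiers.
   record FlatVectorFieldMeta(
       int fieldNumber,
       int vectorEncodingOrdinal,
       int similarityFunctionOrdinal,
       long vectorDataOffset,
       long vectorDataLength,
       int dimension,
       int vectorCount,
       List<Integer> ordToDoc) {

     // Builds the placeholder entry: encoding, similarity, start offset and
     // dimension are copied unchanged, while the data length, the vector
     // count and the ord-to-doc mapping are emptied.
     FlatVectorFieldMeta toEmpty() {
       return new FlatVectorFieldMeta(
           fieldNumber,
           vectorEncodingOrdinal,
           similarityFunctionOrdinal,
           vectorDataOffset,
           0L,
           dimension,
           0,
           List.of());
     }
   }

   class EmptyMetaDemo {
     public static void main(String[] args) {
       // Illustrative values only: 3 float vectors of dimension 4 (48 bytes).
       FlatVectorFieldMeta original =
           new FlatVectorFieldMeta(3, 1, 2, 64L, 48L, 4, 3, List.of(0, 1, 2));
       System.out.println(original.toEmpty());
     }
   }
   ```
   
   The point of the sketch is that the placeholder keeps every value a reader 
would use to validate the field against the rest of the segment, and only zeroes 
out the parts that describe the actual vector payload.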
   
   On Lucene searchers, while downloading the index from S3, we skipped the 
original full-precision vector files and downloaded only the empty 
full-precision vector files instead. This saved us 80% of storage space on 
searchers and also reduced the download time from S3.
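   
   The searcher-side selection can be sketched roughly as follows. This is a 
minimal illustration, not our actual implementation: the class name, the 
`.empty` suffix used to name the placeholder objects, and the short file names 
are all assumptions made for the example, and no S3 client calls are shown.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;
   import java.util.Set;

   class SearcherDownloadPlan {
     // Extensions of the full-precision vector files discussed above.
     static final Set<String> FULL_PRECISION_EXTENSIONS = Set.of("vec", "vemf");
     // Hypothetical naming convention for the empty placeholder objects.
     static final String PLACEHOLDER_SUFFIX = ".empty";

     static boolean isFullPrecisionVectorFile(String fileName) {
       int dot = fileName.lastIndexOf('.');
       return dot >= 0
           && FULL_PRECISION_EXTENSIONS.contains(fileName.substring(dot + 1));
     }

     // Maps each checkpoint file to the object the searcher should fetch:
     // the placeholder for full-precision vector files, the file itself
     // otherwise. The placeholder would then be stored locally under the
     // original file name.
     static List<String> plan(List<String> checkpointFiles) {
       List<String> toDownload = new ArrayList<>();
       for (String name : checkpointFiles) {
         toDownload.add(
             isFullPrecisionVectorFile(name) ? name + PLACEHOLDER_SUFFIX : name);
       }
       return toDownload;
     }

     // Fraction of checkpoint bytes that is skipped, ignoring the (small)
     // size of the placeholder files themselves.
     static double savedFraction(Map<String, Long> sizesInBytes) {
       long total = 0;
       long skipped = 0;
       for (Map.Entry<String, Long> e : sizesInBytes.entrySet()) {
         total += e.getValue();
         if (isFullPrecisionVectorFile(e.getKey())) {
           skipped += e.getValue();
         }
       }
       return total == 0 ? 0.0 : (double) skipped / total;
     }

     public static void main(String[] args) {
       // Illustrative file names, not real Lucene segment file names.
       System.out.println(plan(List.of("_0.vec", "_0.vemf", "_0.veq", "_0.vex")));
     }
   }
   ```
   
   Keeping the selection as a pure function over file names makes it easy to 
test separately from the download code, and `savedFraction` gives a quick way to 
estimate the saving for a given checkpoint from its file sizes.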
   
   ### Next Steps:
   
   Based on the above work, I wanted to know what the community thinks about 
this and whether we should implement this in the open-source Lucene repo as 
well. For example, we could add support to write empty vector files directly 
from our codec and give users the flexibility to choose whether they want to 
use full-precision files or not.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

