Pulkitg64 commented on issue #13158:
URL: https://github.com/apache/lucene/issues/13158#issuecomment-3429417165
Coming back to this issue:
### Summary
We implemented this idea in our closed-source implementation and got very good
results: the vector index shrank to one-fifth of its original size (the 80%
reduction mentioned in this issue).
### Background:
At Amazon, we have a decoupled architecture where Lucene writers and searchers
run on separate machines. Writers create the index and upload it to an S3
bucket, and searchers use the index after downloading it from the same S3
bucket.
Writers need the full-precision float vectors for HNSW graph merging, so we
didn't change anything on the write side. Searchers are different: assuming
only quantized vectors take part in vector scoring for HNSW searches, the
full-precision vectors just sit idle on disk once the index has been
downloaded. So, in our closed-source implementation, we tried dropping the
full-precision vectors while downloading the checkpoint from S3 (explained in
detail below).
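As a rough sanity check on the one-fifth figure, here is a minimal sketch of the per-vector arithmetic. It assumes 4-byte float components and 1-byte (int8) quantized components, and it ignores the HNSW graph and any per-vector correction terms; the quantization width is an assumption, not something stated above, and the class and method names are made up for illustration.

```java
// Back-of-the-envelope storage arithmetic (hypothetical helper, not Lucene API).
final class VectorStorageSketch {
    /** Bytes per vector when both the float32 copy and an assumed int8 copy are stored. */
    static long bytesWithFullPrecision(int dim) {
        return (long) dim * Float.BYTES + dim;
    }

    /** Bytes per vector when only the assumed int8 copy is kept. */
    static long bytesQuantizedOnly(int dim) {
        return dim;
    }

    /** Fraction of vector storage saved by dropping the float32 copy. */
    static double reduction(int dim) {
        return 1.0 - (double) bytesQuantizedOnly(dim) / bytesWithFullPrecision(dim);
    }
}
```

Under these assumptions a 768-dimensional vector goes from 3840 bytes to 768 bytes, which is the same one-fifth ratio.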
### Implementation:
In our first, naive attempt, we simply removed the full-precision vector files
(the `.vec` and `.vemf` files), but this caused the codec to throw a
`CorruptIndexException`.
Instead, here's what we did: when writing the index to the S3 bucket, the
Lucene writers also uploaded additional, empty full-precision vector files.
Normally, these files look like this:
* vec:
```
* HEADER (Codec Magic, CodecName, Version, Segment ID, Segment-Suffix Length, Segment Suffix)
* Offset: To adjust start of vector data position (multiples of Float Bytes, i.e. 4 bytes)
* Data:
  * Vectors: *Actual Float Vectors*
* -1 *(To mark end of the data)*
* FOOTER (FooterMagic, Checksum)
```
* vemf:
```
* HEADER (Codec Magic, CodecName, Version, Segment ID, Segment-Suffix Length, Segment Suffix)
* Data:
  * Field Number
  * Vector Encoding ordinal (Bytes/Float)
  * Similarity Function ordinal (Cosine/Dot-Product etc.)
  * Start position of vectors in `.vec` file
  * Total length of vectors
  * Vector Dimension
  * Total Vectors
  * Ord to Doc Information
* -1 to mark end of field infos.
* FOOTER (FooterMagic, Checksum)
```
For our optimization, we created empty files like these:
* vec:
```
* HEADER (Codec Magic, CodecName, Version, Segment ID, Segment-Suffix Length, Segment Suffix)
* Offset: To adjust start of vector data position (multiples of Float Bytes, i.e. 4 bytes)
* Data: <<No vector data>>
* -1 *(To mark end of the data)*
* FOOTER (FooterMagic, Checksum)
```
* vemf:
```
* HEADER (Codec Magic, CodecName, Version, Segment ID, Segment-Suffix Length, Segment Suffix)
* Data:
  * Field Number
  * Vector Encoding ordinal (Bytes/Float) <<Same as original>>
  * Similarity Function ordinal (Cosine/Dot-Product etc.) <<Same as original>>
  * Start position of vectors in `.vec` file <<Same as original>>
  * Total length of vectors <<Zero in this case>>
  * Vector Dimension <<Same as original>>
  * Total Vectors <<Zero in this case>>
  * Ord to Doc Information <<0 document information>>
* -1 to mark end of field infos.
* FOOTER (FooterMagic, Checksum)
```
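To make the empty `.vec` layout concrete, here is a minimal standalone sketch that builds such a byte stream with only the JDK. It follows the layout listed above (header, alignment padding, no vector data, `-1` marker, footer). The magic constants and the header/footer encoding are my understanding of what Lucene's `CodecUtil` writes and should be treated as assumptions; a real implementation would call `CodecUtil.writeIndexHeader` and `CodecUtil.writeFooter` instead. The class name and the codec name used in the usage note are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch: builds the bytes of an "empty" .vec file in memory.
final class EmptyVecFileSketch {
    // Assumed to match the constants used by Lucene's CodecUtil.
    static final int CODEC_MAGIC = 0x3fd76c17;
    static final int FOOTER_MAGIC = ~CODEC_MAGIC;

    static byte[] build(String codecName, int version, byte[] segmentId, String suffix) {
        byte[] name = codecName.getBytes(StandardCharsets.UTF_8);
        byte[] suffixBytes = suffix.getBytes(StandardCharsets.UTF_8);
        if (segmentId.length != 16 || name.length >= 128 || suffixBytes.length >= 128) {
            throw new IllegalArgumentException("unsupported header field length");
        }
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes); // big-endian ints and longs

            // HEADER: magic, codec name, version, segment ID, suffix length, suffix.
            out.writeInt(CODEC_MAGIC);
            out.writeByte(name.length); // one-byte length; only valid for names < 128 bytes
            out.write(name);
            out.writeInt(version);
            out.write(segmentId);
            out.writeByte(suffixBytes.length);
            out.write(suffixBytes);

            // Offset: pad so the vector data would start on a multiple of Float.BYTES.
            while (out.size() % Float.BYTES != 0) {
                out.writeByte(0);
            }

            // Data: no vectors at all, only the end-of-data marker from the layout above.
            out.writeInt(-1);

            // FOOTER: footer magic, checksum algorithm ID (0), CRC32 of all bytes so far.
            out.writeInt(FOOTER_MAGIC);
            out.writeInt(0);
            out.flush();
            CRC32 crc = new CRC32();
            crc.update(bytes.toByteArray());
            out.writeLong(crc.getValue());
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
    }
}
```

For example, `EmptyVecFileSketch.build("SketchVectors", 0, new byte[16], "")` yields a small fixed-size file whose checksum still validates, which is the property that lets the codec open the segment without the real vector data.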
On the Lucene searchers, we skipped the original full-precision vector files
when downloading the index from S3 and downloaded only the empty
full-precision vector files instead. This saved us 80% of the storage space on
searchers and also reduced the download time from S3.
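The searcher-side selection could be sketched roughly as follows. This is only an illustration: the comment does not say how the empty files are named in S3, so the `.empty` suffix, the class name, and the method names below are all hypothetical, and the actual S3 calls are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of choosing which checkpoint files a searcher downloads.
final class SearcherDownloadPlan {
    // Extensions of the full-precision vector data and metadata files.
    static final Set<String> FULL_PRECISION_EXTENSIONS = Set.of("vec", "vemf");
    // Assumed naming convention for the empty placeholder files uploaded by writers.
    static final String PLACEHOLDER_SUFFIX = ".empty";

    /** Returns the remote files to fetch: everything except the original .vec/.vemf files. */
    static List<String> filesToDownload(List<String> checkpointFiles) {
        List<String> plan = new ArrayList<>();
        for (String file : checkpointFiles) {
            if (file.endsWith(PLACEHOLDER_SUFFIX)) {
                plan.add(file); // empty stand-in for a full-precision file
            } else if (!FULL_PRECISION_EXTENSIONS.contains(extension(file))) {
                plan.add(file); // quantized vectors, graph, and all other index files
            }
        }
        return plan;
    }

    /** Local name for a downloaded file: placeholders take the original file's name. */
    static String localName(String remoteFile) {
        return remoteFile.endsWith(PLACEHOLDER_SUFFIX)
            ? remoteFile.substring(0, remoteFile.length() - PLACEHOLDER_SUFFIX.length())
            : remoteFile;
    }

    private static String extension(String file) {
        int dot = file.lastIndexOf('.');
        return dot < 0 ? "" : file.substring(dot + 1);
    }
}
```

Saving each placeholder under the original file name means the searcher's directory listing looks the same to the codec as the writer's, just with empty full-precision files.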
### Next Steps:
Based on the above work, I'd like to know what the community thinks about this
and whether we should implement it in the open-source Lucene repo as well. For
example, we could add support for writing empty vector files directly from the
codec and let users choose whether they want to keep the full-precision files
or not.