Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

via GitHub Fri, 03 Jan 2025 05:47:58 -0800


mikemccand commented on PR #14078:
URL: https://github.com/apache/lucene/pull/14078#issuecomment-2569246977

+1 for proper attribution.

We should give credit where credit is due. The evolution of this PR clearly
began with the RaBitQ paper, as seen in the [opening comment on the original
PR](https://github.com/apache/lucene/pull/13651#issue-2464020838) as well as
[the original
issue](https://github.com/apache/lucene/issues/13650#issue-2463854436).

Specifically for the open source changes proposed here (this pull request
suggesting changes to Lucene's ASL2-licensed source code):

* The CHANGES.txt entry should link to both RaBitQ papers?

* The javadoc for the new `Lucene102BinaryQuantizedVectorsFormat` should
also link to both papers, and describe the provenance (e.g. the algorithm
described by these papers) along with how this implementation differs from the
original papers? We try to do this when a paper inspires changes in Lucene,
e.g. [the algorithm for efficiently building our
FSTs](https://github.com/apache/lucene/blob/204c39f8eb7fb5fd26a3b9ff41ef7d18fae1c844/lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java#L49-L50),
[the paper that inspired our block-tree terms
dictionary](https://github.com/apache/lucene/blob/204c39f8eb7fb5fd26a3b9ff41ef7d18fae1c844/lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java#L55-L57),
and the [HNSW approximate KNN search
algorithm](https://github.com/apache/lucene/blob/204c39f8eb7fb5fd26a3b9ff41ef7d18fae1c844/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java#L32-L35).

Linking to the papers that inspired important changes in Lucene is not only
for proper attribution but also so users have a deep resource they can fall
back on to understand the algorithm, understand how tunable parameters are
expected to behave, etc. It's an important part of the documentation too!
Also, future developers can re-read the paper and study Lucene's implementation
and maybe find bugs / improvement ideas.

For the Elastic-specific artifacts (blog posts, press releases, tweets,
etc.): I would agree that Elastic should also attribute properly, probably with
an edit/update/sorry-about-the-oversight sort of addition? But I no longer
work at Elastic, so this is merely my (external) opinion! Perhaps a
future blog post, from either Elastic or someone else, could correct the mistake
(missed attribution).

Finally, thank you to @gaoj0017 and team for creating RaBitQ and publishing
these papers -- this is an impactful vector quantization algorithm that can
help the many Lucene/OpenSearch/Solr/Elasticsearch users building semantic /
LLM engines these days.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

