Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-11 Thread via GitHub
lpld commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2710872661 @benwtrent @mikemccand I really appreciate your help and quick responses. May I also ask about the selection of datasets being used for the benchmarks? How do you choose them? Why I'm

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-11 Thread via GitHub
mikemccand commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2710460602 > `fanout` makes the search queue when searching the HNSW graph larger. However, the searcher will still only return `k` results. So, searching for top `k=10` with `fanout=20` indicat
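A minimal sketch of the behaviour quoted above, not Lucene's or luceneutil's actual code. It assumes the widened queue holds `k + fanout` candidates, and it uses brute-force scoring in place of a real HNSW graph traversal. The helper name `search_with_fanout` is hypothetical.

```python
import heapq


def search_with_fanout(scores, k, fanout):
    """Collect the best k + fanout candidates, then return only the best k.

    `scores` maps doc id -> similarity (higher is better). This is a
    hypothetical stand-in: a real HNSW search visits only part of the
    graph, which is why a larger queue can improve recall there.
    """
    queue_size = k + fanout  # assumption: fanout is added on top of k
    candidates = heapq.nlargest(queue_size, scores.items(), key=lambda kv: kv[1])
    return candidates[:k]  # the caller still sees only k results
```

For example, with `k=10` and `fanout=20` the sketch keeps 30 candidates internally and returns 10.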

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-11 Thread via GitHub
mikemccand commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2710473036 > > Could you please also share other parameters of your benchmark (ndoc, maxConn, beamWidthIndex, fanout, etc.) > > I have lost my test environment and I regrettably didn't wri

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-11 Thread via GitHub
benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2713454213 Hey @lpld > May I also ask about the selection of datasets being used for the benchmarks? How do you choose them? I haven't tested with SIFT, though be sure to use euclid

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-10 Thread via GitHub
mikemccand commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2710482360 > @lpld quantizing is done per segment, at flush and merge time. So it takes into account live vectors in the segment during flush and merge. > > I don't see why adding/updating

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-10 Thread via GitHub
benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2710314969 @lpld quantizing is done per segment, at flush and merge time. So it takes into account live vectors in the segment during flush and merge. I don't see why adding/updating/deleti

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-09 Thread via GitHub
lpld commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2709087834 @benwtrent A short question again. Is this quantization approach applicable in principle when my data is constantly changing, i.e. new vectors are being added and old vectors removed from

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-07 Thread via GitHub
lpld commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2706558384 @benwtrent This makes sense, thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the s

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-07 Thread via GitHub
benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2706428269 @lpld I agree, both are doing similar things but there are some important distinctions. `oversample` indicates that you are going to return that ratio more results from
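A rough sketch of how `oversample` differs from `fanout`, under stated assumptions: the shortlist size is `ceil(k * oversample)`, the shortlist is gathered with approximate (for example, quantized) scores, and it is then re-scored with exact scores before the top `k` are returned. The re-scoring step and the name `search_with_oversample` are assumptions for illustration; the comment above is truncated and does not confirm them.

```python
import math


def search_with_oversample(approx_scores, exact_scores, k, oversample):
    """Shortlist ceil(k * oversample) docs by approximate score, then rerank.

    Both arguments map doc id -> similarity (higher is better).
    Hypothetical helper, not a Lucene API.
    """
    shortlist_size = math.ceil(k * oversample)
    shortlist = sorted(approx_scores, key=approx_scores.get, reverse=True)
    shortlist = shortlist[:shortlist_size]
    # Assumed second pass: re-score the shortlist with the exact similarity.
    reranked = sorted(shortlist, key=exact_scores.get, reverse=True)
    return reranked[:k]
```

In this sketch a larger `oversample` lets a document with a mediocre approximate score but a high exact score reach the final top `k`.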

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-05 Thread via GitHub
lpld commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2702238939 @benwtrent Thanks for your response, it was quite helpful. Could you please also share other parameters of your benchmark (ndoc, maxConn, beamWidthIndex, fanout, etc.) ? I was able t

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-04 Thread via GitHub
benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2698684914 @lpld here is my Lucene util changes: https://github.com/mikemccand/luceneutil/pull/348 > What exactly do the numbers in the description of this pull request mean? When you say

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-04 Thread via GitHub
lpld commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2698569925 Hi @benwtrent Thanks again for your previous comment. I was able to modify luceneutil and run some benchmarks. I am quite new to lucene, so I would appreciate some help in understan

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-04 Thread via GitHub
benwtrent merged PR #14078: URL: https://github.com/apache/lucene/pull/14078

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-01 Thread via GitHub
gaoj0017 commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2692580673 Thank you for acknowledging that our extended RaBitQ method proposes the idea of exploring different scalar quantization parameters on a per-vector basis for the first time and OSQ adop

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-02-26 Thread via GitHub
lpld commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2684703383 @benwtrent Thanks for your reply!

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-02-25 Thread via GitHub
lpld commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2681113395 @benwtrent Thanks for your reply!

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-02-24 Thread via GitHub
benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2679008833 @gaoj0017 > The OSQ method (introduced in this PR) has its major idea similar to our extended RaBitQ method and our extended RaBitQ method is a prior art which achieves good ac

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-02-24 Thread via GitHub
benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2679022162 > I wonder where can I find the code for the benchmarks that you are mentioning in the description? Thanks! @lpld I patched a version of Lucene util, sort of like this: https://

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-02-24 Thread via GitHub
lpld commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2678033882 Hi @benwtrent, that's an amazing amount of work. I wonder where I can find the code for the benchmarks that you are mentioning in the description? Thanks!

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-02-17 Thread via GitHub
gaoj0017 commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2664639299 As we have consistently emphasized in both public and private communications, we are concerned that the **OSQ method employs an idea highly similar to the one presented in our [extended

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-02-11 Thread via GitHub
tveasey commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2652753715 This pull request relates only to OSQ, and thus the proper scope of discussion is regarding the concerns raised around its attribution. We have pursued multiple conversations and d

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-02-07 Thread via GitHub
gaoj0017 commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2644556568 After Elastic’s last round of replies, the Elastic team reached us for clarification on the issues via zoom meetings. In the meetings, they promised to fix the misattribution, so we sus

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-01-30 Thread via GitHub
github-actions[bot] commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2626006980 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-01-06 Thread via GitHub
tveasey commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2573444675 Just sticking purely to the issues raised regarding this PR and the blog Ben linked explaining the methodology... > Although the RaBitQ approach is conceptually rather different to

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-01-06 Thread via GitHub
ChrisHegarty commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2573298347 In my capacity as the Lucene PMC Chair (and with explicit acknowledgment of my current employment with Elastic, as of the date of this writing), I want to emphasize that proper attr

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-01-06 Thread via GitHub
gaoj0017 commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2573030521 Hi @msokolov , the discussion here is not only about the blog posts but also related to the pull request here. In this pull request (and its related blogs), it claims a new method witho

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-01-03 Thread via GitHub
benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2569565637 To head this off, this implementation is not an evolution of RaBitQ in any way. It's intellectually dishonest to say it's an evolution of RaBitQ. I know that's pedantic, but it's a fac

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-01-03 Thread via GitHub
mikemccand commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2569246977 +1 for proper attribution. We should give credit where credit is due. The evolution of this PR clearly began with the RaBitQ paper, as seen in the [opening comment on the origi

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2024-12-30 Thread via GitHub
msokolov commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2565588002 @gaoj0017 it sounds to me as if your concern is about lack of attribution in the blog post you mentioned, and doesn't really relate to this pull request (code change) - is that accurate

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2024-12-26 Thread via GitHub
gaoj0017 commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2562753433 @benwtrent Thanks for your reply. First, in the blog - [Better Binary Quantization at Elastic and Lucene](https://www.elastic.co/search-labs/blog/better-binary-quantization-lucen

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2024-12-19 Thread via GitHub
benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2555050252 @gaoj0017 Thank you for your feedback! Truly, y'all inspired us on improving scalar quantization. RaBitQ showed that it is possible to achieve 32x reduction while achieving high

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2024-12-18 Thread via GitHub
mayya-sharipova commented on code in PR #14078: URL: https://github.com/apache/lucene/pull/14078#discussion_r1890793535 ## lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsFormat.java: ## @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Soft

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2024-12-18 Thread via GitHub
mayya-sharipova commented on code in PR #14078: URL: https://github.com/apache/lucene/pull/14078#discussion_r1890779090 ## lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsFormat.java: ## @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Soft

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2024-12-18 Thread via GitHub
mayya-sharipova commented on code in PR #14078: URL: https://github.com/apache/lucene/pull/14078#discussion_r1890778000 ## lucene/core/src/java/org/apache/lucene/codecs/lucene102/package-info.java: ## @@ -0,0 +1,436 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2024-12-17 Thread via GitHub
gaoj0017 commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2550510539 Hi @benwtrent , I am the first author of the [RaBitQ paper](https://arxiv.org/abs/2405.12497) and [its extended version](https://arxiv.org/abs/2409.09913). As your team have known, our

[PR] Binary vector format for flat and hnsw vectors [lucene]

2024-12-17 Thread via GitHub
benwtrent opened a new pull request, #14078: URL: https://github.com/apache/lucene/pull/14078 This provides a binary vector format for vectors. The key ideas are: - Centroid centered vectors - Asymmetric quantization - Individually optimized scalar quantization This all
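A toy illustration of the first idea listed above (centroid-centered vectors), followed by a deliberately simplified 1-bit quantization step. This is a sketch under assumptions, not the PR's implementation: thresholding at zero stands in for the individually optimized scalar quantization, asymmetric quantization is not shown, and all function names are hypothetical.

```python
def centroid(vectors):
    """Return the per-dimension mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def center(vector, c):
    """Subtract the centroid so components are spread around zero."""
    return [x - ci for x, ci in zip(vector, c)]


def to_bits(centered):
    """Simplified 1-bit quantization: 1 for positive components, else 0.

    Assumption for illustration only; the PR describes per-vector
    optimized quantization, which this threshold does not implement.
    """
    return [1 if x > 0 else 0 for x in centered]
```

Storing one bit per dimension instead of a 32-bit float is where the 32x size reduction mentioned elsewhere in the thread comes from.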