Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-11-08 Thread via GitHub
benwtrent commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2464679186 @ShashwatShivam I don't think there is a "memory column" provided anywhere. I simply looked at the individual file sizes (veb, vex) and summed their sizes together.
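
The file-size summation described in this comment can be sketched as follows. This is a minimal illustration, not the script used in the PR: the class name `VectorFileSizes` and the helper `sumSizes` are hypothetical, and it assumes the segment files sit directly in the index directory rather than being packed into a compound file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.stream.Stream;

public class VectorFileSizes {

  /** Sums the size in bytes of regular files in indexDir whose extension is in the given set. */
  public static long sumSizes(Path indexDir, Set<String> extensions) throws IOException {
    long total = 0;
    try (Stream<Path> files = Files.list(indexDir)) {
      for (Path p : (Iterable<Path>) files::iterator) {
        String name = p.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Only count files such as _0.veb or _0.vex; skip everything else.
        if (dot >= 0 && extensions.contains(name.substring(dot + 1)) && Files.isRegularFile(p)) {
          total += Files.size(p);
        }
      }
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical usage: pass the index directory as the first argument.
    Path indexDir = Path.of(args.length > 0 ? args[0] : ".");
    System.out.println(sumSizes(indexDir, Set.of("veb", "vex")) + " bytes");
  }
}
```

Files with other extensions, including the raw float vectors, are excluded from the total, which is why this figure can be much smaller than the overall index size.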

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-11-07 Thread via GitHub
ShashwatShivam commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2463415182 @benwtrent makes sense, I wasn't accounting for the fact that the floating vectors are being stored too. I guess I should have instead asked how to reproduce the 'memory required'

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-11-07 Thread via GitHub
benwtrent commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2462723469 @ShashwatShivam why do you think the index size (total size of all the files) should be smaller? We store the binary quantized vectors and the floating point vectors. So, I woul
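
A rough back-of-the-envelope calculation shows why storing both representations makes the total index larger. This is a sketch under stated assumptions: the class `VectorStorageEstimate` and its methods are hypothetical, the codes are assumed to take exactly one bit per dimension, and any per-vector correction values, the HNSW graph, and metadata are ignored.

```java
public class VectorStorageEstimate {

  /** Bytes for raw float32 vectors: 4 bytes per dimension. */
  public static long floatBytes(long numVectors, int dims) {
    return numVectors * dims * Float.BYTES;
  }

  /** Bytes for 1-bit-per-dimension codes, rounded up to whole bytes per vector. */
  public static long binaryBytes(long numVectors, int dims) {
    return numVectors * ((dims + 7) / 8);
  }

  public static void main(String[] args) {
    long n = 1_000_000L; // illustrative corpus size, not taken from the thread
    int dims = 768;      // illustrative dimension count, not taken from the thread
    long raw = floatBytes(n, dims);
    long quantized = binaryBytes(n, dims);
    System.out.println("float32 vectors: " + raw + " bytes");
    System.out.println("binary codes:    " + quantized + " bytes");
    System.out.println("both stored:     " + (raw + quantized) + " bytes");
  }
}
```

Under these assumptions the binary codes are 32 times smaller than the float vectors, but because the floats are kept as well, the on-disk total grows rather than shrinks.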

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-11-07 Thread via GitHub
ShashwatShivam commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2462601593 @benwtrent thanks for the link to the testing script, it works! One question: the index size it reports is larger than the HNSW index size. For example, I was working with a C

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-11-05 Thread via GitHub
benwtrent commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2457031733 Hey @ShashwatShivam https://github.com/mikemccand/luceneutil/compare/main...benwtrent:luceneutil:bbq that is the testing script I use. But if Lucene has since been update

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-11-05 Thread via GitHub
ShashwatShivam commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2456984008 Hi Ben, I'm trying to get a benchmark run for RaBitQ using luceneutil (https://github.com/mikemccand/luceneutil), but I'm facing a missing-files issue - java.lang.NoClassDefFou

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-10-30 Thread via GitHub
mayya-sharipova commented on code in PR #13651: URL: https://github.com/apache/lucene/pull/13651#discussion_r1823251497 ## lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101BinaryQuantizedVectorsFormat.java: ## @@ -0,0 +1,125 @@ +/* + * Licensed to the Apache Soft

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-10-21 Thread via GitHub
benwtrent commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2427907718 Here is some Lucene Util benchmarking. Some of these numbers actually contradict some of my previous benchmarking for int4, which is frustrating. I wonder what I did wrong then or now.

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-10-18 Thread via GitHub
benwtrent commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2423186168 I will open a PR against Lucene Util to update it to utilize these formats and show y'all some runs with it soon. But the PR is ready for general review.

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-10-16 Thread via GitHub
benwtrent commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2417284114 I am currently working on moving this to Lucene101 format with the bug fixes we discovered in additional testing. -- This is an automated message from the Apache Git Service. To res

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-09-18 Thread via GitHub
benwtrent commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2359272883 Here are some more flat index test results. This was to exercise and see how the number of coarse-grained centroids changes recall & speed. | Lucene912BinaryQuantizedVectorsForma

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-09-17 Thread via GitHub
benwtrent commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2356359326 @ShashwatShivam so, the flat codec version is sneaky: depending on when you cloned the repo, it might not be doing anything. Lucene by default will return nothing for approx

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-09-17 Thread via GitHub
ShashwatShivam commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2356278997 Following up on the above comment by tanyaroosta, the dataset I was using for benchmarking RaBitQ through Luceneutil (main branch) was Amazon's ASIN and query embeddings (which ar

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-09-17 Thread via GitHub
benwtrent commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2356243291 @tanyaroosta we are still doing larger scale testing, but if you want to test with LuceneUtil, here is the branch I am using: https://github.com/mikemccand/luceneutil/compare/main...be

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-09-17 Thread via GitHub
tanyaroosta commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2356189954 @benwtrent we are trying to run tests with the RaBitQ Lucene implementation, and are not able to replicate the numbers reported in the paper. Have you run tests as part of the imple

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-09-03 Thread via GitHub
john-wagster commented on code in PR #13651: URL: https://github.com/apache/lucene/pull/13651#discussion_r1742666476 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/BinarizedByteVectorValues.java: ## @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-09-03 Thread via GitHub
benwtrent commented on code in PR #13651: URL: https://github.com/apache/lucene/pull/13651#discussion_r1742505038 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/BinarizedByteVectorValues.java: ## @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (AS

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-08-26 Thread via GitHub
benwtrent commented on code in PR #13651: URL: https://github.com/apache/lucene/pull/13651#discussion_r1731850236 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/Lucene912BinaryFlatVectorsScorer.java: ## @@ -0,0 +1,317 @@ +/* + * Licensed to the Apache Software Founda

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-08-26 Thread via GitHub
benwtrent commented on code in PR #13651: URL: https://github.com/apache/lucene/pull/13651#discussion_r1731849841 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/Lucene912BinaryFlatVectorsScorer.java: ## @@ -0,0 +1,317 @@ +/* + * Licensed to the Apache Software Founda

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-08-26 Thread via GitHub
mayya-sharipova commented on code in PR #13651: URL: https://github.com/apache/lucene/pull/13651#discussion_r1731836389 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/Lucene912BinaryFlatVectorsScorer.java: ## @@ -0,0 +1,317 @@ +/* + * Licensed to the Apache Software

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-08-26 Thread via GitHub
mayya-sharipova commented on code in PR #13651: URL: https://github.com/apache/lucene/pull/13651#discussion_r1731835612 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/Lucene912BinaryFlatVectorsScorer.java: ## @@ -0,0 +1,317 @@ +/* + * Licensed to the Apache Software

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-08-26 Thread via GitHub
rmuir commented on code in PR #13651: URL: https://github.com/apache/lucene/pull/13651#discussion_r1731609986 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ## @@ -761,4 +763,81 @@ private static int squareDistanceBody128(MemoryS

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-08-26 Thread via GitHub
ChrisHegarty commented on code in PR #13651: URL: https://github.com/apache/lucene/pull/13651#discussion_r1731499127 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ## @@ -761,4 +763,81 @@ private static int squareDistanceBody128(

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-08-26 Thread via GitHub
ChrisHegarty commented on code in PR #13651: URL: https://github.com/apache/lucene/pull/13651#discussion_r1731488489 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ## @@ -761,4 +763,81 @@ private static int squareDistanceBody128(

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-08-26 Thread via GitHub
rmuir commented on code in PR #13651: URL: https://github.com/apache/lucene/pull/13651#discussion_r1731175737 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ## @@ -761,4 +763,81 @@ private static int squareDistanceBody128(MemoryS

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-08-26 Thread via GitHub
rmuir commented on code in PR #13651: URL: https://github.com/apache/lucene/pull/13651#discussion_r1731174206 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ## @@ -761,4 +763,81 @@ private static int squareDistanceBody128(MemoryS

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-08-21 Thread via GitHub
benwtrent commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2302857403 100MB assumes that even when compressed, it's a single byte per centroid. 100M vectors might only have 2 centroids and thus only need two bits to store. Also, I would expect the
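
To make the storage argument concrete, here is a small sketch of how the size of a vectorOrd -> centroidOrd mapping scales with the number of centroids. It assumes fixed-width bit packing at the minimum width that can represent every ordinal, which is an assumption of this example and may differ from the width the comment has in mind; the class `CentroidOrdinalStorage` and its methods are hypothetical.

```java
public class CentroidOrdinalStorage {

  /** Minimum bits needed to represent ordinals 0..numCentroids-1 (never less than 1). */
  public static int bitsPerOrdinal(int numCentroids) {
    if (numCentroids < 1) {
      throw new IllegalArgumentException("numCentroids must be >= 1");
    }
    return Math.max(1, 32 - Integer.numberOfLeadingZeros(numCentroids - 1));
  }

  /** Total bytes for a bit-packed mapping of numVectors ordinals, rounded up to a whole byte. */
  public static long packedBytes(long numVectors, int numCentroids) {
    long bits = numVectors * bitsPerOrdinal(numCentroids);
    return (bits + 7) / 8;
  }

  public static void main(String[] args) {
    long numVectors = 100_000_000L;
    // 256 centroids need a full byte per vector, which is the 100MB case.
    System.out.println("256 centroids: " + packedBytes(numVectors, 256) + " bytes");
    // With only 2 centroids the packed mapping is far smaller.
    System.out.println("2 centroids:   " + packedBytes(numVectors, 2) + " bytes");
  }
}
```

The point of the sketch is only that the mapping cost depends on the centroid count, so a flat one-byte-per-vector estimate is an upper bound for up to 256 centroids.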

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-08-21 Thread via GitHub
mayya-sharipova commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2302685585 > possibly switch to LongValues for storing vectorOrd -> centroidOrd mapping I was thinking about adding centroids mappings as LongValues at the end of meta file, but this

[PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-08-13 Thread via GitHub
benwtrent opened a new pull request, #13651: URL: https://github.com/apache/lucene/pull/13651 # Not only a draft, but a very rough one indeed Not opening for the sake of review, but just openness and for those curious about the work. # High-level design RaBitQ is basicall