Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-04-05 Thread via GitHub
benwtrent commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2744242199 > do you confirm that, according to your knowledge, any relevant and active work toward multi-valued vectors in Lucene is effectively aggregated here? @alessandrobenedetti I thin

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-04-04 Thread via GitHub
vigyasharma commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2744585265 Re: using long for graph node ids, I can see how using int ordinals can be limiting for the number of vectors we can index per segment. However, adapting to long node ids is also a non-

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-25 Thread via GitHub
vigyasharma commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2751872315 > Another option I was pondering is adding a new field type dedicated to multi-valued vectors. I tried this in my first stab at this issue (https://github.com/apache/lucene/pu

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-25 Thread via GitHub
alessandrobenedetti commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2751006045 > > do you confirm that, according to your knowledge, any relevant and active work toward multi-valued vectors in Lucene is effectively aggregated here? > > @alessandr

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-23 Thread via GitHub
alessandrobenedetti commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2743201362 @vigyasharma, from a first superficial pass, I see that this PR touches similar points of my original outdated one: https://github.com/apache/lucene/pull/12314, but it see

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-21 Thread via GitHub
vigyasharma commented on code in PR #14173: URL: https://github.com/apache/lucene/pull/14173#discussion_r2008411867 ## lucene/core/src/java/org/apache/lucene/util/hnsw/UpdatableScoreHeap.java: ## Review Comment: I'd like to keep the logic to update scores for already ingest

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-21 Thread via GitHub
vigyasharma commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2744562872 Thanks for looking into this PR @alessandrobenedetti, this is the latest iteration on multi-vector support. It does build on the same central idea of assigning a unique ordina

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-21 Thread via GitHub
alessandrobenedetti commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2743148001 Catching up on this and trying to understand how far we are now from my original idea and implementation: https://github.com/apache/lucene/pull/12314 Obviously, my c

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-21 Thread via GitHub
alessandrobenedetti commented on code in PR #14173: URL: https://github.com/apache/lucene/pull/14173#discussion_r2007476642 ## lucene/core/src/java/org/apache/lucene/util/hnsw/UpdatableScoreHeap.java: ## Review Comment: For example, what are the benefits of this in comparis

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-02-10 Thread via GitHub
benwtrent commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2647965600 > I meant that since we'd be writing a new implementations for buildGraph etc, merging etc, it might be easier to account for long nodeIds from the get go Ah, I understand and I

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-02-09 Thread via GitHub
vigyasharma commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2646869747 > I don't understand how DiskANN would solve any of the previously expressed problems. No it wouldn't solve any of these problems. I meant that since we'd be writing a new imp

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-02-04 Thread via GitHub
benwtrent commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2634182175 > Java limits the size of arrays (and lists) to 'int max' and does not allow 'long' array indices. These will need to be changed to use a different data structure. Yeah, I don't

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-02-01 Thread via GitHub
vigyasharma commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2629167044 > I think this PR is still doing globally unique ordinals for vectors? So, vectors 1, 2, 3 go to document 1 and ordinals 4, 5 go to doc 2? If so, I think we should "bite the bullet"
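
The globally unique ordinal scheme quoted above, where each document owns a run of consecutive vector ordinals, can be illustrated with a minimal Java sketch. This is not code from the PR: the class `OrdinalToDocSketch` and its methods are hypothetical names chosen for illustration, ordinals and doc ids are 0-based here (the quoted example counts from 1), and the prefix-sum layout is only one possible way to map an ordinal back to its document.

```java
public class OrdinalToDocSketch {

    // Hypothetical helper: starts[doc] is the first ordinal owned by doc,
    // and starts[numDocs] is the total number of vectors.
    public static int[] firstOrdinals(int[] vectorsPerDoc) {
        int[] starts = new int[vectorsPerDoc.length + 1];
        for (int doc = 0; doc < vectorsPerDoc.length; doc++) {
            // addExact throws ArithmeticException once the total no longer fits
            // in an int, which is the per-segment limit debated in the thread.
            starts[doc + 1] = Math.addExact(starts[doc], vectorsPerDoc[doc]);
        }
        return starts;
    }

    // Returns the document that owns the given vector ordinal: the largest doc
    // whose first ordinal is <= ordinal. Documents with zero vectors are skipped.
    public static int docForOrdinal(int[] starts, int ordinal) {
        int numDocs = starts.length - 1;
        if (ordinal < 0 || ordinal >= starts[numDocs]) {
            throw new IllegalArgumentException("ordinal out of range: " + ordinal);
        }
        int lo = 0;
        int hi = numDocs - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1; // round up so the loop always makes progress
            if (starts[mid] <= ordinal) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Two documents: the first holds 3 vectors, the second holds 2.
        int[] starts = firstOrdinals(new int[] {3, 2});
        for (int ord = 0; ord < 5; ord++) {
            System.out.println("ordinal " + ord + " -> doc " + docForOrdinal(starts, ord));
        }
    }
}
```

Because every vector consumes one `int` ordinal, the total vector count per segment is capped at `Integer.MAX_VALUE` in this layout, which is the motivation for the long node id discussion elsewhere in the thread.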

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-01-31 Thread via GitHub
vigyasharma commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2627799689 > I also don't understand the recall change between parentJoin on main vs. parentJoin in your branch. The parentJoin on my branch runs with merges disabled, and loads the extr

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-01-30 Thread via GitHub
benwtrent commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2624759420 I like where this PR is going. > Note: This change does not include dependent multi-valued vectors like ColBERT, where the multiple vectors must be used together to compute similari

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-01-29 Thread via GitHub
benwtrent commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2621652581 > For parentJoin benchmark run on main, there is a visible drop in recall when I disable merges (as compared to a main branch run with merges enabled). Is this expected? I wonde

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-01-28 Thread via GitHub
vigyasharma commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2620016142 Ran some early benchmarks to compare this flat storage based multi-vector approach with the existing parent-join approach. I would appreciate any feedback on the approach, benchmark

[PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-01-27 Thread via GitHub
vigyasharma opened a new pull request, #14173: URL: https://github.com/apache/lucene/pull/14173 Another take at #12313 The following PR adds support for _independent_ multi-vectors, i.e. scenarios where a single document is represented by multiple independent vector values. The most