Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2025-02-11 Thread via GitHub
github-actions[bot] commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2652354185 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2025-01-28 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2620021231 I pivoted to an approach that handles independent multi-vectors within flat storage, instead of requiring index time parent-block joins. Have raised a draft PR here – #14173 -

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-12-04 Thread via GitHub
github-actions[bot] commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2518829338 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-11-20 Thread via GitHub
krickert commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2489546204 > Chunk-Based Highlighting – Interesting. With getAllVectorValues(), we can find all vector values with similarity above a separate sim-threshold for highlights? Not sure. But i

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-11-20 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2489342443 Thank you for sharing these use-cases @krickert ! 1. **Aggregate Scoring** – I think we can do this today by joining the child doc hits with their parents and calculating score

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-11-20 Thread via GitHub
krickert commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2488410934 > And we can use getAllVectorValues() for scoring with max or avg of all vectors in the doc at query time. Your proposal to implement `getAllVectorValues()` for scoring documents
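
A minimal sketch of the "max or avg of all vectors in the doc" scoring idea quoted above. It does not use Lucene's API: the class `MultiVectorAggregate`, its methods, and the choice of dot product as the per-vector similarity are all assumptions for illustration, and the document's vectors are passed in as a plain list, standing in for whatever `getAllVectorValues()` would return.

```java
import java.util.List;

// Hypothetical helper, not part of Lucene: aggregates per-vector similarities
// for one document that carries several vectors.
final class MultiVectorAggregate {

  /** Plain dot product between two equal-length vectors (assumed similarity). */
  static float dot(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Score of the best-matching vector in the document. */
  static float maxScore(float[] query, List<float[]> docVectors) {
    if (docVectors.isEmpty()) {
      throw new IllegalArgumentException("document has no vectors");
    }
    float best = Float.NEGATIVE_INFINITY;
    for (float[] v : docVectors) {
      best = Math.max(best, dot(query, v));
    }
    return best;
  }

  /** Mean score over all vectors in the document. */
  static float avgScore(float[] query, List<float[]> docVectors) {
    if (docVectors.isEmpty()) {
      throw new IllegalArgumentException("document has no vectors");
    }
    float total = 0f;
    for (float[] v : docVectors) {
      total += dot(query, v);
    }
    return total / docVectors.size();
  }

  public static void main(String[] args) {
    float[] query = {1f, 0f};
    List<float[]> doc =
        List.of(new float[] {1f, 0f}, new float[] {0f, 1f}, new float[] {0.5f, 0.5f});
    System.out.println("max = " + maxScore(query, doc));
    System.out.println("avg = " + avgScore(query, doc));
  }
}
```

Max rewards a document for its single best chunk, while avg penalizes documents whose other chunks are off-topic; which one fits depends on the use case discussed in the thread.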

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-11-20 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2487597269 _...contd. from above – thoughts on supporting independent multi-vectors specified via `NONE` multi-vector aggregation..._ __ The `Knn{Float|Byte}Vector` fields will accept

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-11-19 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2487589088 > My concern is that this proposal doesn’t truly add support for independent multi-vectors. That's a valid concern. I've been thinking about a more comprehensive multi-vector

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-11-15 Thread via GitHub
jimczi commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2478490713 > One reason to not add this would be if it makes the single vector setup hard to evolve. I'd like to understand if (and how) this is happening, and think on how we can address those conc

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-11-09 Thread via GitHub
krickert commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2466210357 > My current thinking is that this is a rapidly evolving field, and it's early to lean one way or another. Adding this support unlocks experimentation. Amen! This ends up

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-11-08 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2466073115 I tried to find some blogs and benchmarks on other library implementations. Astra DB, Vespa, Faiss, and nmslib all seem to support multi-vectors in some form. From what I can

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-11-08 Thread via GitHub
krickert commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2464836106 I would love to see a single knn field that supports multiple vectors. Right now, using embedded docs or a child collection to handle these use cases feels a little too

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-11-05 Thread via GitHub
benwtrent commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2457990323 > One use-case for multi-vectors is indexing product aspects as separate embeddings for e-commerce search. At Amazon Product Search (where I work), we'd like to experiment with separat

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-11-04 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2455892813 One use-case for multi-vectors is indexing product aspects as separate embeddings for e-commerce search. At Amazon Product Search (where I work), we'd like to experiment with separat

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-10-29 Thread via GitHub
jimczi commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2445295372 The more I think about it, the less I feel like the knn codec is the best choice for this feature (assuming that this issue is focused on late interaction models). > It is possible

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-10-29 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2444990835 As mentioned earlier, here is my rough plan for splitting this change into smaller PRs. Some of these steps could be merged if the impl. warrants it: 1. Multi-Vector similarity

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-10-29 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2444982753 > Maybe the first goal should be to incorporate max sim for re-ranking use cases first using a flat format This could be set up using 1) a single-vector field for hnsw matching,
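
A rough sketch of max-sim re-ranking as mentioned in the comment above, under stated assumptions: candidates are taken to come from some first-pass match (for example a single-vector HNSW search), and each candidate's multi-vector is available as a `float[][]`. The class `LateInteractionRerank`, its method names, and the use of dot product are hypothetical and are not the PR's or Lucene's API. The score is the usual late-interaction sum: for each query vector, take its best match among the document's vectors, then add those maxima up.

```java
import java.util.Arrays;

// Hypothetical re-ranker, not part of Lucene.
final class LateInteractionRerank {

  /** Plain dot product between two equal-length vectors (assumed similarity). */
  static float dot(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Sum over query vectors of the max similarity against any doc vector. */
  static float sumMaxSim(float[][] query, float[][] doc) {
    if (doc.length == 0) {
      throw new IllegalArgumentException("document has no vectors");
    }
    float total = 0f;
    for (float[] q : query) {
      float best = Float.NEGATIVE_INFINITY;
      for (float[] d : doc) {
        best = Math.max(best, dot(q, d));
      }
      total += best;
    }
    return total;
  }

  /** Returns candidate indices ordered by descending sum-of-max-sim score. */
  static int[] rerank(float[][] query, float[][][] candidates) {
    float[] scores = new float[candidates.length];
    Integer[] order = new Integer[candidates.length];
    for (int i = 0; i < candidates.length; i++) {
      scores[i] = sumMaxSim(query, candidates[i]);
      order[i] = i;
    }
    // Stable sort, so equal scores keep their first-pass order.
    Arrays.sort(order, (a, b) -> Float.compare(scores[b], scores[a]));
    int[] result = new int[order.length];
    for (int i = 0; i < order.length; i++) {
      result[i] = order[i];
    }
    return result;
  }

  public static void main(String[] args) {
    float[][] query = {{1f, 0f}, {0f, 1f}};
    float[][][] candidates = {
      {{1f, 0f}},
      {{1f, 0f}, {0f, 1f}},
      {{0.5f, 0f}}
    };
    System.out.println(Arrays.toString(rerank(query, candidates)));
  }
}
```

Because the re-ranker only needs random access to each candidate's vectors, it would not require any change to graph construction, which is the appeal of starting with a flat format.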

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-10-29 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2444980147 Hi @jimczi , The main change in this PR is support for multi-vectors in flat readers and writers, along with a similarity spec for multiple vector values. It is possible that H

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-10-28 Thread via GitHub
jimczi commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2441247130 > it seems like single vector is a special form of multi-vector The solution really depends on the semantics. In its current form, the way multi-vectors are incorporated in this PR

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-10-27 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2439879963 > it seems like single vector is a special form of multi-vector re: single vs. multi-vectors, I think it makes sense to not force users to choose multi-valued fields upfront. T

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-10-26 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2439876776 Thanks @benwtrent. I've been working on getting a multi-vector benchmark running to wire this end to end. Found some pesky bugs and oversights. I'm planning to split this feature int

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-10-25 Thread via GitHub
benwtrent commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2438671673 Hey @vigyasharma there is a lot of good work here. I am going to shift my focus and see about how I can help here more fully. What are the next steps? I am guessing handl

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-09-26 Thread via GitHub
github-actions[bot] commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2378169293 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-09-12 Thread via GitHub
vigyasharma commented on code in PR #13525: URL: https://github.com/apache/lucene/pull/13525#discussion_r1757395276 ## lucene/core/src/java/org/apache/lucene/index/MultiVectorSimilarityFunction.java: ## @@ -0,0 +1,203 @@ +/* + * Licensed to the Apache Software Foundation (ASF) u

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-09-12 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2346995734 > Is "default run" from this PR? No. "default run" is knn search where each embedding is a separate document with no relationship between them. I'm still wiring things up to se

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-09-11 Thread via GitHub
benwtrent commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2344312427 @msokolov I saw recently you were working on a major refactor where we just make every vector access random access. I think this might make the changes in this PR simpler as we won't h

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-09-11 Thread via GitHub
benwtrent commented on code in PR #13525: URL: https://github.com/apache/lucene/pull/13525#discussion_r1755216443 ## lucene/core/src/java/org/apache/lucene/index/MultiVectorSimilarityFunction.java: ## @@ -0,0 +1,203 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-09-03 Thread via GitHub
github-actions[bot] commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2327674839 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-08-19 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2297380564 > This PR has not had activity in the past 2 weeks, labeling it as stale... Just to update on some activity here: I'm working on parent block join benchmarks in `luceneutil`

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-08-08 Thread via GitHub
github-actions[bot] commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2276933809 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-25 Thread via GitHub
benwtrent commented on code in PR #13525: URL: https://github.com/apache/lucene/pull/13525#discussion_r1691432866 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatMultiVectorsFormat.java: ## @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundatio

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-25 Thread via GitHub
benwtrent commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2250298577 > I think it's awesome to invest in our benchmarking tooling to be able to test different approaches for multi-valued vectors, but, I don't think that should be a blocker to merging th

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-25 Thread via GitHub
mikemccand commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2250217734 I think it's awesome to invest in our benchmarking tooling to be able to test different approaches for multi-valued vectors, but, I don't think that should be a blocker to merging thi

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-25 Thread via GitHub
mikemccand commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2250212492 > I started adding support for ParentJoin benchmarks ([issue](https://github.com/mikemccand/luceneutil/issues/284)). Will raise it in multiple small PRs, here's the [first one](https

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-25 Thread via GitHub
mikemccand commented on code in PR #13525: URL: https://github.com/apache/lucene/pull/13525#discussion_r1691360151 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatTensorsFormat.java: ## @@ -76,6 +76,7 @@ * * @lucene.experimental */ +// no commit Revi

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-24 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2248625418 I started adding support for ParentJoin benchmarks ([issue](https://github.com/mikemccand/luceneutil/issues/284)). Will raise it in multiple small PRs, here's the [first one](https:

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-23 Thread via GitHub
benwtrent commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2244972179 @vigyasharma > do we have any existing benchmarks for ParentJoin queries in knn? No, we do not. I ended up writing a bunch of throw away code to benchmark latency and rec

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-22 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2244027209 > Cohere's wikipedia embeddings all indicate their parent page. So, I wonder how this would work on finding the nearest page given the `maxsim(passage)` vs. using the Lucene join log

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-18 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2237194980 > The pattern doesn't work well with ColBERT-esque models. +1. Good question, @navneet1v. I had the same doubts before starting this effort. There is some discussion in [1231

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-16 Thread via GitHub
benwtrent commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2232052499 @navneet1v The pattern doesn't work well with ColBERT-esque models.

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-16 Thread via GitHub
navneet1v commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2231959095 @vigyasharma is there a reason for adding multi-vector field support rather than using the parent-child relationship of the documents to fulfill this use case?

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-12 Thread via GitHub
cpoerschke commented on code in PR #13525: URL: https://github.com/apache/lucene/pull/13525#discussion_r1675739599 ## lucene/core/src/java/org/apache/lucene/index/FieldInfos.java: ## @@ -452,7 +465,8 @@ synchronized int addOrGet(FieldInfo fi) { new FieldVectorPr

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-12 Thread via GitHub
cpoerschke commented on code in PR #13525: URL: https://github.com/apache/lucene/pull/13525#discussion_r1675737097 ## lucene/core/src/java/org/apache/lucene/index/IndexingChain.java: ## @@ -1527,15 +1549,20 @@ void setPoints(int dimensionCount, int indexDimensionCount, int numB

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-12 Thread via GitHub
cpoerschke commented on code in PR #13525: URL: https://github.com/apache/lucene/pull/13525#discussion_r1675735652 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatMultiVectorsWriter.java: ## @@ -0,0 +1,824 @@ +/* + * Licensed to the Apache Software Foundati

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-12 Thread via GitHub
cpoerschke commented on code in PR #13525: URL: https://github.com/apache/lucene/pull/13525#discussion_r1675724584 ## lucene/core/src/java/org/apache/lucene/index/FieldInfo.java: ## @@ -92,6 +97,8 @@ public FieldInfo( int vectorDimension, VectorEncoding vectorEncod