github-actions[bot] commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2652354185
This PR has not had activity in the past 2 weeks, labeling it as stale. If
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you
for your contributi
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2620021231
I pivoted to an approach that handles independent multi-vectors within flat
storage, instead of requiring index time parent-block joins. Have raised a
draft PR here – #14173
github-actions[bot] commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2518829338
This PR has not had activity in the past 2 weeks, labeling it as stale. If
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you
for your contributi
krickert commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2489546204
> Chunk-Based Highlighting – Interesting. With getAllVectorValues(), we can
find all vector values with similarity above a separate sim-threshold for
highlights?
Not sure. But i
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2489342443
Thank you for sharing these use-cases @krickert !
1. **Aggregate Scoring** – I think we can do this today by joining the child
doc hits with their parents and calculating score
krickert commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2488410934
> And we can use getAllVectorValues() for scoring with max or avg of all
vectors in the doc at query time.
Your proposal to implement `getAllVectorValues()` for scoring documents
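
As a rough illustration of the "max or avg of all vectors in the doc" idea quoted above, here is a minimal sketch. It assumes the document's vectors have already been loaded into a plain `List<float[]>` and that dot product is the similarity. The class and method names are hypothetical and are not part of the PR or of any Lucene API.

```java
import java.util.List;

// Hypothetical helper, not a Lucene class: aggregates per-vector similarities
// for one document against a single query vector.
public class AggregateScoreSketch {

  // Plain dot product; assumes both arrays have the same length.
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Score the document by its best-matching vector.
  static float maxScore(float[] query, List<float[]> docVectors) {
    float best = Float.NEGATIVE_INFINITY;
    for (float[] v : docVectors) {
      best = Math.max(best, dot(query, v));
    }
    return best;
  }

  // Score the document by the mean similarity over all of its vectors.
  // Returns NaN for an empty list; callers are assumed to pass at least one vector.
  static float avgScore(float[] query, List<float[]> docVectors) {
    float total = 0f;
    for (float[] v : docVectors) {
      total += dot(query, v);
    }
    return total / docVectors.size();
  }
}
```

Max rewards a single strong match, while avg favors documents whose vectors are uniformly close to the query; which one fits depends on the use case.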
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2487597269
_...contd. from above – thoughts on supporting independent multi-vectors
specified via `NONE` multi-vector aggregation..._
The `Knn{Float|Byte}Vector` fields will accept
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2487589088
> My concern is that this proposal doesn’t truly add support for
independent multi-vectors.
That's a valid concern. I've been thinking about a more comprehensive
multi-vector
jimczi commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2478490713
> One reason to not add this would be if it makes the single vector setup
hard to evolve. I'd like to understand if (and how) this is happening, and
think on how we can address those conc
krickert commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2466210357
> My current thinking is that this is a rapidly evolving field, and it's
early to lean one way or another. Adding this support unlocks experimentation.
Amen!
This ends up
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2466073115
I tried to find some blogs and benchmarks on other library implementations.
Astra DB, Vespa, Faiss, and nmslib all seem to support multi-vectors in some
form.
From what I can
krickert commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2464836106
I would love to see a single knn field that supports multiple vectors.
Right now, I feel like using embedded docs or a child collection to handle
these use cases feels a little too
benwtrent commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2457990323
> One use-case for multi-vectors is indexing product aspects as separate
embeddings for e-commerce search. At Amazon Product Search (where I work), we'd
like to experiment with separat
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2455892813
One use-case for multi-vectors is indexing product aspects as separate
embeddings for e-commerce search. At Amazon Product Search (where I work), we'd
like to experiment with separat
jimczi commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2445295372
The more I think about it, the less I feel like the knn codec is the best
choice for this feature (assuming that this issue is focused on late
interaction models).
> It is possible
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2444990835
As mentioned earlier, here is my rough plan for splitting this change into
smaller PRs. Some of these steps could be merged if the impl. warrants it:
1. Multi-Vector similarity
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2444982753
> Maybe the first goal should be to incorporate max sim for re-ranking use
cases first using a flat format
This could be set up using 1) a single-vector field for HNSW matching,
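
For readers unfamiliar with max-sim re-ranking, here is a minimal sketch of the general technique (ColBERT-style late interaction: for each query vector take the best dot product over the document's vectors, then sum). It assumes the candidate doc ids come from a first-pass search and that each candidate's vectors are available as a `float[][]`. All names are hypothetical and are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not a Lucene class: re-ranks first-pass candidates
// by the sum over query vectors of the max similarity to any doc vector.
public class MaxSimRerankSketch {

  // Plain dot product; assumes both arrays have the same length.
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Sum of per-query-vector maxima. Assumes docVectors is non-empty.
  static float sumMaxSim(float[][] queryVectors, float[][] docVectors) {
    float total = 0f;
    for (float[] q : queryVectors) {
      float best = Float.NEGATIVE_INFINITY;
      for (float[] d : docVectors) {
        best = Math.max(best, dot(q, d));
      }
      total += best;
    }
    return total;
  }

  // Returns candidate doc ids ordered by descending max-sim score.
  static List<Integer> rerank(float[][] queryVectors, Map<Integer, float[][]> candidates) {
    // Score each candidate once so the sort does not recompute similarities.
    Map<Integer, Float> scores = new HashMap<>();
    for (Map.Entry<Integer, float[][]> e : candidates.entrySet()) {
      scores.put(e.getKey(), sumMaxSim(queryVectors, e.getValue()));
    }
    List<Integer> ids = new ArrayList<>(scores.keySet());
    ids.sort((a, b) -> Float.compare(scores.get(b), scores.get(a)));
    return ids;
  }
}
```

Because only the first-pass candidates are scored, the cost grows with the candidate count times the number of query and document vectors, not with the index size.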
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2444980147
Hi @jimczi, the main change in this PR is support for multi-vectors in flat
readers and writers, along with a similarity spec for multiple vector values.
It is possible that H
jimczi commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2441247130
> it seems like single vector is a special form of multi-vector
The solution really depends on the semantics. In its current form, the way
multi-vectors are incorporated in this PR
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2439879963
> it seems like single vector is a special form of multi-vector
Re: single vs. multi-vectors, I think it makes sense not to force users to
choose multi-valued fields upfront. T
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2439876776
Thanks @benwtrent. I've been working on getting a multi-vector benchmark
running to wire this end to end. Found some pesky bugs and oversights. I'm
planning to split this feature int
benwtrent commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2438671673
Hey @vigyasharma there is a lot of good work here.
I am going to shift my focus and see about how I can help here more fully.
What are the next steps?
I am guessing handl
github-actions[bot] commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2378169293
This PR has not had activity in the past 2 weeks, labeling it as stale. If
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you
for your contributi
vigyasharma commented on code in PR #13525:
URL: https://github.com/apache/lucene/pull/13525#discussion_r1757395276
##
lucene/core/src/java/org/apache/lucene/index/MultiVectorSimilarityFunction.java:
##
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) u
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2346995734
> Is "default run" from this PR?
No. "default run" is knn search where each embedding is a separate document
with no relationship between them. I'm still wiring things up to se
benwtrent commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2344312427
@msokolov I saw recently you were working on a major refactor where we just
make every vector access random access. I think this might make the changes in
this PR simpler as we won't h
benwtrent commented on code in PR #13525:
URL: https://github.com/apache/lucene/pull/13525#discussion_r1755216443
##
lucene/core/src/java/org/apache/lucene/index/MultiVectorSimilarityFunction.java:
##
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) und
github-actions[bot] commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2327674839
This PR has not had activity in the past 2 weeks, labeling it as stale. If
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you
for your contributi
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2297380564
> This PR has not had activity in the past 2 weeks, labeling it as stale...
Just to update on some activity here:
I'm working on parent block join benchmarks in `luceneutil`
github-actions[bot] commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2276933809
This PR has not had activity in the past 2 weeks, labeling it as stale. If
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you
for your contributi
benwtrent commented on code in PR #13525:
URL: https://github.com/apache/lucene/pull/13525#discussion_r1691432866
##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatMultiVectorsFormat.java:
##
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundatio
benwtrent commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2250298577
> I think it's awesome to invest in our benchmarking tooling to be able to
test different approaches for multi-valued vectors, but, I don't think that
should be a blocker to merging th
mikemccand commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2250217734
I think it's awesome to invest in our benchmarking tooling to be able to
test different approaches for multi-valued vectors, but, I don't think that
should be a blocker to merging thi
mikemccand commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2250212492
> I started adding support for ParentJoin benchmarks
([issue](https://github.com/mikemccand/luceneutil/issues/284)). Will raise it
in multiple small PRs, here's the [first
one](https
mikemccand commented on code in PR #13525:
URL: https://github.com/apache/lucene/pull/13525#discussion_r1691360151
##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatTensorsFormat.java:
##
@@ -76,6 +76,7 @@
*
* @lucene.experimental
*/
+// no commit
Revi
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2248625418
I started adding support for ParentJoin benchmarks
([issue](https://github.com/mikemccand/luceneutil/issues/284)). Will raise it
in multiple small PRs, here's the [first
one](https:
benwtrent commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2244972179
@vigyasharma
> do we have any existing benchmarks for ParentJoin queries in knn?
No, we do not. I ended up writing a bunch of throwaway code to benchmark
latency and rec
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2244027209
> Cohere's wikipedia embeddings all indicate their parent page. So, I wonder
how this would work on finding the nearest page given the `maxsim(passage)` vs.
using the Lucene join log
vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2237194980
> The pattern doesn't work well with ColBERT-esque models.
+1.. Good question, @navneet1v. I had the same doubts before starting this
effort. There is some discussion in
[1231
benwtrent commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2232052499
@navneet1v
The pattern doesn't work well with ColBERT-esque models.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to Git
navneet1v commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2231959095
@vigyasharma is there a reason for adding multi-vector field support rather
than using the parent-child relationship of the documents to fulfill this use case?
cpoerschke commented on code in PR #13525:
URL: https://github.com/apache/lucene/pull/13525#discussion_r1675739599
##
lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:
##
@@ -452,7 +465,8 @@ synchronized int addOrGet(FieldInfo fi) {
new FieldVectorPr
cpoerschke commented on code in PR #13525:
URL: https://github.com/apache/lucene/pull/13525#discussion_r1675737097
##
lucene/core/src/java/org/apache/lucene/index/IndexingChain.java:
##
@@ -1527,15 +1549,20 @@ void setPoints(int dimensionCount, int
indexDimensionCount, int numB
cpoerschke commented on code in PR #13525:
URL: https://github.com/apache/lucene/pull/13525#discussion_r1675735652
##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatMultiVectorsWriter.java:
##
@@ -0,0 +1,824 @@
+/*
+ * Licensed to the Apache Software Foundati
cpoerschke commented on code in PR #13525:
URL: https://github.com/apache/lucene/pull/13525#discussion_r1675724584
##
lucene/core/src/java/org/apache/lucene/index/FieldInfo.java:
##
@@ -92,6 +97,8 @@ public FieldInfo(
int vectorDimension,
VectorEncoding vectorEncod