[PR] Remove duplicate test code [lucene]

2024-11-08 Thread via GitHub
LuXugang opened a new pull request, #13982: URL: https://github.com/apache/lucene/pull/13982 ### Description The only diff between `doIterate1 ` and `doIterate2 ` is `bb < b.length()` in `doIterate1 `, but this condition seems always true, so should we remove it? -- This is an auto

[PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-08 Thread via GitHub
benchaplin opened a new pull request, #13984: URL: https://github.com/apache/lucene/pull/13984 ### Description This is a first pass on #13724 - I've added the following checks: - verify neighbor list is in order - verify no duplicate neighbors The code was adapted from [

Re: [I] CheckIndex could assert more things about the HNSW graph? [lucene]

2024-11-08 Thread via GitHub
benchaplin commented on issue #13724: URL: https://github.com/apache/lucene/issues/13724#issuecomment-2465431906 I've been interested in working on some HNSW stuff so I took a shot at this. I added only the two checks you mentioned to start, but I'm planning to look through some issues and

Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-11-08 Thread via GitHub
ChrisHegarty commented on PR #13872: URL: https://github.com/apache/lucene/pull/13872#issuecomment-2465408658 > [the last commit](https://github.com/apache/lucene/pull/13872/commits/ef13bade9b3ab8eb5de35e7297a9468ec45fc745) demonstrates re-using the resource-intensive bits (RandomVectorScor

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-11-08 Thread via GitHub
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2466073115 I tried to find some blogs and benchmarks on other library implementations. Astra Db, Vespa, faiss and nmslib, all seem to support multi-vectors in some form. From what I can

Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-11-08 Thread via GitHub
msokolov commented on PR #13872: URL: https://github.com/apache/lucene/pull/13872#issuecomment-2465450838 Yeah I think I will stop working on this. Where I've ended up is with suspicions about the ScorerSupplier/Scorer setup. Because we allocate Scorers *a lot* we rely on that being a chea

Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-11-08 Thread via GitHub
msokolov commented on PR #13872: URL: https://github.com/apache/lucene/pull/13872#issuecomment-2465298380 [the last commit](https://github.com/apache/lucene/pull/13872/commits/ef13bade9b3ab8eb5de35e7297a9468ec45fc745) demonstrates re-using the resource-intensive bits (RandomVectorScorer and

Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-11-08 Thread via GitHub
msokolov commented on code in PR #13872: URL: https://github.com/apache/lucene/pull/13872#discussion_r1834763594 ## lucene/core/src/java/org/apache/lucene/index/FloatVectorValues.java: ## @@ -118,6 +123,9 @@ public Floats vectors() { public float[] get(int ord) throws

Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-11-08 Thread via GitHub
ChrisHegarty commented on PR #13872: URL: https://github.com/apache/lucene/pull/13872#issuecomment-2465028558 >As a side note, I see that IndexInput is Closeable -- but I don't think we ever close() the ones we allocate? Is this a problem? Not a problem per se. The creator of the scor

[I] KnnFloatVectorQuery#toString should show the filter [lucene]

2024-11-08 Thread via GitHub
jpountz opened a new issue, #13983: URL: https://github.com/apache/lucene/issues/13983 ### Description I was looking at the `toString()` of a `KnnFloatVectorQuery` and got surprised that it was not pre-filtered, when I later discovered that it actually was, `toString()` was just omit

Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-11-08 Thread via GitHub
ChrisHegarty commented on PR #13872: URL: https://github.com/apache/lucene/pull/13872#issuecomment-2464371689 The threading and reuse model in the current API is subtle, but works well, as it mostly minimises the potential for garbage. Looking at this again, it's almost like anywhere

Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-11-08 Thread via GitHub
ChrisHegarty commented on PR #13872: URL: https://github.com/apache/lucene/pull/13872#issuecomment-2464599913 Looking again in a bit more detail, I dislike that it is not possible to get access to the vector values without *always* creating at least one slice. Previously this was not necess

Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-11-08 Thread via GitHub
ChrisHegarty commented on PR #13872: URL: https://github.com/apache/lucene/pull/13872#issuecomment-2464638368 I am now completely confused by this change. And I think that my last commit broke the threading model - even though all tests pass!. I'm not saying that we should not do some refac

Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-11-08 Thread via GitHub
ChrisHegarty commented on PR #13872: URL: https://github.com/apache/lucene/pull/13872#issuecomment-2464539084 @msokolov ok, this is that I was thinking - [f1e0007](https://github.com/apache/lucene/pull/13872/commits/f1e0007b15245db3a048ebf5bf340e70929f21b7). There might be other places too,

[PR] Change vector input from IndexInput to RandomAccessInput [lucene]

2024-11-08 Thread via GitHub
dungba88 opened a new pull request, #13981: URL: https://github.com/apache/lucene/pull/13981 ### Description This PR changes the input of vector reader from IndexInput to RandomAccessInput, as we mostly random-access the vectors with seek() followed by readFloats(). One thing

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-11-08 Thread via GitHub
benwtrent commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2464679186 @ShashwatShivam I don't think there is a "memory column" provided anywhere. I simply looked at the individual file sizes (veb, vex) and summed their sizes together. -- This is an au

Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-11-08 Thread via GitHub
msokolov commented on PR #13872: URL: https://github.com/apache/lucene/pull/13872#issuecomment-2464749787 OK; re allocating vectors in supplier, yes that can't work because of aliasing across threads. I also found we do not adequately test this case. There are *many* things in here that ar

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-11-08 Thread via GitHub
krickert commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2464836106 I would love to see a single knn field that supports multiple vectors. Right now I feel like doing the embedded docs or a child collection to handle these use cases feel a little too