Re: [PR] Optimize DFS while marking connected components (#14022) [lucene]

2025-01-06 Thread via GitHub
viswanathk commented on PR #14105: URL: https://github.com/apache/lucene/pull/14105#issuecomment-2574269727 My bad. Made the changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

Re: [PR] HNSW BP reordering [lucene]

2025-01-06 Thread via GitHub
jpountz commented on code in PR #14097: URL: https://github.com/apache/lucene/pull/14097#discussion_r1904681561 ## lucene/misc/src/java/org/apache/lucene/misc/index/BpVectorReorderer.java: ## @@ -0,0 +1,788 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or

Re: [PR] HNSW BP reordering [lucene]

2025-01-06 Thread via GitHub
benwtrent commented on PR #14097: URL: https://github.com/apache/lucene/pull/14097#issuecomment-2573997682 > I didn't see a big impact on recall beyond what is typical from noise -- even with the same graph settings we see variance in recall due to randomness in the graph creation when ther

Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2025-01-06 Thread via GitHub
benchaplin commented on PR #13984: URL: https://github.com/apache/lucene/pull/13984#issuecomment-2573974895 @mikemccand I tried for a bit to recreate the scenario you're describing but wasn't able to. I added the defensive suggestion anyway - I'll keep trying to reproduce. Let me know if yo

Re: [PR] HNSW BP reordering [lucene]

2025-01-06 Thread via GitHub
msokolov commented on PR #14097: URL: https://github.com/apache/lucene/pull/14097#issuecomment-2573974658 > The numbers here are really nice. I just want to understand why they were better, especially as recall changes, which seems to indicate that the graph building itself is being changed

Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2025-01-06 Thread via GitHub
benchaplin commented on code in PR #13984: URL: https://github.com/apache/lucene/pull/13984#discussion_r1904663757 ## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ## @@ -2746,6 +2785,191 @@ public static Status.VectorValuesStatus testVectors( return status;

Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2025-01-06 Thread via GitHub
msokolov commented on PR #13984: URL: https://github.com/apache/lucene/pull/13984#issuecomment-2573971976 Ooh, I was wondering about this `PerFieldKnnVectorsFormat` case - thanks for testing @mikemccand

Re: [PR] Implement ACORN-1 search for HNSW [lucene]

2025-01-06 Thread via GitHub
benwtrent commented on PR #14085: URL: https://github.com/apache/lucene/pull/14085#issuecomment-2573950904 Thank you for taking a stab at this @benchaplin ! I wonder if we can adjust the algorithm to more intelligently switch between the algorithms. Something like: - Fan out one lay

Re: [PR] HNSW BP reordering [lucene]

2025-01-06 Thread via GitHub
benwtrent commented on PR #14097: URL: https://github.com/apache/lucene/pull/14097#issuecomment-2573940308 If we really think `vint` is the cause, I wonder if we should switch encoding to the `readGroupVInts` stuff? https://github.com/apache/lucene/issues/12871 My thought around

Re: [PR] Optimize DFS while marking connected components (#14022) [lucene]

2025-01-06 Thread via GitHub
msokolov commented on PR #14105: URL: https://github.com/apache/lucene/pull/14105#issuecomment-2573865042 thanks! I merged the CHANGES ... but ... maybe we also need to backport the CHANGES change :yum: -- could add it to the backport PR?

Re: [PR] Add CHANGES.txt entry for HNSW DFS Optimization #14022 [lucene]

2025-01-06 Thread via GitHub
msokolov merged PR #14104: URL: https://github.com/apache/lucene/pull/14104

Re: [PR] HNSW BP reordering [lucene]

2025-01-06 Thread via GitHub
mikemccand commented on PR #14097: URL: https://github.com/apache/lucene/pull/14097#issuecomment-2573851904 I like the more efficient delta encoding theory. Decoding `vInt` is a hotspot for HNSW graph traversal ... so if we can use 2 bytes instead of 3, or 1 byte instead of 2, thanks
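To make the "2 bytes instead of 3, or 1 byte instead of 2" point concrete, here is a minimal sketch of how delta size drives the cost of a 7-bits-per-byte variable-length integer. It assumes neighbor ordinals are stored sorted, as the first value followed by deltas; the class `VIntDeltaSketch`, its helper methods, and the sample ordinals are hypothetical illustrations, not Lucene code or data from this thread.

```java
public class VIntDeltaSketch {
  // Bytes needed for a non-negative int when each byte carries 7 payload bits
  // and the high bit marks "more bytes follow".
  static int vIntLength(int value) {
    int bytes = 1;
    while ((value & ~0x7F) != 0) {
      value >>>= 7;
      bytes++;
    }
    return bytes;
  }

  // Total bytes for a sorted list encoded as first value plus successive deltas.
  // Assumption: this mirrors the delta encoding theory discussed above.
  static int encodedSize(int[] sortedOrdinals) {
    int total = 0;
    int previous = 0;
    for (int ord : sortedOrdinals) {
      total += vIntLength(ord - previous);
      previous = ord;
    }
    return total;
  }

  public static void main(String[] args) {
    // Hypothetical neighbor lists: same length, different locality.
    int[] scattered = {5, 40_000, 900_000, 2_500_000};
    int[] clustered = {900_000, 900_003, 900_050, 900_120};
    System.out.println("scattered bytes: " + encodedSize(scattered));
    System.out.println("clustered bytes: " + encodedSize(clustered));
  }
}
```

If reordering places similar vectors at nearby ordinals, the deltas in a neighbor list shrink, so each one needs fewer bytes and fewer loop iterations to decode. Whether that fully explains the measured speedup is still an open question in the thread.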

Re: [PR] Optimize DirectIOIndexInput [lucene]

2025-01-06 Thread via GitHub
mikemccand commented on code in PR #14103: URL: https://github.com/apache/lucene/pull/14103#discussion_r1904581374 ## lucene/misc/src/java/org/apache/lucene/misc/store/DirectIODirectory.java: ## @@ -428,19 +441,110 @@ public void readBytes(byte[] dst, int offset, int len) throw

Re: [PR] HNSW BP reordering [lucene]

2025-01-06 Thread via GitHub
msokolov commented on PR #14097: URL: https://github.com/apache/lucene/pull/14097#issuecomment-2573831315 I will note that tests with smaller indexes don't show such dramatic improvements - more support for the theory that graph decoding is what is helped, because there are no real compress

Re: [PR] HNSW BP reordering [lucene]

2025-01-06 Thread via GitHub
msokolov commented on PR #14097: URL: https://github.com/apache/lucene/pull/14097#issuecomment-2573825773 Regarding merging luceneutil tooling - I will open a PR, but suggest we hold off merging until this change hits Lucene

Re: [PR] HNSW BP reordering [lucene]

2025-01-06 Thread via GitHub
msokolov commented on PR #14097: URL: https://github.com/apache/lucene/pull/14097#issuecomment-2573825107 I am not sure, but surmising that search performance is improved because of some combination of (1) graph ordinal decoding being faster (since we encode using VInts and these are now sm

Re: [PR] HNSW BP reordering [lucene]

2025-01-06 Thread via GitHub
benwtrent commented on PR #14097: URL: https://github.com/apache/lucene/pull/14097#issuecomment-2573818646 These are exciting numbers! It's interesting how improved search latency is dropping the index build time. Do we know why the searching times are so much better? Is it simply beca

Re: [PR] HNSW BP reordering [lucene]

2025-01-06 Thread via GitHub
mikemccand commented on PR #14097: URL: https://github.com/apache/lucene/pull/14097#issuecomment-2573791823 @msokolov do you have changes to luceneutil's `knnPerfTest.py` to enable this? Let's merge those upstream (to luceneutil) too ... I'm working on getting nightly benchy to run `knnPer

Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2025-01-06 Thread via GitHub
mikemccand commented on PR #13984: URL: https://github.com/apache/lucene/pull/13984#issuecomment-2573785203 Thank you for persisting on this important change @benchaplin! I applied this PR to my local Lucene clone and ran `CheckIndex` on the vector index created by [last night's night

[PR] Add CHANGES.txt entry for HNSW DFS Optimization #14022 [lucene]

2025-01-06 Thread via GitHub
viswanathk opened a new pull request, #14104: URL: https://github.com/apache/lucene/pull/14104 Modifying the CHANGES.txt entry for #14022

Re: [PR] Optimize DFS while marking connected components [lucene]

2025-01-06 Thread via GitHub
viswanathk commented on PR #14022: URL: https://github.com/apache/lucene/pull/14022#issuecomment-2573712838 Yeah, let me make them real quick.

Re: [PR] Specialize DisiPriorityQueue for the 2-clauses case. [lucene]

2025-01-06 Thread via GitHub
gsmiller commented on code in PR #14070: URL: https://github.com/apache/lucene/pull/14070#discussion_r1904502387 ## lucene/core/src/java/org/apache/lucene/search/DisiPriorityQueueN.java: ## @@ -0,0 +1,230 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or m

Re: [PR] Remove scoreAll() optimization from DefaultBulkScorer. [lucene]

2025-01-06 Thread via GitHub
gsmiller commented on code in PR #14039: URL: https://github.com/apache/lucene/pull/14039#discussion_r1904467456 ## lucene/core/src/java/org/apache/lucene/search/Weight.java: ## @@ -289,75 +262,108 @@ static int scoreRange( } } - int doc = iterator.docID()

Re: [PR] SortedSet DV Multi Range query [lucene]

2025-01-06 Thread via GitHub
gsmiller commented on code in PR #13974: URL: https://github.com/apache/lucene/pull/13974#discussion_r1904400249 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java: ## @@ -0,0 +1,300 @@ +/* + * Licensed to the Apache Software Foundation (AS

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-01-06 Thread via GitHub
tveasey commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2573444675 Just sticking purely to the issues raised regarding this PR and the blog Ben linked explaining the methodology... > Although the RaBitQ approach is conceptually rather different to

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-01-06 Thread via GitHub
ChrisHegarty commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2573298347 In my capacity as the Lucene PMC Chair (and with explicit acknowledgment of my current employment with Elastic, as of the date of this writing), I want to emphasize that proper attr

Re: [PR] Add two new "Seeded" Knn queries for seeded vector search [lucene]

2025-01-06 Thread via GitHub
benwtrent commented on code in PR #14084: URL: https://github.com/apache/lucene/pull/14084#discussion_r1904255306 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java: ## @@ -67,7 +70,21 @@ public static void search( HnswGraphSearcher graphSearcher =

Re: [PR] Add two new "Seeded" Knn queries for seeded vector search [lucene]

2025-01-06 Thread via GitHub
benwtrent commented on code in PR #14084: URL: https://github.com/apache/lucene/pull/14084#discussion_r1904254316 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java: ## @@ -67,7 +70,21 @@ public static void search( HnswGraphSearcher graphSearcher =

Re: [PR] Add two new "Seeded" Knn queries for seeded vector search [lucene]

2025-01-06 Thread via GitHub
benwtrent commented on code in PR #14084: URL: https://github.com/apache/lucene/pull/14084#discussion_r1904253643 ## lucene/core/src/java/org/apache/lucene/search/knn/SeededKnnCollectorManager.java: ## @@ -0,0 +1,174 @@ +/* + * Licensed to the Apache Software Foundation (ASF) un

Re: [PR] Update copyright year in NOTICE.txt file. [lucene]

2025-01-06 Thread via GitHub
cpoerschke merged PR #14098: URL: https://github.com/apache/lucene/pull/14098

Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2025-01-06 Thread via GitHub
benwtrent commented on code in PR #13984: URL: https://github.com/apache/lucene/pull/13984#discussion_r1904112483 ## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ## @@ -2746,6 +2785,191 @@ public static Status.VectorValuesStatus testVectors( return status;

Re: [PR] Optimize DFS while marking connected components [lucene]

2025-01-06 Thread via GitHub
msokolov commented on PR #14022: URL: https://github.com/apache/lucene/pull/14022#issuecomment-2573041092 @viswanathk I just merged and then belatedly realized we should also have a CHANGES.txt entry for this - I guess it belongs under Optimizations heading -- do you want to add? And then w

Re: [PR] Optimize DFS while marking connected components [lucene]

2025-01-06 Thread via GitHub
msokolov merged PR #14022: URL: https://github.com/apache/lucene/pull/14022

Re: [PR] Optimize DFS while marking connected components [lucene]

2025-01-06 Thread via GitHub
msokolov commented on PR #14022: URL: https://github.com/apache/lucene/pull/14022#issuecomment-2573037086 sorry for the delay - holidays intervened!

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-01-06 Thread via GitHub
gaoj0017 commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2573030521 Hi @msokolov , the discussion here is not only about the blog posts but also related to the pull request here. In this pull request (and its related blogs), it claims a new method witho

Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2025-01-06 Thread via GitHub
msokolov commented on code in PR #13984: URL: https://github.com/apache/lucene/pull/13984#discussion_r1904101920 ## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ## @@ -2746,6 +2785,191 @@ public static Status.VectorValuesStatus testVectors( return status;

Re: [PR] HNSW BP reordering [lucene]

2025-01-06 Thread via GitHub
msokolov commented on PR #14097: URL: https://github.com/apache/lucene/pull/14097#issuecomment-2572995493 Right, I was kind of hoping @jpountz would review, but perhaps he's out for vacation. Most of this has already been seen and he approved the earlier PR. The main new thing here that mig

Re: [PR] Misc cleanups to TopScoreDocCollector [lucene]

2025-01-06 Thread via GitHub
original-brownbear merged PR #13935: URL: https://github.com/apache/lucene/pull/13935