Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-03-27 Thread via GitHub
benwtrent commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2757791987 > Can we expose a graph construction parameter in Lucene99HnswVectorsFormat to gate the connectComponents() call? This would allow us to mitigate this issue while a more comprehen

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-03-24 Thread via GitHub
txwei commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2749658069 Can we expose a graph construction parameter in `Lucene99HnswVectorsFormat` to gate the `connectComponents()` call? This would allow us to mitigate this issue while a more comprehensi

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-03-04 Thread via GitHub
msokolov commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2698456606 Maybe as a short-term mitigation we should revert or disable the `connectComponents` impl since its supposed improvements are kind of theoretical and it comes with a deadly vulnera

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-03-04 Thread via GitHub
benwtrent commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2698475184 The goal of connectComponents is to help graphs that have gaps in their connectivity. However, when its needed most (e.g. tons of gaps and poor connectivity), it does more harm th

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-03-04 Thread via GitHub
msokolov commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2698452319 I tried indexing some [NOAA climate data](https://www.ncei.noaa.gov/products/land-based-station/noaa-global-temp) that is four-dimensional (temperature over last 150 years for ever

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-02-21 Thread via GitHub
benwtrent commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2675053049 > Sorry I didn't understand this. @msokolov I mean that during search, we would by default have more neighbors to look at. Ensuring diversity eagerly means that its possible

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-02-21 Thread via GitHub
msokolov commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2675147250 > I mean that during search, we would by default have more neighbors to look at. I see, yes, that's true. -- This is an automated message from the Apache Git Service. To

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-02-21 Thread via GitHub
Vikasht34 commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2675065270 @benwtrent As far as I understand from your idea is to use Delaunay Triangulation and skip connectComponents() ? -- This is an automated message from the Apache Git Service. To

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-02-21 Thread via GitHub
msokolov commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2674568528 > I am not sure if its "better" I would assume it makes search slower on well distributed and distinguished vectors :/ Sorry I didn't understand this. In the normal case, we'

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-02-21 Thread via GitHub
benwtrent commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2674533124 > This means we have to sort at that point IIRC, but it might be a better, more robust choice? I am not sure if its "better" I would assume it makes search slower on well d

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-02-21 Thread via GitHub
msokolov commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2674487026 In the original diversity impl we had allowed the neighbors array to fill without regard to any diversity criterion, and only started imposing it once the array was full. This mea

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-02-20 Thread via GitHub
benwtrent commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2671796667 So, these adverse scenarios where connect components has to do a ton of work all stem from us keeping the graph very sparse (e.g. only connecting diverse nodes). I wonder

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-02-13 Thread via GitHub
Vikasht34 commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2657428175 Interesting , let me quickly run those tests my self also to see what would be impact!! Thanks for logs .. -- This is an automated message from the Apache Git Service. To respon

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-02-12 Thread via GitHub
benwtrent commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2654755040 So, verifying the "fewDistinct" slowness, here is how connect components works in this adverse case: ``` 1> HNSW 1 [2025-02-12T20:14:45.641640Z; TEST-TestKnnFloatVector

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-02-08 Thread via GitHub
Vikasht34 commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2646005402 ``` private void connectComponents() { BitSet visited = new BitSet(graph.size()); List entryPoints = new ArrayList<>(); for (int node = 0; node < gra

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-02-07 Thread via GitHub
tteofili commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2642779168 this can be reproduced with either of the following tests ```java public void testSameVectorIndexedMultipleTimes() throws IOException { try (Directory d = newDirecto