[ https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541180#comment-17541180 ]
Michael Sokolov commented on LUCENE-10590:
------------------------------------------

> Love the title, Michael Sokolov. Very Douglas-y Adams-y. :starry eyes:

So I wrote a unit test, wrapped the `RandomVectorValues.vectorValue(int)` method to see where it was being called, and fiddled around with `BoundsChecker` to see what would happen if we swapped its `<` with a `<=`. What I found is that in the existing situation we do indeed crawl over the entire graph every time we insert a node, because every node looks like a viable candidate (we only exclude nodes whose scores are `<` the current least score, or `>` for the inverse scoring functions). But if we change to using `<=` (resp. `>=`), the cost shifts over to `HnswGraphBuilder.findWorstNonDiverse`, since there we early-terminate in the opposite way.

Anyway, that isn't very clear, but the point is that these boundary conditions are sensitive to this equality case (where everything is equally distant to everything else), and they explode in different directions! Basically what we need to do is bias them to give up when stuff is exactly ==. Possibly `BoundsChecker` should get a new parameter (open/closed).

> Indexing all zero vectors leads to heat death of the universe
> -------------------------------------------------------------
>
>                 Key: LUCENE-10590
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10590
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael Sokolov
>            Priority: Major
>
> By accident while testing something else, I ran a luceneutil test indexing 1M
> 100d vectors where all the vectors were all zeroes. This caused indexing to
> take a very long time (~40x normal - it did eventually complete) and the
> search performance was similarly bad. We should not degrade by orders of
> magnitude with even the worst data, though.
> I'm not entirely sure what the issue is, but perhaps as long as we keep
> finding hits that are "better" we keep exploring the graph, where "better"
> means (score, -docid) >= (lowest score, -docid). If that's right and all docs
> have the same score, then we probably need to either switch to > (but this
> could lead to poorer recall in normal cases) or introduce some kind of
> minimum score threshold?
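To make the boundary-condition issue concrete, here is a minimal sketch of the open/closed distinction the comment describes. This is not Lucene's actual `BoundsChecker` API; the `Bound` class, `prune`, and `survivors` names are hypothetical, invented purely to show that with a strict `<` bound, candidates with exactly equal scores are never pruned (so all-zero vectors make the search visit everything), while a closed `<=` bound prunes every equal-scored candidate after the first:

```java
/**
 * Sketch (hypothetical, not Lucene's BoundsChecker) of open vs. closed
 * pruning bounds when every candidate has the same score.
 */
public class BoundsSketch {

  /** Tracks the best score seen so far; 'strict' selects open vs. closed. */
  static final class Bound {
    float bound = Float.NEGATIVE_INFINITY;
    final boolean strict; // true: reject only scores strictly below the bound

    Bound(boolean strict) {
      this.strict = strict;
    }

    /** Returns true if this candidate should be skipped (not explored). */
    boolean prune(float score) {
      return strict ? score < bound : score <= bound;
    }

    void update(float score) {
      if (score > bound) {
        bound = score;
      }
    }
  }

  /** Counts how many of n equal-scored candidates survive pruning. */
  static int survivors(int n, float score, boolean strict) {
    Bound bound = new Bound(strict);
    int kept = 0;
    for (int i = 0; i < n; i++) {
      if (!bound.prune(score)) {
        kept++;
        bound.update(score);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // All-zero vectors: every candidate scores identically (here, 0.5f).
    // Open bound ('<'): 0.5 < 0.5 is false, so nothing is ever pruned.
    System.out.println("open:   " + survivors(1000, 0.5f, true));
    // Closed bound ('<='): everything after the first candidate is pruned.
    System.out.println("closed: " + survivors(1000, 0.5f, false));
  }
}
```

This also illustrates the comment's other observation: flipping the comparison does not remove the sensitivity, it just moves it, because any loop that early-terminates in the opposite sense (as `findWorstNonDiverse` reportedly does) inherits the degenerate behavior instead.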