[ https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541180#comment-17541180 ]
Michael Sokolov commented on LUCENE-10590:
------------------------------------------

> Love the title, Michael Sokolov. Very Douglas-y Adams-y. :starry eyes:

So I wrote a unit test, wrapped the `RandomVectorValues.vectorValue(int)` method to see where it was being called, and fiddled around with `BoundsChecker` to see what would happen if we swapped its `<` with a `<=`. What I found is that in the existing situation we do indeed crawl over the entire graph every time we insert a node, because every node looks like a viable candidate (we only exclude nodes whose scores are `<` the current least score, or `>` for the inverse scoring functions). But if we change to using `<=` (resp. `>=`), the cost shifts over to `HnswGraphBuilder.findWorstNonDiverse`, since there we early-terminate in the opposite way.

Anyway, that isn't very clear, but the point is that these boundary conditions are sensitive to this equality case (where everything is equally distant to everything else), and they explode in different directions! Basically what we need to do is bias them to give up when stuff is exactly ==. Possibly `BoundsChecker` should get a new parameter (open/closed).

> Indexing all zero vectors leads to heat death of the universe
> -------------------------------------------------------------
>
>                 Key: LUCENE-10590
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10590
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael Sokolov
>            Priority: Major
>
> By accident while testing something else, I ran a luceneutil test indexing 1M
> 100d vectors where all the vectors were all zeroes. This caused indexing to
> take a very long time (~40x normal - it did eventually complete) and the
> search performance was similarly bad. We should not degrade by orders of
> magnitude with even the worst data, though.
> I'm not entirely sure what the issue is, but perhaps as long as we keep
> finding hits that are "better" we keep exploring the graph, where "better"
> means (score, -docid) >= (lowest score, -docid). If that's right and all docs
> have the same score, then we probably need to either switch to > (but this
> could lead to poorer recall in normal cases) or introduce some kind of
> minimum score threshold?
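To make the boundary-condition issue concrete, here is a minimal sketch of the open/closed distinction the comment describes. This is not Lucene's actual `BoundsChecker` API; the `Bound` class, `prune`, and `survivors` names are hypothetical, invented purely to show that with a strict `<` bound, candidates with exactly equal scores are never pruned (so all-zero vectors make the search visit everything), while a closed `<=` bound prunes every equal-scored candidate after the first:

```java
/**
 * Sketch (hypothetical, not Lucene's BoundsChecker) of open vs. closed
 * pruning bounds when every candidate has the same score.
 */
public class BoundsSketch {

  /** Tracks the best score seen so far; 'strict' selects open vs. closed. */
  static final class Bound {
    float bound = Float.NEGATIVE_INFINITY;
    final boolean strict; // true: reject only scores strictly below the bound

    Bound(boolean strict) {
      this.strict = strict;
    }

    /** Returns true if this candidate should be skipped (not explored). */
    boolean prune(float score) {
      return strict ? score < bound : score <= bound;
    }

    void update(float score) {
      if (score > bound) {
        bound = score;
      }
    }
  }

  /** Counts how many of n equal-scored candidates survive pruning. */
  static int survivors(int n, float score, boolean strict) {
    Bound bound = new Bound(strict);
    int kept = 0;
    for (int i = 0; i < n; i++) {
      if (!bound.prune(score)) {
        kept++;
        bound.update(score);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // All-zero vectors: every candidate scores identically (here, 0.5f).
    // Open bound ('<'): 0.5 < 0.5 is false, so nothing is ever pruned.
    System.out.println("open:   " + survivors(1000, 0.5f, true));
    // Closed bound ('<='): everything after the first candidate is pruned.
    System.out.println("closed: " + survivors(1000, 0.5f, false));
  }
}
```

This also illustrates the comment's other observation: flipping the comparison does not remove the sensitivity, it just moves it, because any loop that early-terminates in the opposite sense (as `findWorstNonDiverse` reportedly does) inherits the degenerate behavior instead.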