msokolov commented on issue #14127: URL: https://github.com/apache/lucene/issues/14127#issuecomment-2603066263
I spent some time trying to understand how this arose and working on a fix. I believe the BP reordering exposed a pre-existing behavior in the component-merging code that could create these duplicates. Nothing really prevents it, but it didn't happen before. I'm not completely sure why; my best theory is that, because of the way graphs are created, we normally add docs in docid order, and that is no longer true when reordering.

I looked into how to prevent the duplicates. One thing we could do is remove them when writing the graph in the codec (in `Lucene99HnswVectorsWriter.writeGraph`). This is a good place to do it because we already sort the nodes there. Doing this in `HnswGraphBuilder` is also possible, but I think it would be less efficient because the neighbor nodes aren't sorted there and any given node's neighbors might need to be checked multiple times.
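To illustrate why the write path is a cheap place to drop duplicates, here is a minimal standalone sketch, not the actual Lucene code. The class `NeighborDedup` and the method `sortAndDedup` are hypothetical names, and the sketch assumes a node's neighbors are available as an `int[]` plus a live count. Once the array is sorted, duplicates sit next to each other, so a single linear pass removes them.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Lucene.
final class NeighborDedup {

    /**
     * Sorts the first {@code size} entries of {@code neighbors} in place and
     * compacts them so that each node id appears only once.
     *
     * @return the number of distinct ids; entries at or past that index are stale
     */
    public static int sortAndDedup(int[] neighbors, int size) {
        if (size <= 1) {
            return size; // nothing to sort or compact
        }
        Arrays.sort(neighbors, 0, size);
        int write = 1; // next slot for a distinct id
        for (int read = 1; read < size; read++) {
            // After sorting, a duplicate always equals the last id we kept.
            if (neighbors[read] != neighbors[write - 1]) {
                neighbors[write++] = neighbors[read];
            }
        }
        return write;
    }
}
```

For example, calling `sortAndDedup` on `{12, 4, 12, 9, 4}` with `size = 5` leaves `4, 9, 12` in the first three slots and returns 3. Doing the same check at insertion time, on unsorted neighbors, would need a scan of the neighbor list for each added link, which is the extra cost mentioned above.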