mikemccand opened a new issue, #15427: URL: https://github.com/apache/lucene/issues/15427
### Description I was talking to @msokolov just now and this random idea came up... this is just brainstormy/speculation idea at this point: The HNSW graph is built incrementally: for each new vector to insert, we run a topK search, only seeing all previously added vectors, then add a new node connecting to those top K (we might also prune existing nodes connections to keep diversity). We do this one by one (well, concurrently) ... but this means (maybe?) vectors added early can have poor-ish neighbors since they didn't see all the future vectors yet. But, when future vectors are added, when they are close to an already-added vector, they will also add the return transition (from old node to new node), so this sort of makes up for that old node not seeing future vectors yet? Anyway, it made me wonder whether a 2nd phase through all the vectors, with the "benefit of hindsight", might somehow yield improvements (smaller graph, faster/better search-time recall curve)? I think @msokolov mentioned that the original HNSW paper had talked about this behavior but it was somehow OK / a benefit / not a problem? If we were to plot "average node neighbor distance" as a function of when the node was added, I wonder if we'd see a correlation of lower quality neighbors for earlier added vectors? Or, is the bidirectional linking completely offsetting this bias? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
