mikemccand opened a new issue, #15427:
URL: https://github.com/apache/lucene/issues/15427

   ### Description
   
   I was talking to @msokolov just now and this random idea came up... this is 
just brainstormy/speculation idea at this point:
   
   The HNSW graph is built incrementally: for each new vector to insert, we run 
a topK search, only seeing all previously added vectors, then add a new node 
connecting to those top K (we might also prune existing nodes connections to 
keep diversity).  We do this one by one (well, concurrently) ... but this means 
(maybe?) vectors added early can have poor-ish neighbors since they didn't see 
all the future vectors yet.  But, when future vectors are added, when they are 
close to an already-added vector, they will also add the return transition 
(from old node to new node), so this sort of makes up for that old node not 
seeing future vectors yet?
   
   Anyway, it made me wonder whether a 2nd phase through all the vectors, with 
the "benefit of hindsight", might somehow yield improvements (smaller graph, 
faster/better search-time recall curve)?
   
   I think @msokolov mentioned that the original HNSW paper had talked about 
this behavior but it was somehow OK / a benefit / not a problem?
   
   If we were to plot "average node neighbor distance" as a function of when 
the node was added, I wonder if we'd see a correlation of lower quality 
neighbors for earlier added vectors?  Or, is the bidirectional linking 
completely offsetting this bias?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to