mikemccand commented on issue #15427:
URL: https://github.com/apache/lucene/issues/15427#issuecomment-3547768731

   > Very similar to the new fix up logic added in this PR: 
https://github.com/apache/lucene/pull/15003
   
   +1, I wonder if that fixup logic (which currently only kicks in on the 
"swiss cheese" case where an HNSW graph has holes because vectors were deleted) 
could also just run on an ordinary HNSW graph without holes?  Or does it 
specifically target the holes as the things needing fix-up attention?
   
   Maybe a simple offline one-off experiment/prototype to try: build the HNSW 
graph like we do today.  But then build another HNSW graph where you insert the 
same vectors in the reverse order.  Then merge the two graphs (map each node 
onto its counterpart in the other graph, taking the union of all edges).  Hmm, 
do we have a merge/union API today on HNSW graphs?  Then maybe prune a bit 
(diversity pruning?) to get avg connection count back to "typical", then see if 
that helps recall/performance tradeoff.  Sounds like a lot of work :)  But it 
might let us see if "hindsight" would've been helpful.
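
A rough sketch of the merge-then-prune step, for one graph level only. This is not Lucene code: `union_graphs`, `diversity_prune`, the dict-of-sets adjacency layout, and the `b_to_a` node mapping are all hypothetical names chosen for illustration, and Euclidean distance is assumed. The pruning rule is the usual HNSW-style heuristic (keep a candidate only if it is closer to the node than to any neighbor already kept).

```python
import math


def union_graphs(adj_a, adj_b, b_to_a):
    """Union the edges of two graphs built over the same vectors.

    adj_a, adj_b: dict mapping node id -> set of neighbor ids (one level).
    b_to_a: maps a node id in graph B to the id of the same vector in graph A.
    """
    merged = {node: set(nbrs) for node, nbrs in adj_a.items()}
    for node_b, nbrs_b in adj_b.items():
        node = b_to_a[node_b]
        merged.setdefault(node, set()).update(b_to_a[n] for n in nbrs_b)
    return merged


def diversity_prune(node, candidates, vectors, max_conn):
    """Pick up to max_conn diverse neighbors for `node`.

    Candidates are visited nearest-first; one is kept only if it is closer
    to `node` than to every neighbor kept so far.
    """
    def dist_to_node(c):
        return math.dist(vectors[node], vectors[c])

    kept = []
    for cand in sorted(candidates, key=dist_to_node):
        if len(kept) >= max_conn:
            break
        d_node = dist_to_node(cand)
        if all(math.dist(vectors[cand], vectors[k]) > d_node for k in kept):
            kept.append(cand)
    return kept
```

Pruning each node's list independently leaves the result directed, so a real prototype would still need to decide how to handle back-links and the upper levels.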
   
   Another thing I'm going to hopefully try soon: I worry that the vectors we 
get for a given corpus (e.g. for Wikipedia Cohere on HuggingFace that we use in 
[Lucene's nightly 
benchy](https://benchmarks.mikemccandless.com/knnResults.html)) might have been 
pre-sorted in some "interesting" way which might alter results.  This sort of 
sneaky benchmark attack vector is scary to me :)  There are so many sources of 
subtle "fucked up the benchmarking" already!  I don't need one more to fret 
about.  The order of documents can make a big difference (e.g. graph bipartite 
reordering results) and might lead to wrong conclusions in general and steer 
our development incorrectly, baking echoes of that mistake into Lucene's 
algorithm choices!
   
   So I'm going to try what is hopefully an A/A test: `knnPerfTest.py` on the existing 
Cohere vectors (same order they are in at HuggingFace), then another run after 
shuffling the vectors.  Hopefully those two are nearly the same, under the 
"standard" HNSW/indexing noise floor (whatever THAT is, I don't know).
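
A minimal sketch of the shuffle step, assuming the vectors are stored as fixed-width records (`dim` values of 4 bytes each) in one flat binary blob. That layout, the `shuffle_vectors` name, and the seed are assumptions for illustration, not something `knnPerfTest.py` is known to provide. The permutation is returned so that any precomputed exact nearest-neighbor ids can be remapped to the new ordinals.

```python
import random


def shuffle_vectors(raw, dim, seed=42, bytes_per_value=4):
    """Shuffle fixed-width vector records with a fixed seed.

    Returns (shuffled_bytes, perm) where perm[new_ord] == old_ord.
    """
    record = dim * bytes_per_value
    if len(raw) % record != 0:
        raise ValueError("input size is not a multiple of the vector size")
    count = len(raw) // record
    perm = list(range(count))
    # A seeded generator keeps the shuffled order reproducible across runs.
    random.Random(seed).shuffle(perm)
    shuffled = b"".join(raw[old * record:(old + 1) * record] for old in perm)
    return shuffled, perm
```

This version holds everything in memory, which is fine for a quick check; a large corpus would want a streaming or memory-mapped variant.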


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

