venkateshwaracholan commented on issue #16251: URL: https://github.com/apache/lucene/issues/16251#issuecomment-4697679092
@vsop-479 dug into this a bit further and was able to reproduce the behavior outside of the failing HNSW seed. Using the reported seed (`8B08865AA8BAB759:E27CE1241E7C9A82`), the generated graph has 111 nodes, with most nodes having degree 1 and two high-degree hubs (58 and 53). `computeJoinSet()` returns all 111 nodes and stops because `gTot >= gExit`. To understand whether this was specific to the seed or HNSW construction, I built a few graphs manually with `OnHeapHnswGraph` and traced `computeJoinSet()`. In particular, simple hub-and-spoke graphs (one hub with many degree-1 leaves) also produce `joinSet == graph`, even without random vectors or HNSW graph construction. Smallest handcrafted case I found is a 3-node hub-and-spoke graph (hub + 2 leaves). The trace is attached, but the short version is: * `gExit = 6` * hub contributes `gain = 4` * both leaves become stale and are re-evaluated with `gain = 1` * the algorithm reaches `gTot >= gExit` only after selecting all three nodes Result: `joinSet = 3/3`. By contrast, graphs with more overlap (e.g. two hubs with shared leaves) produce much smaller join sets. From the traces, it looks like `computeJoinSet()` is optimizing for reaching the coverage budget (`gExit`) rather than minimizing join-set size. On hub-and-spoke topologies, the hubs contribute a large initial gain, but after stale recalculation the leaves contribute only 1 each, so the algorithm ends up selecting nearly every node in order to reach `gExit`. This makes me wonder whether the failing seed is exposing a graph topology where `joinSet == graph` is expected behavior of the current heuristic, rather than a merge correctness issue. The test currently asserts `j.size() < graph.size()`, but I couldn't find a corresponding guarantee in `computeJoinSet()` itself. Curious what the intended behavior is here. Should the test be relaxed, or is a full join set considered a bug in the heuristic? [joinset-handcrafted-3node.log](https://github.com/user-attachments/files/28907986/joinset-handcrafted-3node.log) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
