msokolov commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2698452319
I tried indexing some [NOAA climate data](https://www.ncei.noaa.gov/products/land-based-station/noaa-global-temp) that is four-dimensional (temperature over the last 150 years for every 5-degree lat-long patch, 5MM docs) and reproduced this problem: indexing would take a very long time, and connectComponents would then take even longer. As an experiment, I tried relaxing our diversity constraint with a simple patch. It enabled indexing to complete in a reasonable time for some HNSW graph parameter choices, but in other cases it could still hit the adversarial connectComponents case.

```
diff --git a/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java b/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java
index 2fa7fed2a0d..0aebeb3236c 100644
--- a/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java
+++ b/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java
@@ -381,6 +381,16 @@ public class HnswGraphBuilder implements HnswBuilder {
         neighbors.addInOrder(cNode, cScore);
       }
     }
+    // populate any remaining spots with non-diverse neighbors
+    for (int i = candidates.size() - 1; neighbors.size() < maxConnOnLevel && i >= 0; i--) {
+      if (mask[i] == false) {
+        int cNode = candidates.nodes()[i];
+        float cScore = candidates.scores()[i];
+        assert cNode <= hnsw.maxNodeId();
+        mask[i] = true;
+        neighbors.addOutOfOrder(cNode, cScore);
+      }
+    }
     return mask;
   }
```

A few conclusions:

1. HNSW is not the best indexing data structure for every numerical vector data set. Probably Points (i.e. kd-tree) would be better for low-dimensional data (< ~12 dimensions)?
2. Our connectComponents implementation has a horrible worst case that we need to fix.
3. We might want to fiddle with our diversity criterion, but it isn't a solution for (2).
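To illustrate why low-dimensional data interacts badly with the diversity constraint, here is a minimal standalone sketch of diversity-based neighbor selection followed by the kind of backfill the patch adds. It is not the Lucene implementation: the class `DiversityBackfillSketch`, the `similarity` function, the `selectNeighbors` signature, and the `backfill` flag are all hypothetical names chosen for the example, and it assumes candidates arrive sorted from most to least similar to the target.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Lucene.
final class DiversityBackfillSketch {

  // Assumed similarity: negated squared Euclidean distance, so higher means closer.
  static float similarity(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return -sum;
  }

  // Picks up to maxConn neighbors for `target`. `candidates` is assumed to be
  // sorted from most to least similar to the target.
  static List<Integer> selectNeighbors(
      float[][] vectors, int target, int[] candidates, int maxConn, boolean backfill) {
    boolean[] mask = new boolean[candidates.length];
    List<Integer> neighbors = new ArrayList<>();

    // Pass 1: keep a candidate only if it is closer to the target than to
    // every neighbor already selected (the diversity criterion).
    for (int i = 0; i < candidates.length && neighbors.size() < maxConn; i++) {
      int c = candidates[i];
      float scoreToTarget = similarity(vectors[c], vectors[target]);
      boolean diverse = true;
      for (int n : neighbors) {
        if (similarity(vectors[c], vectors[n]) >= scoreToTarget) {
          diverse = false;
          break;
        }
      }
      if (diverse) {
        mask[i] = true;
        neighbors.add(c);
      }
    }

    // Pass 2 (the relaxation): fill any remaining slots with the best
    // candidates that were rejected as non-diverse.
    if (backfill) {
      for (int i = 0; i < candidates.length && neighbors.size() < maxConn; i++) {
        if (!mask[i]) {
          mask[i] = true;
          neighbors.add(candidates[i]);
        }
      }
    }
    return neighbors;
  }

  public static void main(String[] args) {
    // Five collinear 1-D points; node 0 is the target.
    float[][] vectors = {{0f}, {1f}, {2f}, {3f}, {4f}};
    int[] candidates = {1, 2, 3, 4};
    System.out.println(selectNeighbors(vectors, 0, candidates, 3, false)); // prints [1]
    System.out.println(selectNeighbors(vectors, 0, candidates, 3, true)); // prints [1, 2, 3]
  }
}
```

With collinear points, every candidate after the first is closer to the already-selected neighbor than to the target, so the strict criterion leaves the node with a single link. Sparse links of this kind are one plausible way a graph ends up with many disconnected components, which is the situation connectComponents then has to repair.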