msokolov commented on issue #14214:
URL: https://github.com/apache/lucene/issues/14214#issuecomment-2698452319

   I tried indexing some [NOAA climate data](https://www.ncei.noaa.gov/products/land-based-station/noaa-global-temp)
that is four-dimensional (temperature over the last 150 years for every 5-degree
lat-long patch, 5MM docs) and reproduced this problem: indexing took a very long
time, and connectComponents took even longer. As an experiment, I relaxed our
diversity constraint with a simple patch. That let indexing complete in a
reasonable time for some HNSW graph parameter choices, but other choices could
still hit the adversarial connectComponents case.
   
   ```
   diff --git a/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java b/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java
   index 2fa7fed2a0d..0aebeb3236c 100644
   --- a/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java
   +++ b/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java
   @@ -381,6 +381,16 @@ public class HnswGraphBuilder implements HnswBuilder {
            neighbors.addInOrder(cNode, cScore);
          }
        }
   +    // populate any remaining spots with non-diverse neighbors
   +    for (int i = candidates.size() - 1; neighbors.size() < maxConnOnLevel && i >= 0; i--) {
   +      if (mask[i] == false) {
   +        int cNode = candidates.nodes()[i];
   +        float cScore = candidates.scores()[i];
   +        assert cNode <= hnsw.maxNodeId();
   +        mask[i] = true;
   +        neighbors.addOutOfOrder(cNode, cScore);
   +      }
   +    }
        return mask;
      }
   ```
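
To make the effect of the patch easier to see outside Lucene, here is a minimal, self-contained sketch of diversity-based neighbor selection with an optional backfill pass. It is not Lucene's `HnswGraphBuilder`: the names `DiverseNeighborSketch`, `select`, and `similarity` are hypothetical, the negated squared Euclidean similarity is an assumption, and candidates are passed closest-first, whereas the patch walks its candidate array from the end.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not Lucene's HnswGraphBuilder. */
final class DiverseNeighborSketch {

  /** Assumed similarity: negated squared Euclidean distance, so larger means closer. */
  static float similarity(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return -sum;
  }

  /**
   * Selects up to maxConn neighbors for target from candidates ordered closest-first.
   * A candidate counts as diverse if it is strictly closer to the target than to
   * every neighbor selected so far. When backfill is true, any remaining slots are
   * filled with the closest rejected candidates, mirroring the idea of the patch.
   */
  static List<Integer> select(
      float[][] vectors, int target, int[] candidates, int maxConn, boolean backfill) {
    List<Integer> selected = new ArrayList<>();
    boolean[] used = new boolean[candidates.length];

    // Pass 1: keep only diverse candidates.
    for (int i = 0; i < candidates.length && selected.size() < maxConn; i++) {
      int c = candidates[i];
      float toTarget = similarity(vectors[c], vectors[target]);
      boolean diverse = true;
      for (int n : selected) {
        if (similarity(vectors[c], vectors[n]) >= toTarget) {
          diverse = false; // c is at least as close to an existing neighbor as to the target
          break;
        }
      }
      if (diverse) {
        selected.add(c);
        used[i] = true;
      }
    }

    // Pass 2 (the relaxation): populate remaining slots with non-diverse candidates.
    if (backfill) {
      for (int i = 0; i < candidates.length && selected.size() < maxConn; i++) {
        if (!used[i]) {
          selected.add(candidates[i]);
          used[i] = true;
        }
      }
    }
    return selected;
  }
}
```

With four collinear one-dimensional points `{0}, {1}, {2}, {3}`, target 0, candidates `1, 2, 3`, and `maxConn = 2`, the strict pass keeps only node 1 because nodes 2 and 3 are closer to node 1 than to the target. With `backfill = true` the result is nodes 1 and 2. Low-dimensional data tends to produce many such rejections, which leaves nodes with few links.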
   
   A few conclusions:
   
   1. HNSW is not the best indexing data structure for every numerical vector
data set. Points (i.e. a kd-tree) would probably be better for low-dimensional
data (< ~12 dimensions)?
   2. Our connectComponents implementation has a horrible worst case that we
need to fix.
   3. We might want to fiddle with our diversity criterion, but that isn't a
solution for (2).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

