Vikasht34 commented on issue #14214:
URL: https://github.com/apache/lucene/issues/14214#issuecomment-2646005402

   ```
   private void connectComponents() {
       BitSet visited = new BitSet(graph.size());
       List<Integer> entryPoints = new ArrayList<>();
       
       for (int node = 0; node < graph.size(); node++) {
           if (!visited.get(node)) {
               List<Integer> component = new ArrayList<>();
               exploreComponent(node, visited, component);
   
               if (!entryPoints.isEmpty()) {
                   int bestCandidate = findBestCandidate(entryPoints, 
component);
                   connectNodes(entryPoints.get(0), bestCandidate);
               }
   
               entryPoints.add(component.get(0));  // Add first node as entry
           }
       }
   }
   
   ```
   
   I see 3 problems why it could be if it is poor
   
   #### ** Inefficient Component Exploration (`exploreComponent`)**
   - Recursively visits **every unconnected node**, leading to **O(N²) 
complexity** in worst-case scenarios.
   - Causes excessive CPU utilization when dealing with **large numbers of 
unconnected components**.
   
   #### ** Unbounded Connection Attempts (`findBestCandidate`)**
   - The method **tries to optimally connect every component**, which becomes 
**extremely expensive for highly similar vectors**.
   - Leads to **exponential slowdowns** when vector clusters are densely packed 
or contain duplicates.
   
   #### ** Repeated Work (`connectNodes`)**
   - If **multiple small components exist**, the function **makes too many 
unnecessary connections**.
   - This results in **high CPU overhead** as the method attempts to fully 
optimize the graph, even when minimal connectivity is sufficient.
   
   ---
   
   
   Idea of CAP is  as a quick fix to prevent infinite execution in 
connectComponents(), but it does not solve the root problem and remains 
computationally expensive. The function still performs O(N²) connectivity 
checks but gets forcefully terminated instead of completing properly.This means 
that most of the CPU time is still wasted on redundant checks, and the graph 
may remain unoptimized or disconnected.
   
   ###  What do u think of union find to solve this ?
   We replace the **brute-force merging** with **Union-Find**, which **tracks 
components dynamically**:
   
   1. **Reduces Complexity to O(N log N)**
      - Union-Find **efficiently tracks connected components**, avoiding 
**repeated work**.  
   
   2. **Eliminates Unnecessary Checks**
      - Instead of checking **every node again**, we **only merge truly 
disconnected components**.  
   
   3. **Stops Early When Graph is Mostly Connected**
      - We **stop merging early** if **95% of the graph is already connected**, 
preventing **wasted work**.
   
   ---
   
   Some Sudo code 
   
   ```
   public class HnswGraphConnectivity {
       private final int[] parent;
       private final int[] rank;
       private final int size;
   
       public HnswGraphConnectivity(int size) {
           this.size = size;
           parent = new int[size];
           rank = new int[size];
           for (int i = 0; i < size; i++) parent[i] = i; // Each node starts in 
its own component
       }
   
       // Path compression: Makes find() O(1) amortized
       public int find(int x) {
           if (parent[x] != x) parent[x] = find(parent[x]); 
           return parent[x];
       }
   
       // Union by rank: Ensures efficient merging
       public void union(int x, int y) {
           int rootX = find(x);
           int rootY = find(y);
           if (rootX != rootY) {
               if (rank[rootX] > rank[rootY]) parent[rootY] = rootX;
               else if (rank[rootX] < rank[rootY]) parent[rootX] = rootY;
               else {
                   parent[rootY] = rootX;
                   rank[rootX]++;
               }
           }
       }
   
       public boolean isConnected(int x, int y) {
           return find(x) == find(y);
       }
   
       // Returns size of the largest connected component
       public int largestComponentSize() {
           Map<Integer, Integer> componentSize = new HashMap<>();
           for (int i = 0; i < size; i++) {
               int root = find(i);
               componentSize.put(root, componentSize.getOrDefault(root, 0) + 1);
           }
           return Collections.max(componentSize.values());
       }
   }
   ```
   
   ```
   private boolean connectComponents(int level) throws IOException {
           int graphSize = hnsw.size();
           HnswGraphConnectivity connectivity = new 
HnswGraphConnectivity(graphSize);
           
           // Step 1: Initialize connectivity tracking
           FixedBitSet notFullyConnected = new FixedBitSet(graphSize);
           int maxConn = (level == 0) ? M * 2 : M;
           List<Component> components = HnswUtil.components(hnsw, level, 
notFullyConnected, maxConn);
   
           if (infoStream.isEnabled("HNSW")) {
               infoStream.message("HNSW", "Found " + components.size() + " 
components on level=" + level);
           }
   
           // Step 2: Use parallel stream to process connections efficiently
           IntStream.range(0, components.size()).parallel().forEach(i -> {
               Component c = components.get(i);
               for (int neighbor : c.nodes()) {
                   if (neighbor != c.start()) {
                       connectivity.union(c.start(), neighbor);
                   }
               }
           });
   
           // Step 3: Stop early if graph is sufficiently connected (~95%)
           if (connectivity.largestComponentSize() > 
(CONNECTIVITY_THRESHOLD_PERCENT / 100.0) * graphSize) {
               if (infoStream.isEnabled("HNSW")) {
                   infoStream.message("HNSW", "Early stopping: " + 
CONNECTIVITY_THRESHOLD_PERCENT + "% of components are connected.");
               }
               return true;
           }
   
           // Step 4: Connect remaining components intelligently
           GraphBuilderKnnCollector beam = new GraphBuilderKnnCollector(2);
           int[] eps = new int[1];
           UpdateableRandomVectorScorer scorer = scorerSupplier.scorer();
   
           for (Component c : components) {
               if (!connectivity.isConnected(c.start(), eps[0])) {
                   beam.clear();
                   eps[0] = c.start();
                   scorer.setScoringOrdinal(eps[0]);
                   
                   // Find the best connection candidate
                   try {
                       graphSearcher.searchLevel(beam, scorer, level, eps, 
hnsw, notFullyConnected);
                       int bestNode = beam.popNode();
                       float score = beam.minimumScore();
                       link(level, bestNode, c.start(), score, 
notFullyConnected);
                       connectivity.union(bestNode, c.start());
                   } catch (Exception e) {
                       if (infoStream.isEnabled("HNSW")) {
                           infoStream.message("HNSW", "Failed to connect 
component: " + c.start());
                       }
                   }
               }
           }
   
           return true;
       }
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to