Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

via GitHub Mon, 10 Feb 2025 21:35:39 -0800


Vikasht34 commented on issue #14208:
URL: https://github.com/apache/lucene/issues/14208#issuecomment-2649836148


   @benwtrent here are my thoughts on the questions you raised:
   
   **Entry point can be updated at any time (we need to think about this)**
   
   1. Two-pass merging to handle entry point changes
        - Pass 1: Merge all layers without setting the entry point.
        - Pass 2: Re-evaluate the best entry point after merging.
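
   A minimal sketch of what pass 2 could look like, assuming pass 1 leaves us 
with the top layer of every node in a plain array. The class and method names 
are made up for illustration and are not Lucene APIs; ties go to the lowest 
node id, which is an arbitrary choice.

```java
// Hypothetical helper, not part of Lucene: picks the entry point after all
// layers have been merged (pass 2).
class EntryPointSelection {
    /**
     * @param topLevel topLevel[node] is the highest layer that node appears on
     * @return a node on the highest layer, or -1 if the graph is empty
     */
    static int pickEntryPoint(int[] topLevel) {
        int best = -1;
        for (int node = 0; node < topLevel.length; node++) {
            // Strict comparison keeps the first node seen on the top layer.
            if (best == -1 || topLevel[node] > topLevel[best]) {
                best = node;
            }
        }
        return best;
    }
}
```

   Because the entry point is only read once at the end, promotions that happen 
during pass 1 cannot leave it stale.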
    
   **That the merging needs to be able to move vector values up to a higher 
layer and/or create a new layer** 
   
   Use probabilistic layer assignment to decide, after merging each layer, 
which vectors should be promoted to the next layer up.
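
   For reference, a sketch of the usual HNSW level draw, floor(-ln(U) * mL) with 
mL = 1 / ln(M), which is what I mean by probabilistic assignment. The class and 
method names are illustrative assumptions, and maxConn must be at least 2 here.

```java
import java.util.SplittableRandom;

// Hypothetical helper, not a Lucene API: draws the top layer for a node.
class LayerAssignment {
    /** Returns a level >= 0; higher levels are exponentially less likely. */
    static int randomLevel(int maxConn, SplittableRandom random) {
        if (maxConn < 2) {
            throw new IllegalArgumentException("maxConn must be >= 2");
        }
        double ml = 1.0 / Math.log(maxConn);
        double u;
        do {
            u = random.nextDouble(); // in [0, 1); reject 0 to avoid log(0)
        } while (u == 0.0);
        return (int) (-Math.log(u) * ml);
    }
}
```

   A node would be promoted when its drawn level is higher than the layer 
currently being merged. With maxConn = 16, roughly 1 in 16 nodes would move 
past layer 0.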
   
   **but the bulk of the cost is still the bottom layer (as it has all vectors 
and it's all eagerly allocated).**
   
   **1. Batch Processing for Bottom Layer Instead of Eager Allocation**
         - Instead of eagerly allocating all vectors, we process them in 
batches to reduce peak memory usage.
         - Each batch of vectors is merged and committed incrementally, 
preventing a large spike in memory consumption.
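
   A rough sketch of the batching loop, assuming vectors are addressed by 
ordinal in [0, numVectors). BatchedMerge and BatchHandler are hypothetical 
names; the handler stands in for "merge and commit this range".

```java
// Hypothetical sketch, not a Lucene API: walks the ordinals in fixed-size
// batches so only one batch needs to be resident at a time.
class BatchedMerge {
    interface BatchHandler {
        /** Processes ordinals in [from, to). */
        void handle(int from, int to);
    }

    /** Returns the number of batches processed. */
    static int forEachBatch(int numVectors, int batchSize, BatchHandler handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        for (int from = 0; from < numVectors; from += batchSize) {
            int to = Math.min(from + batchSize, numVectors);
            handler.handle(from, to); // the last batch may be shorter
            batches++;
        }
        return batches;
    }
}
```

   Peak heap would then scale with batchSize rather than with the total vector 
count, at the cost of more commit steps.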
       
   
   **2. On-the-Fly Streaming Instead of Full Materialization**
          - Instead of fully storing neighbor lists in RAM, we load them lazily 
through a proposed accessor such as getNeighborsLazy().
          - This means we only retrieve neighbors when needed, preventing 
unnecessary memory overhead.
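
   To make the lazy loading idea concrete, here is a sketch in which neighbor 
lists are produced on demand by a loader function instead of being held in an 
array up front. LazyNeighborSource and the loader are assumptions for 
illustration; getNeighborsLazy() does not exist in Lucene today.

```java
import java.util.function.IntFunction;

// Hypothetical sketch: neighbors are fetched per call and never cached here,
// so nothing is materialized until a node is actually visited.
class LazyNeighborSource {
    private final IntFunction<int[]> loader;
    private int loadCount = 0;

    LazyNeighborSource(IntFunction<int[]> loader) {
        this.loader = loader;
    }

    /** Loads the neighbor list of the given node at the moment it is needed. */
    int[] getNeighborsLazy(int node) {
        loadCount++;
        return loader.apply(node);
    }

    /** Number of neighbor lists loaded so far. */
    int loadCount() {
        return loadCount;
    }
}
```

   In a real merge the loader would read from the segment being merged, so the 
trade-off is repeated reads in exchange for a smaller resident set.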
   
   **3. Multi-Threaded Processing for the Bottom Layer**
         - We distribute the bottom-layer merge across multiple CPU cores.
         - This replaces the single-threaded bottleneck with a merge that 
runs in parallel.
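
   A sketch of how the work could be split, assuming node insertion is already 
safe to call from several threads (in a real graph builder that needs locking 
around neighbor updates, which is not shown). ParallelBottomLayer and 
insertNode are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntConsumer;

// Hypothetical sketch: splits the ordinals into contiguous chunks, one per worker.
class ParallelBottomLayer {
    static void processInParallel(int numVectors, int numWorkers, IntConsumer insertNode) {
        ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
        try {
            List<Future<?>> futures = new ArrayList<>();
            int chunk = (numVectors + numWorkers - 1) / numWorkers; // ceiling division
            for (int w = 0; w < numWorkers; w++) {
                final int from = Math.min(w * chunk, numVectors);
                final int to = Math.min(from + chunk, numVectors);
                futures.add(pool.submit(() -> {
                    for (int node = from; node < to; node++) {
                        insertNode.accept(node); // must be thread-safe
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for completion and surface worker failures
            }
        } catch (Exception e) {
            throw new RuntimeException("bottom-layer merge failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

   Note that each worker holds its own working state, so this speeds up the 
merge but does not by itself reduce heap usage.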
       
   **4. Graph Sparsification (Reducing Redundant Connections)**
         - Instead of blindly keeping all connections, we prune redundant 
edges.
         - Uses HNSW's natural property of diverse neighbors to reduce 
connections intelligently, keeping only the most useful ones.
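
   The pruning I have in mind is the standard HNSW diversity heuristic: walk 
candidates from nearest to farthest and keep one only if it is closer to the 
base node than to every neighbor already kept. A sketch with squared Euclidean 
distance follows; the class and method names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of diversity-based neighbor selection, not a Lucene API.
class DiversityPruning {
    static float sqDist(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static List<Integer> selectDiverse(
            float[][] vectors, int base, List<Integer> candidates, int maxConn) {
        // Nearest candidates first; the sort is stable, so ties keep input order.
        List<Integer> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(c -> sqDist(vectors[c], vectors[base])));

        List<Integer> selected = new ArrayList<>();
        for (int c : sorted) {
            if (selected.size() >= maxConn) {
                break;
            }
            float distToBase = sqDist(vectors[c], vectors[base]);
            boolean diverse = true;
            for (int s : selected) {
                // c is redundant if an already kept neighbor is closer to it than base is.
                if (sqDist(vectors[c], vectors[s]) < distToBase) {
                    diverse = false;
                    break;
                }
            }
            if (diverse) {
                selected.add(c);
            }
        }
        return selected;
    }
}
```

   For example, with the base at (0, 0) and candidates at (1, 0), (1.1, 0) and 
(0, 1), the point at (1.1, 0) is dropped because (1, 0) already covers that 
direction.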
         
   **5. Union-Find for Efficient Component Merging**
         - Avoids redundant merging of connected components.
         - Union-Find tracks which vectors already share a component, so each 
component is joined only once, preventing wasted CPU cycles.
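
   A compact Union-Find sketch with path halving and union by size, to show how 
the "already connected" check stays cheap. This is the textbook structure, not 
code taken from Lucene, and the class name is only illustrative.

```java
// Textbook disjoint-set structure; union() reports whether a merge happened.
class UnionFind {
    private final int[] parent;
    private final int[] size;
    private int components;

    UnionFind(int n) {
        parent = new int[n];
        size = new int[n];
        components = n;
        for (int i = 0; i < n; i++) {
            parent[i] = i;
            size[i] = 1;
        }
    }

    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }

    /** Returns false if a and b were already in the same component. */
    boolean union(int a, int b) {
        int ra = find(a);
        int rb = find(b);
        if (ra == rb) {
            return false;
        }
        if (size[ra] < size[rb]) { // attach the smaller tree under the larger
            int tmp = ra;
            ra = rb;
            rb = tmp;
        }
        parent[rb] = ra;
        size[ra] += size[rb];
        components--;
        return true;
    }

    int components() {
        return components;
    }
}
```

   During a merge, an edge between components would only be added when union() 
returns true, so already connected components are skipped.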
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


