jbellis opened a new pull request, #12421:
URL: https://github.com/apache/lucene/pull/12421

   ### Motivation
   I need to support concurrent reads and writes to an HNSW index for 
Cassandra.  One option would be to partition the document range and assign each 
range to a single thread with the existing implementation.  However,
   * It is much faster to query a single on-heap hnsw index than to query multiple such indexes and combine the results (log(N) vs M log(N/M), where N is the number of documents and M is the number of partitions).
   * I ultimately need to write the combined index to disk, so even with some contention necessarily occurring while building the index, we still come out far ahead in total efficiency versus creating per-thread indexes and combining them. Combining the per-thread indexes boils down to "pick the largest and then add all the other nodes normally": you don't really benefit from having computed the others previously, and you end up doing close to 2x the work.
   
   ### Performance
   Numbers are from my SIFT test harness at https://github.com/jbellis/hnswdemo/tree/lucene-bench. Numbers are averaged across 5 test runs on my i9-12900 (8 performance cores and 8 efficiency cores).
   
   Serial construction of 1M nodes: 292.3s
   Parallel construction, 1 thread: 379.0s = 129.7% of serial
   Parallel construction, 2 threads: 191.3s = 50.5% of parallel/1
   Parallel construction, 4 threads: 96.1s = 25.4% of parallel/1
   Parallel construction, 8 threads: 52.6s = 13.8% of parallel/1
   
   Serial queries of 100k vectors / top 100: 38.3s
   Parallel queries, 1 thread: 41.6s = 108.6% of serial
   Parallel queries, 2 threads: 21.0s = 50.5% of parallel/1
   Parallel queries, 4 threads: 10.3s = 24.7% of parallel/1
   Parallel queries, 8 threads: 5.3s = 12.7% of parallel/1
   
   To summarize, there is about a 30% overhead during construction and a 10% overhead at query time from using the concurrent classes.  The concurrent builder parallelizes construction nearly perfectly through 4 threads, with some additional inefficiency becoming visible at 8.  (This is probably the effect of having to do more vector comparisons across the concurrently added nodes; I would expect it to remain relatively small rather than explode as thread counts increase.)  Uncontended queries scale close to perfectly to at least 8 threads.
   
   ### Design and implementation
   ConcurrentOnHeapHnswGraph is very similar to OnHeapHnswGraph, but with concurrent backing Collections.  The main addition is a getView operation that provides a threadsafe snapshot of the graph for searches.  The View uses a CompletionTracker class to provide a kind of snapshot isolation; otherwise it is impossible to prevent partially added nodes from causing very difficult-to-debug race conditions.  This is used during construction as well as for user-invoked searches.
   
   (The initial CompletionTracker implementation was just an AtomicInteger 
clock and a Map<node Integer, clock Integer>, but the constant box/unbox 
introduced significant CPU and GC overhead.  The current implementation is a 
result of optimizing that.)
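   To illustrate the snapshot-isolation idea, here is a minimal sketch of a completion clock: a logical timestamp is stamped on each node once it is fully linked, and a View only "sees" nodes stamped before the View was created.  The names and array-growth details are illustrative, not the PR's actual implementation (which avoids boxing via primitive storage, as described above).

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch of a completion clock for snapshot isolation (illustrative, not the
 * PR's code).  A node becomes visible to a View only if it finished being
 * added before the View captured its timestamp.
 */
class CompletionTracker {
  private final AtomicInteger clock = new AtomicInteger();
  private volatile int[] completionTimes; // indexed by node ordinal; 0 = not complete

  CompletionTracker(int initialCapacity) {
    completionTimes = new int[Math.max(1, initialCapacity)];
  }

  /** Called once a node is fully linked into the graph. */
  synchronized void markComplete(int node) {
    if (node >= completionTimes.length) {
      completionTimes =
          Arrays.copyOf(completionTimes, Math.max(node + 1, completionTimes.length * 2));
    }
    completionTimes[node] = clock.incrementAndGet();
  }

  /** Timestamp a View captures at creation time. */
  int getCurrentClock() {
    return clock.get();
  }

  /** True if {@code node} was completely added before the view's timestamp. */
  boolean isVisible(int node, int viewClock) {
    int[] times = completionTimes; // single volatile read
    return node < times.length && times[node] != 0 && times[node] <= viewClock;
  }
}
```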
   
   ConcurrentHnswGraphBuilder adds an incremental RAM-used estimate as a return value to addGraphNode, and a buildAsync method that takes an ExecutorService for fine-grained control over parallelism.  Otherwise, it follows the API of HnswGraphBuilder closely.  (Closely enough that my original PR extracted a common interface and added factories so that they could be plugged in interchangeably, but this is now removed after Michael’s feedback.)
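   The pattern buildAsync enables can be sketched as follows.  This is not the PR's API, just an illustration of the shape: each node addition is a task on a caller-supplied ExecutorService, and the per-node RAM estimates returned by addGraphNode are accumulated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of the buildAsync pattern (illustrative names, hypothetical Builder
 * interface): submit one addGraphNode task per node, wait for all of them,
 * and sum the incremental RAM estimates.
 */
class ParallelBuildSketch {
  interface Builder {
    long addGraphNode(int node); // returns bytes added, per the PR's description
  }

  static long buildAsync(Builder builder, int nodeCount, ExecutorService pool)
      throws InterruptedException, ExecutionException {
    AtomicLong ramUsed = new AtomicLong();
    List<Future<?>> futures = new ArrayList<>();
    for (int i = 0; i < nodeCount; i++) {
      int node = i;
      futures.add(pool.submit(() -> ramUsed.addAndGet(builder.addGraphNode(node))));
    }
    for (Future<?> f : futures) {
      f.get(); // propagate any build errors
    }
    return ramUsed.get();
  }
}
```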
   
   The key to achieving good concurrency while maintaining correctness without synchronization is that we track in-progress node additions across all threads in a ConcurrentSkipListSet.  After adding ourselves in addGraphNode, we take a snapshot of this set (this is O(1) for CSLS), and consider all other in-progress updates as neighbor candidates (subject to normal level constraints).
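   A minimal sketch of this register/snapshot/deregister dance (names are illustrative): each thread adds its node to the shared set before linking, snapshots the set so that other in-flight insertions can be considered as candidates, and removes itself once fully linked.

```java
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;

/**
 * Sketch of in-progress insertion tracking (illustrative, not the PR's code).
 * The snapshot returned by beginAdd contains the *other* nodes concurrently
 * being inserted, which the caller treats as additional neighbor candidates.
 */
class InProgressTracker {
  private final ConcurrentSkipListSet<Integer> inProgress = new ConcurrentSkipListSet<>();

  NavigableSet<Integer> beginAdd(int node) {
    inProgress.add(node);
    ConcurrentSkipListSet<Integer> snapshot = inProgress.clone();
    snapshot.remove(node); // don't consider ourselves as our own neighbor
    return snapshot;
  }

  void endAdd(int node) {
    inProgress.remove(node);
  }
}
```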
   
   In general, the main concern with the concurrent builder compared to the serial one is making sure that partially complete operations never “leak” to other threads.  In particular,
   * Neighbor manipulation has been encapsulated in ConcurrentNeighborSet.  CNS implements a copy-on-write NeighborArray; my initial implementation used a ConcurrentSkipListSet, but this was significantly slower, since even during construction there are many more “iterate through the neighbors” operations than “change the neighbors” operations.  We have to subclass NeighborArray here to be able to efficiently handle the case of concurrently inserting the same node (as a forward-link and a back-link).
   * Entry point updating is not done until the new node has been fully added.
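   The copy-on-write idea can be sketched as follows (illustrative, not the PR's ConcurrentNeighborSet): readers iterate a plain array without locking, writers copy-modify-CAS, and inserting an already-present node (the forward-link racing the back-link) is a harmless no-op.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Sketch of a copy-on-write neighbor list (illustrative names).  Iteration is
 * lock-free over an immutable array; mutation copies the array and publishes
 * it with compareAndSet, retrying on contention.
 */
class CowNeighbors {
  private final AtomicReference<int[]> neighbors = new AtomicReference<>(new int[0]);

  /** Safe to iterate concurrently; the array is never mutated in place. */
  int[] get() {
    return neighbors.get();
  }

  void insert(int node) {
    while (true) {
      int[] cur = neighbors.get();
      if (Arrays.stream(cur).anyMatch(n -> n == node)) {
        return; // already linked (e.g. forward-link raced the back-link)
      }
      int[] next = Arrays.copyOf(cur, cur.length + 1);
      next[cur.length] = node;
      if (neighbors.compareAndSet(cur, next)) {
        return;
      }
    }
  }
}
```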
   
   One more point is a little subtle:
   * When a new node is added, it considers both existing nodes in the graph as 
candidates, as well as other nodes concurrently in the process of being added. 
It does this by taking a snapshot of the "insertionsInProgress" set when the 
node begins adding itself, and using both the snapshot and the current graph 
state as candidate sources. This allows the graph to remain connected even when 
many nodes are added concurrently.
   * The subtle part is that we compute diversity *separately* for the 
fully-added and the in-progress candidates.  This is because there is an 
asymmetry between initial links (you need to be diverse or you’re discarded) 
and backlinks added later (diversity is not enforced again until you bump up 
against max connections).  Treating them separately allows us to get very 
similar results to what you would get adding each node serially.  See the 
discussion in addGraphNode about over-pruning for an example.
   * I think we could simplify this linking of new nodes to consider fully-added and in-progress candidates as a single group instead of separately, if we implemented the paper’s “keep pruned connections” suggestion.  I have experimented with this, and I think the recall-vs-memory-used benefits are good, but it deserves a followup ticket.
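   For reference, the standard HNSW diversity heuristic that gets applied (separately per candidate source, as described above) looks roughly like this.  The method name and distance-function plumbing are illustrative; a candidate is kept only if it is closer to the new node than to every neighbor already kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

/**
 * Sketch of the HNSW neighbor-diversity heuristic (illustrative, not the
 * PR's code).  Candidates must be sorted by ascending distance to the base
 * node; a candidate is discarded if some already-kept neighbor "shadows" it,
 * i.e. is closer to the candidate than the base node is.
 */
class DiversitySketch {
  static List<Integer> selectDiverse(
      int base, List<Integer> sortedCandidates, BiFunction<Integer, Integer, Double> dist) {
    List<Integer> kept = new ArrayList<>();
    for (int c : sortedCandidates) {
      boolean diverse = true;
      for (int k : kept) {
        if (dist.apply(c, k) < dist.apply(c, base)) {
          diverse = false; // k shadows c: c is redundant with an existing edge
          break;
        }
      }
      if (diverse) {
        kept.add(c);
      }
    }
    return kept;
  }
}
```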
   
   The main graph test suites are identical except for changes to perform concurrent graph builds instead of serial ones, and the addition of testing the incremental RAM usage estimate from ConcurrentHnswGraphBuilder::addGraphNode.  I also added new tests for ConcurrentNeighborSet.
   
   ### Minor changes to existing code:
   * HnswGraphSearcher gets a new searchConcurrent method that uses a GrowableBitSet, since the size of the graph may change during the search.  For the same reason, removed "assert friendOrd < size" from the search methods.
   * Moved JavaUtilBitSet out of BaseBitSetTestCase and renamed to 
GrowableBitSet.
   * Several NeighborArray fields moved to protected since we're subclassing it now.
   * Updated OnHeapHnswGraph ramBytesUsed for the TreeMap -> HashMap change made in 3c163745, although apparently it’s close enough either way to mostly not matter.
   * Added --add-opens in build.gradle for tests to allow RamUsageEstimator to 
compute exact values when checking correctness of my estimates.
   * Added HnswGraph.addNode (with a default UnsupportedOperationException implementation) to document the shared expectations in one place.
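   On the GrowableBitSet point above: the reason a growable visited-set matters is that with a concurrent graph, node ordinals beyond the bitset's initial size can appear mid-search, so out-of-range sets must grow the storage rather than trip a bounds check.  A minimal sketch of that semantic (illustrative names; java.util.BitSet already grows on demand, which is what the promoted test helper relies on):

```java
import java.util.BitSet;

/**
 * Sketch of a growable visited-set (illustrative, not Lucene's class).
 * Setting a bit beyond the initial capacity grows the backing storage
 * instead of throwing, which a fixed-size bitset would not tolerate when
 * the graph grows during a search.
 */
class GrowableBitSetSketch {
  private final BitSet bits;

  GrowableBitSetSketch(int initialBits) {
    bits = new BitSet(initialBits);
  }

  void set(int i) {
    bits.set(i); // BitSet grows automatically if i exceeds the current size
  }

  boolean get(int i) {
    return bits.get(i);
  }
}
```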
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
