IndexFingerprint and Leader Election Slowness

Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) Thu, 01 May 2025 09:57:12 -0700

Hey all,

Can we talk for a moment about index fingerprinting? For the uninitiated,
here is the original ticket, https://issues.apache.org/jira/browse/SOLR-8586, as
well as the condition that motivated it,
https://issues.apache.org/jira/browse/SOLR-8129.

In short, when a leader fails during distributed update fan-out, some replicas
may get updates that others miss. Normally a new leader fills in any of its
gaps from other replicas, but if soft-commits have flushed those updates from
the t-log, comparing logs won’t catch the discrepancy. To avoid split-brain,
Solr currently checksums all non-deleted document versions in the index, since
versions are unique per shard.
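
To make the idea concrete, here is a minimal sketch of an order-independent
checksum over document versions. It is not Solr's actual code: the class and
method names are hypothetical, plain long arrays stand in for the _version_
values of live documents, and the mixing function (the MurmurHash3 64-bit
finalizer) is my assumption about a reasonable hash.

```java
public class VersionFingerprintSketch {

    // MurmurHash3 64-bit finalizer; the choice of mixer is an assumption.
    static long mix64(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    // Sums the mixed versions. Addition wraps modulo 2^64 and is commutative,
    // so the result does not depend on the order documents are visited in.
    static long fingerprint(long[] liveVersions) {
        long hash = 0L;
        for (long version : liveVersions) {
            hash += mix64(version);
        }
        return hash;
    }

    public static void main(String[] args) {
        long leader = fingerprint(new long[] {101L, 102L, 103L});
        long replica = fingerprint(new long[] {103L, 101L, 102L});
        long lagging = fingerprint(new long[] {101L, 102L});
        System.out.println("same docs, different order match: " + (leader == replica));
        System.out.println("missing update detected: " + (leader != lagging));
    }
}
```

Two replicas holding the same set of versions produce the same value no matter
how their segments are laid out, which is what makes a sum like this usable
for comparing replicas.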

However, this fingerprinting can be a very slow process that blocks leader
election. In our investigation of election latency, Matt Biscocho and I saw it
take 10–60 seconds on large shards (hundreds of millions of docs).

A few observations and questions:

1) Can we parallelize the checksum? It's tempting to just slap a
parallelStream here
https://github.com/apache/solr/blob/25309f64685a8b70a3bb79c4a07eb8e005724600/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L2551,
although we’re wary of using the ForkJoinPool commonPool in a giant project
like Solr. Don't want to derail the main discussion but I've seen some arguable
misuse of parallelStream here, i.e. using it for blocking work.
2) Is fingerprinting relevant for TLOG+PULL? Since TLOG replicas only index
when elected and otherwise bypass indexing, is full index fingerprinting still
necessary? Segment downloads already include TCP-layer checksums.
3) Can fingerprinting be done more eagerly? Doing this work only at election
time stalls everything. Could we subtract deletions incrementally (as Yonik
originally suggested) and use a custom IndexReaderWarmer to hash new
segments on open? In-place updates are trickier but are a minority use case (at
least for us).
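
To illustrate points 1 and 3, here is a rough sketch under simplifying
assumptions. Segments are modeled as plain long arrays of versions, all names
are hypothetical, and nothing here touches real Solr or Lucene APIs. It hashes
each segment as a separate task on a caller-supplied ExecutorService (so the
ForkJoinPool commonPool is never involved), and it shows that with an additive
hash a deletion can be applied by subtracting that document's contribution
instead of rescanning the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SegmentFingerprintSketch {

    // MurmurHash3 64-bit finalizer; the choice of mixer is an assumption.
    static long mix64(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    // Additive hash of one segment's live versions.
    static long hashSegment(long[] versions) {
        long hash = 0L;
        for (long version : versions) {
            hash += mix64(version);
        }
        return hash;
    }

    // One task per segment on a dedicated pool; partial sums are then combined.
    static long parallelFingerprint(List<long[]> segments, ExecutorService pool) {
        List<Callable<Long>> tasks = new ArrayList<>();
        for (long[] segment : segments) {
            tasks.add(() -> hashSegment(segment));
        }
        try {
            long total = 0L;
            for (Future<Long> partial : pool.invokeAll(tasks)) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fingerprinting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("segment hashing failed", e);
        }
    }

    // Incremental update: remove a deleted document's contribution.
    static long subtractDeleted(long fingerprint, long deletedVersion) {
        return fingerprint - mix64(deletedVersion);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<long[]> segments = List.of(new long[] {1L, 2L}, new long[] {3L, 4L});
            long full = parallelFingerprint(segments, pool);
            long afterDelete = subtractDeleted(full, 3L);
            long rescanned = hashSegment(new long[] {1L, 2L, 4L});
            System.out.println("incremental matches rescan: " + (afterDelete == rescanned));
        } finally {
            pool.shutdown();
        }
    }
}
```

The same additivity is what would let per-segment hashes be computed once when
a segment is opened and then cached, leaving only a cheap sum at election
time. In-place updates would still need separate handling, as noted above.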

We're also eager to hear any other ideas, or how you've configured Solr to
avoid this issue altogether. We're somewhat surprised we haven't seen more
chatter about this on the mailing list.

Thanks,
Luke
