[D] Async Geo-Replication and Cluster Down Scenarios [pulsar]

via GitHub Tue, 14 Apr 2026 02:50:27 -0700


GitHub user shasank112001 created a discussion: Async Geo-Replication and 
Cluster Down Scenarios


With async geo-replication between two clusters, I understand that each cluster 
runs a special replicator that consumes incoming messages, adds metadata 
identifying the originating cluster, and then publishes them to the other side. 
I also understand that on both sides you can configure a QueueSize for each 
replicator, so that it reads from the ledger in chunks.
However, the metrics also include **replication-backlog**, which tracks how 
many messages still need to be replicated to the other side. I couldn't find 
any property to control the size of this replication backlog.

I have the following scenario:
1. Cluster A with topic X configured to replicate towards Cluster B.
2. Cluster B with topic X configured to replicate towards Cluster A.

Now, let's assume that in a failure scenario Cluster A goes down, so clients 
produce to and consume from Cluster B only. Because Cluster A is down, the 
replicator cannot replicate any messages to it. The replication backlog 
therefore keeps growing, while the topic backlog stays near 0, since nearly 
all messages being produced are also consumed.

This scenario is a problem because storage can fill up: messages that 
consumers have already "acknowledged" are still waiting to be replicated.
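
To make the scenario concrete, here is a minimal toy model in Python. It is 
not Pulsar code and uses no Pulsar API: the class, the cursor names and the 
rates are all made up for illustration. It assumes that the broker has to keep 
every message that the slowest cursor has not yet passed, and that the 
replicator behaves like one more cursor on the topic.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTopic:
    """Toy model of topic X on Cluster B (hypothetical, not a Pulsar API)."""
    published: int = 0  # total messages written so far
    cursors: dict = field(default_factory=dict)  # cursor name -> messages processed

    def add_cursor(self, name):
        # A new cursor starts at the current end of the topic.
        self.cursors[name] = self.published

    def publish(self, count=1):
        self.published += count

    def advance(self, name, count=1):
        # A cursor can never move past the last published message.
        self.cursors[name] = min(self.published, self.cursors[name] + count)

    def backlog(self, name):
        # Messages published but not yet processed by this cursor.
        return self.published - self.cursors[name]

    def retained(self):
        # Assumption: storage is held back to the slowest cursor.
        slowest = min(self.cursors.values(), default=self.published)
        return self.published - slowest


def simulate(ticks, rate=100, remote_up=False):
    """Produce and consume `rate` messages per tick on Cluster B."""
    topic = ToyTopic()
    topic.add_cursor("local-consumer")
    topic.add_cursor("replicator-to-A")
    for _ in range(ticks):
        topic.publish(rate)
        topic.advance("local-consumer", rate)  # consumers keep up
        if remote_up:
            topic.advance("replicator-to-A", rate)  # only when A is reachable
    return topic


if __name__ == "__main__":
    for up in (True, False):
        t = simulate(ticks=60, remote_up=up)
        print(
            f"remote_up={up}: topic backlog={t.backlog('local-consumer')}, "
            f"replication backlog={t.backlog('replicator-to-A')}, "
            f"retained={t.retained()}"
        )
```

In this model, while Cluster A is unreachable the consumer backlog stays at 0, 
but the retained count grows with every published message, which is the 
behaviour I am worried about.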

Another question is how messageTTL works with the replicator. If I send a 
message to Cluster B with a TTL of 5 seconds, and the replicator cannot 
replicate it because Cluster A is down, will the message also be removed from 
the replication backlog after 5 seconds?


GitHub link: https://github.com/apache/pulsar/discussions/25519

