Thanks, Andrew, for the detailed feedback and the corrections. I appreciate the context on the testing effort behind KIP-932; point well taken on the operational commitment required to push these through.
Taking your feedback into account, I want to clarify a few points and outline how I plan to proceed with the features we aligned on.

Regarding point 2 (Archiving state): I stand corrected on the unbounded memory pressure; you are right that the in-flight limits bound the memory footprint. However, my concern shifts to liveness. If a DLQ outage persists, the share group will eventually hit the group.share.partition.max.record.locks threshold. At that point the broker essentially stops yielding new records for that partition until locks expire or clear, creating a head-of-line blocking scenario that stalls the primary pipeline. I still believe an errors.deadletterqueue.write.timeout.ms fall-through is necessary to prevent a secondary-topic outage from taking down the primary stream, but we can table that for a future discussion.

Regarding points 1 and 3 (Atomicity and Bounded Retries): you make a fair point. Enforcing a broker-side retry cap directly contradicts the strict auditability requirement I raised in point 1. I will drop the bounded-retry suggestion and instead track KIP-1289; transactional acknowledgements are the right architectural fix for the atomicity gap.

Regarding point 6 (DLQ Isolation): I respect the community's preference for flexibility here. Forcing a hard coordinator check removes options for teams managing multi-tenant clusters. I am withdrawing that suggestion.

For the actionable items (points 4 and 5), I am taking your advice and moving forward with official proposals to get these into the pipeline.

First, to hit the AK 4.4 window, I have drafted KIP-1299 for the Mandatory DLQ Disposition Header. The plan is to reserve the _dlq.errors.* namespace (similar to how Kafka Connect handles its error context) and introduce an opt-out config such as share.group.dlq.disposition.header.enable.
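To make the header semantics concrete, here is a minimal operator-side sketch in plain Java. It is illustrative only: the DlqDisposition class and classify helper are hypothetical names of mine, and the assumption that the header value is the UTF-8-encoded disposition name is my own, not something the draft has pinned down yet.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Sketch only: models the proposed _dlq.errors.disposition header from the
// KIP-1299 draft. Class, method, and value encoding are illustrative
// assumptions, not broker code or a shipped API.
public class DlqDisposition {

    // Proposed header key under the reserved _dlq.errors.* namespace.
    static final String HEADER_KEY = "_dlq.errors.disposition";

    enum Disposition { MAX_DELIVERY_ATTEMPTS_REACHED, CLIENT_REJECTED, UNKNOWN }

    // Classify a DLQ record from its headers; Kafka record header values
    // are byte arrays, assumed here to hold the UTF-8 disposition name.
    static Disposition classify(Map<String, byte[]> headers) {
        byte[] raw = headers.get(HEADER_KEY);
        if (raw == null) {
            return Disposition.UNKNOWN; // record predates the header, or opt-out
        }
        try {
            return Disposition.valueOf(new String(raw, StandardCharsets.UTF_8));
        } catch (IllegalArgumentException e) {
            return Disposition.UNKNOWN; // future disposition values degrade gracefully
        }
    }
}
```

The point of the UNKNOWN fallback is forward compatibility: tooling built against today's header should not break if a later KIP adds new disposition values.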
The broker is the authoritative source for whether a message failed due to MAX_DELIVERY_ATTEMPTS_REACHED versus a CLIENT_REJECTED NACK, so attaching this at the broker level is the cleanest path for observability.

Draft: https://cwiki.apache.org/confluence/display/KAFKA/KIP-1299%3AMandatory+DLQ+Disposition+Header+for+Share+Groups

Second, I have opened KIP-1298 to formally propose the Circuit Breaker for Share Group DLQ Overflow. Your suggestion to leverage the delivery-pause mechanics proposed in KIP-1249 is exactly the right approach; building on top of that state-machine logic will keep this feature much simpler to implement and test.

Draft: https://cwiki.apache.org/confluence/display/KAFKA/KIP-1298%3ACircuit+Breaker+for+Share+Group+DLQ+Overflow

I will start formal DISCUSS threads for both KIPs shortly. Thanks again for the architectural steer.

*Note on KIP numbering:* I am aware that there are currently overlapping claims for KIP-1298 and KIP-1299 on the mailing list. I originally created the wiki drafts for both of my proposals on March 14, 2026. This predates the conflicting discussion thread for KIP-1298, which was initiated in April 2026, as well as the overlapping KIP-1299 proposal. I have already reached out to the respective authors offline to coordinate and resolve the collision, so I have retained my originally drafted numbers in the links above.

Best regards,
Vaquar Khan

On Sun, 8 Mar 2026 at 06:58, Andrew Schofield <[email protected]> wrote:

> Hi Vaquar,
> Thanks for your interest in KIP-1191.
>
> 1) As the KIP states, the DLQ writes are not entirely atomic. I do take
> the point that this might not be adequate for highly regulated industries.
> However, it is acceptable to state such a limitation in a KIP and then a
> follow-on KIP can be used to tighten up the semantics if the community
> feels the need. The provision of a DLQ mechanism for share groups is a
> major enhancement even with this proviso.
>
> KIP-1289 is also going to be important for users who care deeply about
> atomicity. That one is only in the early stages of discussion, but it
> will bring transactional acknowledgement for share groups. I expect
> transactional DLQ writes could build upon that KIP.
>
> 2) You are not correct about the unbounded memory pressure. Archiving
> records are considered in-flight and the number of in-flight records per
> partition is limited already. So, a DLQ write problem will throttle
> delivery of additional records, which is inconvenient but not fatal.
>
> 3) This is interesting but of course it takes us back in the direction
> of breaking the 1:1 audit trail requirement you mentioned in (1). If we
> give up after a bounded number of retries, what then?
>
> 4) The circuit breaker idea is potentially interesting. KIP-1249 proposes
> the ability to pause delivery, so that might be a helpful building block.
>
> 5) A disposition header is also interesting. We’ll think about this. Given
> that KIP-1191 is aiming for AK 4.4, there’s time for another micro-KIP
> without affecting the intended schedule.
>
> 6) I disagree. Different organisations will have different policies for
> DLQs. Some might want a single DLQ for the entire cluster, while others
> might want a separate DLQ for each share group. The flexibility is
> intentional and there is no right answer.
>
> If you’re interested in progressing (4), I encourage you to contribute a
> KIP. Be aware that doing so implies that you will be able to marshal the
> resources to get the KIP implemented to production quality, and there
> would be a significant amount of testing required. The team working on
> KIP-932 spent the majority of the time between AK 4.1 and 4.2 testing.
> We had automated soak tests running for months and progressively fixed
> many defects. Contributing a spec is not sufficient by itself.
>
> Thanks,
> Andrew
>
>
> > On 8 Mar 2026, at 08:02, vaquar khan <[email protected]> wrote:
> >
> > Hi Andrew and team,
> >
> > Congrats on the KIP passing. The design is really solid and much needed
> > for the "Queues for Kafka" roadmap. I've been tied up, but finally had
> > a chance to look at the implementation path for share groups and wanted
> > to flag a few "day 2" operational risks. In my experience with
> > high-throughput pipelines, these are the edge cases that usually lead
> > to 2 AM outages if the broker-side logic isn't tightened up before GA.
> >
> > 1. Coordinator Failover & Duplicates
> > The KIP admits that DLQ writes and state topic updates aren't atomic,
> > meaning a coordinator failover (and PID reset) will cause duplicates.
> > For anyone in finance or regulated industries, this breaks the 1:1
> > audit trail we rely on for compliance. This is a critical gap. We need
> > a clear plan for deduplication during the coordinator recovery path.
> >
> > 2. Handling a Stuck ARCHIVING State
> > If the DLQ topic goes offline or hits a leader election, we can't let
> > records sit in ARCHIVING indefinitely. Without a configurable
> > errors.deadletterqueue.write.timeout.ms, records could stay stuck
> > during a sustained outage, creating unbounded memory pressure. I'd
> > suggest a fall-through to ARCHIVED with a logged error to keep the
> > system alive if the DLQ is unreachable.
> >
> > 3. Bounded Retries on the Broker
> > The KIP mentions retrying on metadata/leadership issues but doesn't
> > specify a limit. I'd propose a new config,
> > errors.deadletterqueue.write.retries, to provide a clean exit
> > condition. Without a cap, a total partition failure could trigger an
> > indefinite retry loop, wasting broker I/O and CPU.
> >
> > 4. Circuit Breaker for Systemic Failures
> > This is the most critical point for me.
> > If a downstream service dies, the share group will hit the delivery
> > limit for every message, effectively draining the main topic into the
> > DLQ in minutes. This kills message order and makes re-processing a
> > nightmare. I'd propose a threshold: if >20% of messages hit the DLQ in
> > a rolling window, the group should PAUSE. It's always safer to stop
> > the group than to dump the whole topic.
> >
> > 5. Mandatory Disposition Headers
> > Since the broker already knows if a record failed due to
> > MAX_DELIVERY_ATTEMPTS_REACHED vs. an explicit CLIENT_REJECTED NACK, we
> > should make that a mandatory _dlq.errors.disposition header. Without
> > it, operators can't distinguish a poison pill from a systemic timeout
> > without digging through broker logs.
> >
> > 6. DLQ Ownership Check
> > We should add a check at the coordinator level to ensure a DLQ topic
> > isn't shared by multiple groups. Cross-contamination makes the DLQ
> > useless for debugging if you're seeing failures from unrelated
> > applications in the same stream.
> >
> > I'm particularly interested in your thoughts on the circuit breaker
> > and the write timeouts, as those seem like the biggest stability risks
> > at scale. Happy to help spec either of these out if the team finds
> > them worthwhile.
> >
> > Best regards,
> > Vaquar Khan
> > *Linkedin* - https://www.linkedin.com/in/vaquar-khan-b695577/
> > *Book* -
> > https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
> > *GitBook* - https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
> > *Stack* - https://stackoverflow.com/users/4812170/vaquar-khan
> > *github* - https://github.com/vaquarkhan
> >
