Re: [Discuss] FIP-9: Support CoordinatorServer High Availability

cheng z Mon, 09 Mar 2026 02:00:01 -0700

Hi Jark!
  Thank you for your suggestions. I generally agree with your comments on
the current FIP.

1) Coordinator Epoch Implementation
Yes, I will check all requests from the coordinator to the tablet servers
to ensure the epoch fencing logic is in place.
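
As a rough illustration of the check on the tablet server side, here is a
minimal sketch in Python (chosen for brevity). The class and method names
are hypothetical and are not taken from the Fluss code base; the only
assumption is that every coordinator-to-tablet request carries the
coordinator epoch.

```python
class StaleCoordinatorEpochError(Exception):
    """Raised when a request carries an epoch older than the latest one seen."""


class TabletServerFence:
    """Hypothetical helper: tracks the highest coordinator epoch seen so far."""

    def __init__(self):
        self.latest_epoch = -1  # no coordinator seen yet

    def check(self, request_epoch: int) -> None:
        # Reject anything sent by a coordinator that has been superseded.
        if request_epoch < self.latest_epoch:
            raise StaleCoordinatorEpochError(
                f"request epoch {request_epoch} < latest epoch {self.latest_epoch}"
            )
        # Equal or newer epochs are accepted; a newer one raises the fence.
        self.latest_epoch = request_epoch
```

A request handler would call check() before doing any work, so a request
from an old leader fails before it can change tablet state.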

2) 'coordinator.id' and 4) Path Structure Optimization
OK, I just learned that the UUID design is intended for stateless
services. In that case we don't need coordinator.id or its node path in ZK.

3) ZK Operation Validation
In fact, we don't need every ZK operation to validate
coordinatorEpochZkVersion. We only have to validate create/update/delete
requests issued by the coordinator. Conversely, we do not need to validate
requests from tablet servers or "read" requests from the coordinator. The
requests that need validation include:
  a. Bucket leader and ISR info create/drop/update (on reassignment)
  b. Table and partition create/drop
  c. ACL create/delete/update
  d. ...
I'll add a description of this part to the FIP.
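
To make the idea concrete, here is a small in-memory sketch in Python. It
is not ZooKeeper client code: FakeZk, guarded_set, and the paths are
hypothetical names used only for illustration. With a real ZooKeeper
client the version check and the write would have to go into one atomic
transaction; this single-threaded fake simply performs them back to back.

```python
class BadVersionError(Exception):
    """The epoch znode version no longer matches: leadership was lost."""


class FakeZk:
    """In-memory stand-in for ZooKeeper, only for illustrating the idea."""

    def __init__(self):
        self.data = {}      # path -> value
        self.versions = {}  # path -> znode version, bumped on every write

    def set(self, path, value):
        # Unconditional write; returns the new version of the znode.
        self.data[path] = value
        self.versions[path] = self.versions.get(path, -1) + 1
        return self.versions[path]

    def guarded_set(self, epoch_path, expected_version, path, value):
        # Coordinator writes first verify the epoch znode version that the
        # coordinator recorded when it became leader.
        if self.versions.get(epoch_path) != expected_version:
            raise BadVersionError(epoch_path)
        return self.set(path, value)

    def get(self, path):
        # Reads need no epoch validation.
        return self.data.get(path)


zk = FakeZk()
old_leader_version = zk.set("/coordinators/epoch", 1)
zk.guarded_set("/coordinators/epoch", old_leader_version, "/tables/t1", "created")

# A new leader is elected and bumps the epoch znode.
new_leader_version = zk.set("/coordinators/epoch", 2)

try:
    # The old leader has not noticed yet and tries to write.
    zk.guarded_set("/coordinators/epoch", old_leader_version, "/tables/t2", "created")
except BadVersionError:
    print("old leader write rejected")
```

Only the create/update/delete paths go through guarded_set; get() stays
unguarded, which matches the split between write and read requests above.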

5) Standby RPC Behavior
In my current design, standby coordinators do not need to start the RPC
service. For "bootstrap.server", I recommend setting tablet server
addresses rather than the coordinator address, because tablet servers are
relatively stable, while a coordinator address may become invalid when the
leader changes.

6) Test Plan Enhancements
Thanks for the reminder. I will add more test cases.

On Thu, 5 Mar 2026 at 21:08, Jark Wu <[email protected]> wrote:

> Hi cheng,
>
> Thanks for the great proposal. I think this is a very important
> feature for Fluss.
>
> Here are my comments about the FIP:
> 1) Coordinator Epoch Implementation
> The coordinator epoch is currently a placeholder variable fixed at 0.
> This logic must be updated to use the actual epoch variable.
> Additionally, some RPCs in the coordinator-to-tablet channel do not
> carry the epoch. A comprehensive check is required to ensure both
> sides implement epoch fencing logic.
>
> 2) 'coordinator.id'
> Making coordinator.id mandatory is unnecessary. The current
> coordinator instance ID is a UUID used only for logging and
> consistency checks. Only stateful nodes require a lifecycle-unique ID.
> Since the coordinator is currently stateless, using a new UUID on each
> restart is acceptable.
>
> 3) ZK Operation Validation
> After an old leader recovers, it may attempt ZK operations (such as
> auto partition creation) before perceiving the leadership loss.
> Therefore, must all ZK operations include transaction validation using
> coordinatorEpochZkVersion? The FIP does not currently describe this.
> Please confirm.
>
> 4) Path Structure Optimization
> The path /coordinators/ids/[coordinatorId] can be eliminated. Each
> LeaderLatch participant can embed CoordinatorAddress data in its node
> via the Participant ID. Consequently, all coordinator addresses can be
> retrieved from the children of /coordinators/election/.
>
> Minor Issues
>
> 5) Standby RPC Behavior
> Define the external RPC behavior for standby nodes. Should it return
> NOT_LEADER exception to deny RPC or simply not start any RPC service?
>
> 6) Test Plan Enhancements
> The test plan must include scenarios for old leader reconnection and
> dual-leader concurrent writes.
>
> Best,
> Jark
>
> On Fri, 27 Feb 2026 at 14:32, cheng z <[email protected]> wrote:
> >
> > Hi devs,
> >
> > I propose initiating discussion on FIP-9[1]. As a critical component for
> > Fluss to scale for large-scale production deployment, Coordinator high
> > availability has remained missing until now. I am proposing this design
> > specifically to address this gap and thereby enable Fluss to be fully
> > reliable.
> >
> > Any feedback and suggestions on this proposal are welcome!
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLUSS/FIP-9%3A+Support+CoordinatorServer+High+Availability
> >
> > Best regards,
> > *zcoo*
>
