Hi Cheng,

Thanks for the great proposal. I think this is a very important
feature for Fluss.

Here are my comments about the FIP:
1) Coordinator Epoch Implementation
The coordinator epoch is currently a placeholder variable fixed at 0.
This logic must be updated to use the actual epoch variable.
Additionally, some RPCs in the coordinator-to-tablet channel do not
carry the epoch. A comprehensive check is required to ensure both
sides implement epoch fencing logic.
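
To illustrate what I mean by fencing on the receiving side, here is a minimal sketch. The class and method names are hypothetical and are not existing Fluss APIs; it only assumes that every coordinator-to-tablet request carries an integer coordinator epoch.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical tablet-server-side fence; names are illustrative, not Fluss APIs. */
final class CoordinatorEpochFence {
    // Highest coordinator epoch this tablet server has accepted so far.
    private final AtomicInteger highestSeenEpoch = new AtomicInteger(0);

    /**
     * Returns true and remembers the epoch if the request comes from the
     * current or a newer coordinator; returns false for a stale epoch.
     */
    boolean tryAccept(int requestEpoch) {
        while (true) {
            int current = highestSeenEpoch.get();
            if (requestEpoch < current) {
                return false; // sent by a fenced (old) coordinator
            }
            if (highestSeenEpoch.compareAndSet(current, requestEpoch)) {
                return true;
            }
        }
    }

    int highestSeenEpoch() {
        return highestSeenEpoch.get();
    }
}
```

An RPC that does not carry the epoch cannot go through a check like this, which is why every request type on that channel needs to be audited.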

2) 'coordinator.id'
Making coordinator.id mandatory is unnecessary. The current
coordinator instance ID is a UUID used only for logging and
consistency checks. Only stateful nodes require a lifecycle-unique ID.
Since the coordinator is currently stateless, using a new UUID on each
restart is acceptable.

3) ZK Operation Validation
After an old leader recovers, it may attempt ZK operations (such as
auto partition creation) before perceiving the leadership loss.
Should every ZK write therefore be issued as a transaction that first
validates coordinatorEpochZkVersion? The FIP does not currently
describe this. Please confirm.
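
A rough sketch of the check-then-write behavior I have in mind is below. It uses an in-memory stand-in rather than a real ZooKeeper client, and all names are hypothetical. In ZooKeeper itself this would correspond to a multi-op that combines a version check on the epoch znode with the actual write.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for ZooKeeper, only to illustrate the check-then-write idea. */
final class FencedZkStore {
    private final Map<String, byte[]> nodes = new HashMap<>();
    // Version of the (hypothetical) coordinator epoch znode; bumped on each leader change.
    private int epochZkVersion = 0;

    /** Called when a new coordinator wins the election and rewrites the epoch znode. */
    synchronized int bumpEpoch() {
        return ++epochZkVersion;
    }

    /** Atomically checks the epoch znode version and, only if it matches, applies the write. */
    synchronized boolean writeIfLeader(int expectedEpochZkVersion, String path, byte[] data) {
        if (expectedEpochZkVersion != epochZkVersion) {
            return false; // caller lost leadership; the write is rejected
        }
        nodes.put(path, data);
        return true;
    }

    synchronized boolean exists(String path) {
        return nodes.containsKey(path);
    }
}
```

With this shape, an old leader that still holds the previous version cannot create a partition node after a new leader has bumped the epoch.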

4) Path Structure Optimization
The path /coordinators/ids/[coordinatorId] can be eliminated. Each
LeaderLatch participant can embed CoordinatorAddress data in its node
via the Participant ID. Consequently, all coordinator addresses can be
retrieved from the children of /coordinators/election/.

Minor Issues

5) Standby RPC Behavior
Please define the external RPC behavior of standby nodes. Should a
standby reject requests with a NOT_LEADER exception, or simply not
start any RPC service?
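
For the first option, a minimal sketch could look like the following. The gateway class and the exception type are hypothetical placeholders for whatever error code the FIP ends up defining.

```java
import java.util.function.Supplier;

/** Hypothetical wrapper around the coordinator's RPC handlers. */
final class StandbyAwareGateway {
    /** Hypothetical error type standing in for a NOT_LEADER error code. */
    static final class NotLeaderException extends RuntimeException {
        NotLeaderException(String message) {
            super(message);
        }
    }

    // Flipped by the leader election callback; a node starts as standby.
    private volatile boolean leader = false;

    void onLeadershipChanged(boolean isLeader) {
        this.leader = isLeader;
    }

    /** Runs the request only on the leader; a standby rejects it. */
    <T> T handle(Supplier<T> request) {
        if (!leader) {
            throw new NotLeaderException("This coordinator is a standby");
        }
        return request.get();
    }
}
```

The trade-off is that a standby with a running RPC service can tell clients explicitly to retry elsewhere, whereas a standby without one only produces connection failures.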

6) Test Plan Enhancements
The test plan must include scenarios for old leader reconnection and
dual-leader concurrent writes.

Best,
Jark

On Fri, 27 Feb 2026 at 14:32, cheng z <[email protected]> wrote:
>
> Hi devs,
>
> I propose initiating discussion on FIP-9[1]. As a critical component for
> Fluss to scale for large-scale production deployment, Coordinator high
> availability has remained missing until now. I am proposing this design
> specifically to address this gap and thereby enable Fluss to be fully
> reliable.
>
> Any feedback and suggestions on this proposal are welcome!
>
> [1]
> https://cwiki.apache.org/confluence/display/FLUSS/FIP-9%3A+Support+CoordinatorServer+High+Availability
>
> Best regards,
> *zcoo*
