Hi cheng,

Thanks for the great proposal. I think this is a very important feature for Fluss.
Here are my comments about the FIP:

1) Coordinator Epoch Implementation

The coordinator epoch is currently a placeholder variable fixed at 0. This
logic must be updated to use the actual epoch variable. Additionally, some
RPCs in the coordinator-to-tablet channel do not carry the epoch. A
comprehensive check is required to ensure both sides implement epoch fencing
logic.

2) 'coordinator.id'

Making coordinator.id mandatory is unnecessary. The current coordinator
instance ID is a UUID used only for logging and consistency checks. Only
stateful nodes require a lifecycle-unique ID. Since the coordinator is
currently stateless, using a new UUID on each restart is acceptable.

3) ZK Operation Validation

After an old leader recovers, it may attempt ZK operations (such as auto
partition creation) before it notices that it has lost leadership. Must all
ZK operations therefore include transaction validation using
coordinatorEpochZkVersion? The FIP does not currently describe this. Please
confirm.

4) Path Structure Optimization

The path /coordinators/ids/[coordinatorId] can be eliminated. Each
LeaderLatch participant can embed its CoordinatorAddress data in its node via
the Participant ID. Consequently, all coordinator addresses can be retrieved
from the children of /coordinators/election/.

Minor Issues

5) Standby RPC Behavior

Define the external RPC behavior for standby nodes. Should a standby return a
NOT_LEADER exception to reject RPCs, or simply not start any RPC service?

6) Test Plan Enhancements

The test plan must include scenarios for old leader reconnection and
dual-leader concurrent writes.

Best,
Jark

On Fri, 27 Feb 2026 at 14:32, cheng z <[email protected]> wrote:

> Hi devs,
>
> I propose initiating discussion on FIP-9[1]. As a critical component for
> Fluss to scale for large-scale production deployment, Coordinator high
> availability has remained missing until now. I am proposing this design
> specifically to address this gap and thereby enable Fluss to be fully
> reliable.
>
> Any feedback and suggestions on this proposal are welcome!
>
> [1]
> https://cwiki.apache.org/confluence/display/FLUSS/FIP-9%3A+Support+CoordinatorServer+High+Availability
>
> Best regards,
> *zcoo*
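
P.S. To make comments 1) and 3) concrete, here is a minimal sketch of the two fencing checks. It is a toy in-memory model in Python, not code from the FIP or from Fluss. TabletServer, FakeZk, FencedError, and guarded_write are hypothetical names used only for illustration. The only name taken from the discussion is coordinatorEpochZkVersion, modeled here as epoch_zk_version. The sketch assumes that the tablet server remembers the highest coordinator epoch it has seen, and that every coordinator write to ZK is conditioned on the version of the epoch node.

```python
from dataclasses import dataclass, field


class FencedError(Exception):
    """Raised when a request or write comes from a stale coordinator."""


@dataclass
class TabletServer:
    """Hypothetical tablet-side check for comment 1): reject stale epochs."""

    highest_epoch_seen: int = 0

    def handle(self, coordinator_epoch: int, request: str) -> str:
        # A request from an older epoch means the sender is no longer leader.
        if coordinator_epoch < self.highest_epoch_seen:
            raise FencedError(
                f"stale epoch {coordinator_epoch} < {self.highest_epoch_seen}"
            )
        self.highest_epoch_seen = coordinator_epoch
        return f"applied {request} at epoch {coordinator_epoch}"


@dataclass
class FakeZk:
    """In-memory stand-in for ZooKeeper; this is not a real ZK client."""

    epoch: int = 0
    epoch_zk_version: int = 0
    data: dict = field(default_factory=dict)

    def elect(self):
        # A new leader rewrites the epoch node, which also bumps its version.
        self.epoch += 1
        self.epoch_zk_version += 1
        return self.epoch, self.epoch_zk_version

    def guarded_write(self, path: str, value: str, expected_version: int) -> None:
        # Comment 3): the version check and the write succeed or fail together.
        if expected_version != self.epoch_zk_version:
            raise FencedError(
                f"epoch node version is {self.epoch_zk_version}, "
                f"writer expected {expected_version}"
            )
        self.data[path] = value
```

With this model, an old leader that still holds the previous epoch and version is rejected both by the tablet server and by guarded_write once a new leader has been elected. In real ZooKeeper, the guarded write would presumably map to a multi-op transaction that pairs a version check on the epoch node with the actual write, so the two cannot be separated by a leadership change.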
