kumarpritam863 opened a new pull request, #12372: URL: https://github.com/apache/iceberg/pull/12372
**A few observations:**
- In ICR mode, "open" receives only the **newly added partitions**, and the Connect framework does **not call** "open" at all if no new partitions are assigned to the task.
- Similarly, "close" receives only the **removed partitions**, yet we **blindly** close the coordinator.
- The coordinator is created only when "open" is called. When a partition is revoked from a task and no partition is added, only "close" is called with the revoked partition, without any "open" call.

**How this leads to the No-Coordinator scenario:**
- Consider the case where a partition other than partition zero is removed from the leader task and assigned to some other task.
- The "close" call on the leader closes the coordinator, but since this task gets no "open" call, no leader election takes place on it.
- The other task, which received the removed partition, does not hold partition zero, so leader election on that task does not elect it as leader either.

**Let's see this with the example below:**

Initially we have one worker "W0" with two tasks "T0" and "T1" consuming from two partitions, "P0" and "P1", of one topic, so the initial configuration is:

**W0 -> [{T0, P0}, {T1, P1}]** -> this elects "T0" as the coordinator because it holds "P0".

Now another worker "W1" joins. This leads to rebalancing of both the tasks and the topic partitions within those tasks.
- The Connect framework stops T1 on W0.
- This causes a partition-level rebalance, and partition P1 is assigned to T0.

State at this point:

**W0 -> [{T0, [P0, P1]}]**

**W1 -> []**

Now,
- The Connect framework starts T1 on W1.
- This again leads to a rebalance at the partition level. Assume P1 is removed from T0:
- The Connect framework calls "close" on T0 with partition P1.
- This closes the coordinator on T0.
- The Connect framework makes an "open" call on T1 with partition P1.
- This triggers leader election on T1, but since T1 is assigned only P1, it is not elected as coordinator.

Hence this leads to a No-Coordinator scenario.

**Data Loss Scenario:**

In Incremental Cooperative Rebalancing (ICR) mode, consumers do not stop consuming when a rebalance happens, as there is no stop-the-world phase like in the "Eager" rebalancing mode. When a partition is removed from a task, the consumer coordinator calls close(Collection) on the sink task. Because this call blindly discards all the files, it also discards the records for the partitions still retained by this task. Moreover, the close call sets the committer to null, and since put(Collection) only has a not-null check for the committer, it silently ignores the records it receives in the meantime. Once we get an open(Collection) call from Kafka, the committer is started without resetting the offsets, which leads to data loss.

Document explaining both scenarios: https://docs.google.com/document/d/1okqGq1HXu2rDnq88wIlVDv0EmNFZYB1PhgwyAzWIKT8/edit?tab=t.0#heading=h.51qcys2ewbsa
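
The No-Coordinator walkthrough above can be replayed with a small simulation. This is only a sketch under simplifying assumptions: `Task`, `guardedClose`, and `anyCoordinator` are hypothetical names, leader election is reduced to "the task that holds partition 0 runs the coordinator", and the guarded variant is one possible way to avoid the problem, not necessarily what this PR implements.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RebalanceSketch {

  /** Hypothetical stand-in for a sink task; not the connector's real class. */
  static class Task {
    final String name;
    final boolean guardedClose;
    final Set<Integer> assigned = new HashSet<>();
    boolean coordinatorRunning = false;

    Task(String name, boolean guardedClose) {
      this.name = name;
      this.guardedClose = guardedClose;
    }

    /** In ICR mode, open receives only the newly added partitions. */
    void open(List<Integer> added) {
      assigned.addAll(added);
      // Simplified leader election (assumption): the owner of partition 0 leads.
      if (assigned.contains(0)) {
        coordinatorRunning = true;
      }
    }

    /** In ICR mode, close receives only the removed partitions. */
    void close(List<Integer> removed) {
      assigned.removeAll(removed);
      // Blind variant: always stop the coordinator.
      // Guarded variant: stop it only when partition 0 itself is revoked.
      if (!guardedClose || removed.contains(0)) {
        coordinatorRunning = false;
      }
    }
  }

  /** Replays the W0/W1 example and reports whether any task still runs a coordinator. */
  static boolean anyCoordinator(boolean guardedClose) {
    Task t0 = new Task("T0", guardedClose);
    Task t1 = new Task("T1", guardedClose);

    // Initial state: W0 -> [{T0, P0}, {T1, P1}]
    t0.open(List.of(0));
    t1.open(List.of(1));

    // W1 joins: T1 is stopped on W0 and P1 moves to T0.
    t1.close(List.of(1));
    t0.open(List.of(1));

    // T1 starts on W1: P1 is revoked from T0 and assigned to T1.
    t0.close(List.of(1));
    t1.open(List.of(1));

    return t0.coordinatorRunning || t1.coordinatorRunning;
  }

  public static void main(String[] args) {
    System.out.println("blind close   -> coordinator present: " + anyCoordinator(false));
    System.out.println("guarded close -> coordinator present: " + anyCoordinator(true));
  }
}
```

With the blind close, the final state has no coordinator on either task, matching the scenario described above. With the guarded close, T0 keeps its coordinator because partition 0 never left it.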
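
The data loss scenario can be illustrated in the same spirit. The sketch below is an assumption-laden model, not the connector's code: `Task`, `blindClose`, and `writtenOffsets` are hypothetical, the committer is represented by a plain in-memory buffer, and only the offsets of a single retained partition are tracked while some other partition is revoked.

```java
import java.util.ArrayList;
import java.util.List;

class DataLossSketch {

  /** Hypothetical stand-in for the sink task; tracks offsets of one retained partition. */
  static class Task {
    private final boolean blindClose;
    // A non-null buffer stands in for "committer is running" (assumption).
    private List<Long> buffer = new ArrayList<>();
    private final List<Long> written = new ArrayList<>();

    Task(boolean blindClose) {
      this.blindClose = blindClose;
    }

    /** Records are kept only while the committer exists; otherwise they are dropped silently. */
    void put(List<Long> offsets) {
      if (buffer != null) {
        buffer.addAll(offsets);
      }
    }

    /** Called when some other partition is revoked from this task. */
    void close() {
      if (blindClose) {
        // Discards buffered records of the retained partition and nulls the committer.
        buffer = null;
      }
    }

    /** Restarts the committer without rewinding the consumer position. */
    void open() {
      if (buffer == null) {
        buffer = new ArrayList<>();
      }
    }

    void commit() {
      written.addAll(buffer);
      buffer.clear();
    }
  }

  /** Replays put, close, put, open, put, commit and returns the offsets that were written. */
  static List<Long> writtenOffsets(boolean blindClose) {
    Task task = new Task(blindClose);
    task.put(List.of(0L, 1L)); // buffered before the rebalance
    task.close();              // another partition is revoked
    task.put(List.of(2L, 3L)); // the consumer keeps delivering in ICR mode
    task.open();
    task.put(List.of(4L, 5L));
    task.commit();
    return task.written;
  }

  public static void main(String[] args) {
    System.out.println("blind close   -> written offsets: " + writtenOffsets(true));
    System.out.println("careful close -> written offsets: " + writtenOffsets(false));
  }
}
```

In this model the blind close writes only offsets 4 and 5, so offsets 0 through 3 of the retained partition are lost, while the careful close writes all six.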