virajjasani commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1432160563
In a distributed system, it is essential to have robust fail-fast mechanisms in place to prevent issues related to network partitioning. The system must be designed to prevent further degradation of availability and consistency in the event of a network partition. Several distributed systems offer fail-safe approaches, and for some, partition tolerance is critical to the extent that even a few seconds of heartbeat loss can trigger the removal of an application server instance from the cluster. For instance, most ZooKeeper clients use ephemeral nodes for this purpose, to keep the system reliable, fault-tolerant and strongly consistent in the event of a network partition.

From the HDFS architecture viewpoint, it is crucial to understand the critical role that the active and observer namenodes play in file system operations. In a large-scale cluster, if the datanodes holding the same block (primary and replicas) lose connection to both the active and observer namenodes for a significant amount of time, then delaying the shutdown and restart of those datanodes will further deteriorate the availability of the service. A restart lets them re-establish the connection with the namenodes, assuming the active namenode is alive; that assumption matters when recovering from a network partition. This scenario underscores the importance of resolving network partitioning.

This is a real use case for HDFS, and it is not prudent to assume that every deployment or cluster management application must be able to restart datanodes based on JMX metrics, as this would introduce another application just to resolve the impact of a network partition on HDFS. Besides, popular cluster management applications are not typically used in all cloud-native environments. Even if these cluster management applications are deployed, certain security constraints may restrict their access to JMX metrics and prevent them from interfering with HDFS operations.
The applications that can only trigger alerts for users based on set parameters (for instance, missing blocks > 0) are allowed to access JMX metrics.
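
To make the fail-fast idea above concrete, here is a minimal sketch of the decision a datanode could make on its own, without an external cluster manager reading JMX. It is only an illustration under stated assumptions: the class `NamenodeConnectivityWatchdog`, its methods, and the threshold parameter are hypothetical names, not the code or configuration in this PR, and a real datanode would feed it from its existing heartbeat path.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch: tracks the last successful heartbeat per namenode and
 * reports when the datanode has been cut off from all of them for too long.
 * Not the actual DataNode implementation.
 */
final class NamenodeConnectivityWatchdog {
    // Assumed threshold: how long the datanode may stay disconnected from
    // every tracked namenode before it should fail fast.
    private final long maxSilenceMillis;
    private final Map<String, Long> lastHeartbeatMillis = new ConcurrentHashMap<>();

    NamenodeConnectivityWatchdog(long maxSilenceMillis, long nowMillis, String... namenodeIds) {
        if (namenodeIds.length == 0) {
            throw new IllegalArgumentException("at least one namenode is required");
        }
        this.maxSilenceMillis = maxSilenceMillis;
        for (String id : namenodeIds) {
            // Treat startup as the last known contact for each namenode.
            lastHeartbeatMillis.put(id, nowMillis);
        }
    }

    /** Called whenever a heartbeat to the given namenode succeeds. */
    void recordHeartbeat(String namenodeId, long nowMillis) {
        lastHeartbeatMillis.put(namenodeId, nowMillis);
    }

    /**
     * Returns true only if every tracked namenode (for example active and
     * observer) has been silent for longer than the threshold.
     */
    boolean shouldShutDown(long nowMillis) {
        for (long last : lastHeartbeatMillis.values()) {
            if (nowMillis - last <= maxSilenceMillis) {
                return false; // still connected to at least one namenode
            }
        }
        return true;
    }
}
```

With a 30 second threshold and two tracked namenodes, `shouldShutDown` stays false as long as either one has answered within the last 30 seconds, and turns true only once both have been silent for longer than that. The caller would then stop the datanode so that it can be restarted and reconnect.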
