virajjasani commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1430828452
> If the datanode is connected to observer namenode, it can serve requests, why we need to shutdown

The observer namenode is a different case. I was actually thinking about making this include the observer namenode too, i.e. if the datanode has not received a heartbeat response from the observer or active namenode in the last e.g. 30s or so, then it should shut down. This is an option, no issues with it.

> Even if it is connected to standby, a failover happens and it will be in good shape, else if you restart a bunch of datanodes, the new namenode will be flooded by block reports and just increasing problems.

This problem would occur only if we select a reasonably low value. The recommendation for this config value is high enough to include the extra time needed for a namenode failover.

> If something gets messed up with Active namenode, you shutdown all, the BR are already heavy, you forced all other namenodes to handle them again, making failover more difficult. and if it is some faulty datanodes which lost connection, you didn't get that alarmed, and all Standby and Observers will keep on getting flooded by BRs, so in case Active NN literally dies and tries to failover to any of the Namenode which these Datanodes were connected, will be fed with unnecessary loads of BlockReports. (BR has an option of initial delay as well, it isn't like all bombard at once and you are sorted in 5-10 mins)

The moment the active namenode becomes messy, or dies, is exactly when the availability of the HDFS service can be impacted. So either we have the observer namenode take care of read requests in the meantime, or the failover needs to happen. If neither of those happens, the datanode is not really useful by staying in the cluster for a longer duration. Let's say the namenode goes bad and the failover does take time; the new active one is anyway going to take time processing BRs, right?

> If something got messed with the datanode, that is why it isn't able to connect to Active. If something is in Memory not persisted to disk, or some JMX parameter or N/W parameters which can be used to figure out things gets lost.

Do you mean an hsync vs hflush kind of thing for in-progress files? Is that not already taken care of?

> That is the reason most cluster administrator in not so cool situations, show XYZ datanode is unhealthy or not, if in some case they don't it should be handled over there.

The response from the cluster admin applications would take time. Why not let the datanode auto-heal? Also, this change is not going to terminate the datanode; it is going to shut down gracefully.

> In case of shared datanodes in a federated setup, say it is connected to Active for one Namespace and has completely lost touch with another, then? Restart to get both working? Don't restart so that at least one stays working? Both are correct in there own ways and situation and the datanode shouldn't be in a state to decide its fate for such reasons.

IMO any namespace that is not connected to its active namenode is not able to serve requests from that active namenode, and hence is not in a good state. I got your point, but shouldn't the health of a datanode in a federated setup be determined by whether all BPs are connected to their active namenodes? Is that not the real factor determining the health of the datanode?

> Making anything configurable doesn't justify having it in. if we are letting any user to use this via any config as well, then we should be sure enough it is necessary and good thing to do, we can not say ohh you configured it, now it is your problem...

I am not making this claim only on the basis of the feature being configurable; it is reasonable to let the user determine the best course of action for a given situation. The only recommendation I have is: the user should be able to let the datanode decide whether it should shut down gracefully when it has not heard anything from the active or observer namenode for the past x seconds (50/60s or so).
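To make the recommendation above concrete, here is a minimal sketch of the decision being proposed. This is NOT the actual patch: the class and method names are hypothetical, and the real change would hook into the DataNode's existing heartbeat machinery rather than live in a standalone class.

```java
/**
 * Hypothetical sketch of the proposed check, not the actual Hadoop patch.
 * The real implementation would track heartbeat responses inside the
 * DataNode's block-pool service threads; names and defaults here are
 * illustrative only.
 */
public class HeartbeatTimeoutCheck {

  /** Assumed default, in the 50-60s range suggested in the discussion. */
  public static final long DEFAULT_TIMEOUT_MS = 60_000L;

  /**
   * @param lastActiveOrObserverHeartbeatMs time of the last heartbeat
   *        response received from an ACTIVE or OBSERVER namenode
   * @param nowMs current time, same clock as the parameter above
   * @param timeoutMs configured shutdown threshold; a value <= 0
   *        disables the check entirely (feature off by default)
   * @return true if the datanode should begin a graceful shutdown
   */
  public static boolean shouldShutdown(long lastActiveOrObserverHeartbeatMs,
                                       long nowMs, long timeoutMs) {
    if (timeoutMs <= 0) {
      return false; // feature disabled by configuration
    }
    return nowMs - lastActiveOrObserverHeartbeatMs > timeoutMs;
  }
}
```

In a federated setup, this check would presumably be evaluated per block pool, which is where the "all BPs connected to active" question above comes in.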
I have tried my best to answer the questions above. Please also take a look at the Jira/PR description where this idea comes from. We have seen issues with specific infra where, until datanodes were manually shut down, we saw no hope of improving availability; this has happened multiple times. Please keep in mind that cluster administrators in cloud-native environments do not have access to JMX metrics due to security constraints. Really appreciate all your points and suggestions Ayush, please take a look again.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
