virajjasani commented on PR #5396:
URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1430828452

   > If the datanode is connected to observer namenode, it can serve requests, 
why we need to shutdown
   
   The observer namenode is a different case. I was actually thinking about 
including the observer namenode as well, i.e. if the datanode has not received 
a heartbeat from either an observer or the active namenode in the last ~30s, 
then it should shut down. That is an option; no issues with it.
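   A minimal sketch of the staleness check being proposed (hypothetical class 
and field names, not actual HDFS code; the 30s threshold is just the example 
value discussed above):

   ```java
   // Hypothetical sketch of the proposed check: the datanode tracks the last
   // heartbeat acknowledged by an active or observer namenode and decides to
   // shut down gracefully once that heartbeat is older than a configured timeout.
   public class HeartbeatStalenessCheck {
       private final long timeoutMs;  // e.g. 30_000 ms, as discussed above
       private volatile long lastActiveOrObserverHeartbeatMs;

       public HeartbeatStalenessCheck(long timeoutMs, long lastHeartbeatMs) {
           this.timeoutMs = timeoutMs;
           this.lastActiveOrObserverHeartbeatMs = lastHeartbeatMs;
       }

       // Called whenever an active/observer namenode acknowledges a heartbeat.
       public void recordHeartbeat(long nowMs) {
           lastActiveOrObserverHeartbeatMs = nowMs;
       }

       // True when no active/observer heartbeat has arrived within the timeout.
       public boolean shouldShutdown(long nowMs) {
           return nowMs - lastActiveOrObserverHeartbeatMs > timeoutMs;
       }
   }
   ```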
   
   
   > Even if it is connected to standby, a failover happens and it will be in 
good shape, else if you restart a bunch of datanodes, the new namenode will be 
flooded by block reports and just increasing problems.
   
   This problem would occur only if we chose an unreasonably low value. The 
recommendation is to set this config high enough to allow extra time for a 
namenode failover.
   
   
   > If something gets messed up with Active namenode, you shutdown all, the BR 
are already heavy, you forced all other namenodes to handle them again, making 
failover more difficult. and if it is some faulty datanodes which lost 
connection, you didn't get that alarmed, and all Standby and Observers will 
keep on getting flooded by BRs, so in case Active NN literally dies and tries 
to failover to any of the Namenode which these Datanodes were connected, will 
be fed with unnecessary loads of BlockReports. (BR has an option of initial 
delay as well, it isn't like all bombard at once and you are sorted in 5-10 
mins)
   
   The moment the active namenode becomes unhealthy, or dies, is exactly when 
the availability of the HDFS service is at risk. So either an observer 
namenode serves read requests in the meantime, or the failover needs to 
happen. If neither happens, the datanode is not really useful by staying in 
the cluster for a longer duration. Say the namenode goes bad and the failover 
does take time: the new active is going to take time processing BRs anyway, 
right?
   
   
   > If something got messed with the datanode, that is why it isn't able to 
connect to Active. If something is in Memory not persisted to disk, or some JMX 
parameter or N/W parameters which can be used to figure out things gets lost.
   
   Do you mean hsync vs hflush kind of thing for in-progress files? Is that 
not already taken care of?
   
   
   > That is the reason most cluster administrator in not so cool situations, 
show XYZ datanode is unhealthy or not, if in some case they don't it should be 
handled over there.
   
   A response from the cluster admin applications would take time. Why not let 
the datanode auto-heal? Also, this change is not going to abruptly terminate 
the datanode; it shuts down gracefully.
   
   
   > In case of shared datanodes in a federated setup, say it is connected to 
Active for one Namespace and has completely lost touch with another, then? 
Restart to get both working? Don't restart so that at least one stays working? 
Both are correct in their own ways and situations, and the datanode shouldn't be 
in a state to decide its fate for such reasons.
   
   IMO any namespace whose block pool is not connected to its active namenode 
cannot serve requests from that active namenode, and hence the datanode is not 
in a good state for it. I get your point, but in a federated setup, shouldn't 
the health of a datanode be determined by whether all BPs are connected to 
their active namenodes? Isn't that the real factor determining the health of 
the datanode?
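   To illustrate the federated case (again a hypothetical sketch, not actual 
HDFS code): the datanode would be considered healthy only if every block pool 
reports a live connection to its active namenode.

   ```java
   import java.util.Map;

   // Hypothetical sketch for the federated case: health is the conjunction of
   // per-block-pool connectivity to the respective active namenode.
   public class FederatedDatanodeHealth {
       // Keyed by block pool id; value is whether that BP's service actor is
       // currently connected to an active namenode.
       public static boolean isHealthy(Map<String, Boolean> bpConnectedToActive) {
           return !bpConnectedToActive.isEmpty()
                   && bpConnectedToActive.values().stream()
                           .allMatch(Boolean::booleanValue);
       }
   }
   ```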
   
   
   > Making anything configurable doesn't justify having it in. if we are 
letting any user to use this via any config as well, then we should be sure 
enough it is necessary and good thing to do, we can not say ohh you configured 
it, now it is your problem...
   
   I am not making this claim only on the basis that the feature is 
configurable. But it is reasonable to let the operator determine the best 
course of action for a given situation. The only recommendation I have is: the 
user should be able to let the datanode decide whether to shut down gracefully 
when it has not heard anything from an active or observer namenode for the 
past x seconds (50/60s or so).
   I have tried my best to answer the questions above. Please also take a look 
at the Jira/PR description where this idea comes from. We have seen issues on 
specific infra where, until we manually shut down the datanodes, we saw no 
hope of improving availability; this has happened multiple times.
   
   Please keep in mind that cluster administrators in cloud-native 
environments do not have access to JMX metrics due to security constraints.
   
   Really appreciate all your points and suggestions, Ayush; please take a 
look again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

