[
https://issues.apache.org/jira/browse/HADOOP-18396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577364#comment-17577364
]
Steve Vaughan commented on HADOOP-18396:
----------------------------------------
[[email protected]] Did you have any comments about the individual changes?
Even in environments intended to be static, there are circumstances where
unplanned changes are required (e.g. hardware failures). In addition, having
servers silently ignore configuration (dropping unresolved servers) because
of a hiccup during startup can lead to unexpected behavior.
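As a rough illustration of the alternative to dropping unresolved servers,
here is a minimal sketch, not the code in the patch. The class ConfiguredPeers
and its methods are hypothetical names; only the java.net.InetSocketAddress
calls are real JDK API. Every configured peer is kept, and resolution is
retried later for the ones that are still unresolved.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: keeps every configured peer, resolved or not. */
final class ConfiguredPeers {
    private final List<InetSocketAddress> peers = new ArrayList<>();

    /** Record a peer without resolving it, so configuration is never dropped. */
    void add(String host, int port) {
        peers.add(InetSocketAddress.createUnresolved(host, port));
    }

    /** Number of peers that currently have no resolved address. */
    int unresolvedCount() {
        int count = 0;
        for (InetSocketAddress peer : peers) {
            if (peer.isUnresolved()) {
                count++;
            }
        }
        return count;
    }

    /**
     * Retry resolution for peers that are still unresolved. Peers that fail
     * again stay in the list and are retried on the next call.
     * Returns the number of peers that remain unresolved.
     */
    int refresh() {
        for (int i = 0; i < peers.size(); i++) {
            InetSocketAddress current = peers.get(i);
            if (current.isUnresolved()) {
                // This constructor attempts a lookup; on failure the result
                // is simply marked unresolved again instead of throwing.
                peers.set(i, new InetSocketAddress(
                        current.getHostString(), current.getPort()));
            }
        }
        return unresolvedCount();
    }

    /** Total number of configured peers, regardless of resolution state. */
    int size() {
        return peers.size();
    }
}
```

The point of the sketch is that size() never shrinks: a lookup failure at
startup only delays a peer, it does not remove it from the desired state.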
> Issues running in dynamic / managed environments
> ------------------------------------------------
>
> Key: HADOOP-18396
> URL: https://issues.apache.org/jira/browse/HADOOP-18396
> Project: Hadoop Common
> Issue Type: Improvement
> Affects Versions: 3.4.0, 3.3.9, 3.3.4
> Environment: Running an HA configuration in Kubernetes, using Java 11.
> Reporter: Steve Vaughan
> Assignee: Steve Vaughan
> Priority: Major
>
> Running in dynamic or managed environments is a challenge because we can't
> assume that all services will have DNS entries, will be started in a specific
> order, will maintain constant IP addresses, etc. I'm using the following
> assumptions to guide the changes necessary to operate in this kind of
> environment:
> # The configuration files are an expression of desired state
> # If a referenced service instance is not resolvable or reachable at a
> moment in time, it eventually will be, and it should be able to participate
> in the future, as if it had been there originally, without requiring manual
> intervention
> # IP address changes should be handled in a way that not only allows
> distributed calls to continue to function, but avoids having to re-resolve
> the address over and over
> # Code that requires resolved names (Kerberos and DataNode registration)
> should fall back to DNS reverse lookups to work around temporary issues
> caused by caching. Example: The DataNode registration is only performed at
> startup, and yet the extra check that allows it to succeed in registering
> with the NameNode isn’t performed
> # If an HA system is supposed to only require a quorum, then we shouldn’t
> require the full set, allowing the called service to bring the remaining
> instances into compliance
> # Managing a service should be independent of other services. Example: You
> should be able to perform a rolling restart of JournalNodes without worrying
> about causing an issue with NameNodes as long as a quorum is present.
> A proof of these concepts would be the ability to:
> * Start with fewer than the full replica count of a service, while still
> providing the required quorum or minimal count, and have the cluster start
> and function. Example: 2 out of 3 configured JournalNodes should
> still allow the NameNode to format, function, rollover to the standby, etc.
> * Introduce missing instances and have them join the existing cluster
> without manual intervention. Example: A 3rd JournalNode started later should
> automatically be formatted and brought up to date
> * Perform rolling restarts of individual services without negatively
> impacting other services (causing failures, restarts, etc.). Example:
> Rolling restarts of JournalNodes shouldn't cause problems in NameNodes;
> Rolling restarts of NameNodes shouldn't cause problems with DataNodes
> * Report updated IP addresses in the logs only once (per dependent),
> avoiding costly re-resolution
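The quorum condition in the first proof point above can be sketched as
follows. This assumes a simple-majority quorum; the class Quorum and its
methods are hypothetical names used only for illustration, not Hadoop API.

```java
/** Hypothetical helper, assuming a simple-majority quorum. */
final class Quorum {
    private Quorum() {
    }

    /** Smallest number of instances that forms a majority of the configured set. */
    static int required(int configured) {
        if (configured <= 0) {
            throw new IllegalArgumentException("configured must be positive");
        }
        return configured / 2 + 1;
    }

    /** True when enough instances are reachable to proceed without the full set. */
    static boolean canProceed(int reachable, int configured) {
        return reachable >= required(configured);
    }
}
```

With 3 configured JournalNodes, required(3) is 2, so canProceed(2, 3) holds
and canProceed(1, 3) does not.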