[jira] [Commented] (HADOOP-18396) Issues running in dynamic / managed environments

Steve Loughran (Jira) Wed, 10 Aug 2022 03:38:07 -0700


    [ 
https://issues.apache.org/jira/browse/HADOOP-18396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577911#comment-17577911
 ]


Steve Loughran commented on HADOOP-18396:
-----------------------------------------

Looking at the list: as well as the need to cope with moving IP addresses, the 
sole bit of the stack which supports dynamic discovery today is based on 
ZooKeeper. Note that the YARN registry is designed to support dynamic service 
discovery, but again, it is ZK-based. It would be interesting to see if the 
same registry/lookup mechanisms could be supported by other back ends such as 
DynamoDB. Remember, the registry itself can support DNS lookup, so it would be 
a matter of changing what it binds to. (Oh look! Someone has coded in a lot of 
the dynamicness people need in cloud and nobody else has noticed!) It might be 
interesting to see if the lower-level ZK APIs, directly or through Curator, 
would support a back end which worked with persistent cloud databases.

> Issues running in dynamic / managed environments
> ------------------------------------------------
>
>                 Key: HADOOP-18396
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18396
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.4.0, 3.3.9, 3.3.4
>         Environment: Running an HA configuration in Kubernetes, using Java 11.
>            Reporter: Steve Vaughan
>            Assignee: Steve Vaughan
>            Priority: Major
>
> Running in dynamic or managed environments is a challenge because we can't 
> assume that all services will have DNS entries, will be started in a specific 
> order, will maintain constant IP addresses, etc.  I'm using the following 
> assumptions to guide the changes necessary to operate in this kind of 
> environment:
>  # The configuration files are an expression of desired state
>  # If a referenced service instance is not resolvable or reachable at a 
> moment in time, it will be eventually and should be able to participate in 
> the future, as if it had been there originally, without requiring manual 
> intervention
>  # IP address changes should be handled in a way that not only allows 
> distributed calls to continue to function, but avoids having to re-resolve 
> the address over and over
>  # Code that requires resolved names (Kerberos and DataNode registration) 
> should fall back to DNS reverse lookups to work around temporary issues 
> caused by caching.  Example: The DataNode registration is only performed at 
> startup, and yet the extra check that allows it to succeed in registering 
> with the NameNode isn’t performed
>  # If an HA system is supposed to only require a quorum, then we shouldn’t 
> require the full set, allowing the called service to bring the remaining 
> instances into compliance
>  # Managing a service should be independent of other services.  Example: You 
> should be able to perform a rolling restart of JournalNodes without worrying 
> about causing an issue with NameNodes as long as a quorum is present.
> A proof of these concepts would be the ability to:
>  * Start with less than the full replica count of a service: providing the 
> required quorum or minimal count should still allow a cluster to start and 
> function.  Example: 2 out of 3 configured JournalNodes should still allow 
> the NameNode to format, function, roll over to the standby, etc.
>  * Introduce missing instances, which should join the existing cluster 
> without manual intervention.  Example: Starting the 3rd JournalNode should 
> result in it automatically being formatted and brought up to date
>  * Perform rolling restarts of individual services without negatively 
> impacting other services (causing failures, restarts, etc.).  Example: 
> Rolling restarts of JournalNodes shouldn't cause problems in NameNodes; 
> Rolling restarts of NameNodes shouldn't cause problems with DataNodes
>  * Have logs report an updated IP address only once (per dependent), 
> avoiding costly re-resolution
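
One way to read the IP-address points in the description above (handle address changes without re-resolving over and over, and report each change only once per dependent) is sketched below. This is an illustration only, not code from Hadoop or from any patch on this issue: the class name RefreshableEndpoint, its methods, and the injected resolver function (a stand-in for a real DNS lookup, returning null when the name is not currently resolvable) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/**
 * Hypothetical sketch: caches the resolved address of one configured host
 * and re-resolves only when the caller reports a failure.
 */
class RefreshableEndpoint {
    private final String hostName;
    // Assumption: stand-in for a DNS lookup; returns null if unresolvable.
    private final Function<String, String> resolver;
    // One entry per observed address change for this dependent.
    private final List<String> changeLog = new ArrayList<>();
    private String address; // null until the name first resolves

    RefreshableEndpoint(String hostName, Function<String, String> resolver) {
        this.hostName = hostName;
        this.resolver = resolver;
    }

    /** Returns the cached address, resolving only if nothing is cached yet. */
    synchronized String currentAddress() {
        if (address == null) {
            address = resolver.apply(hostName);
        }
        return address;
    }

    /**
     * Call after a connection failure. Re-resolves the name and, if the
     * address changed, records the change once and updates the cache.
     * Returns true only when a new address was adopted.
     */
    synchronized boolean refreshAfterFailure() {
        String fresh = resolver.apply(hostName);
        if (fresh == null || fresh.equals(address)) {
            return false; // unresolvable right now, or nothing changed
        }
        changeLog.add(hostName + ": " + address + " -> " + fresh);
        address = fresh;
        return true;
    }

    /** Returns a copy of the recorded address changes. */
    synchronized List<String> changeLog() {
        return new ArrayList<>(changeLog);
    }
}
```

The design choice here is that normal calls use the cached address and only a reported failure triggers a new lookup, so a healthy connection never pays for re-resolution and a repeated failure against an unchanged address adds nothing to the change log.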



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
