[jira] [Commented] (HADOOP-18396) Issues running in dynamic / managed environments

Steve Loughran (Jira) Wed, 10 Aug 2022 03:24:41 -0700


    [ 
https://issues.apache.org/jira/browse/HADOOP-18396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577907#comment-17577907
 ]


Steve Loughran commented on HADOOP-18396:
-----------------------------------------

Oh, and the YARN service lifecycle classes are a simplification of the 
SmartFrog distributed component architecture, where you push out a declarative 
spec of the components to deploy and where to bind them, and the system ensures 
that the requirements are met. As well as predefined config options, e.g. ports, 
components could publish their own state, which could then be lazily evaluated 
by others.

https://dl.acm.org/doi/10.1145/1496909.1496915
 
While that project is dead, my copy of the source is still available: 
https://github.com/steveloughran/smartfrog

I'm not going to advocate adoption, as in a k8s-first world the units of 
deployment are now containers, not processes and/or components in processes. 
But the whole dynamic deployment problem is still there.
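
To make the "publish state, evaluate lazily" idea concrete, here is a minimal Java sketch. It is not SmartFrog's API or anything in Hadoop: the class PublishedState, its methods and the key name are all hypothetical. The only point it illustrates is that a reference is looked up when it is read, not when it is created, so a consumer can be wired up before the producer has published anything.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical registry (not a SmartFrog or Hadoop class). */
final class PublishedState {
    private final Map<String, String> values = new ConcurrentHashMap<>();

    /** A component publishes a value once it knows it, e.g. the port it bound to. */
    void publish(String key, String value) {
        values.put(key, value);
    }

    /** Returns a reference that is evaluated on every read, not at creation time. */
    Supplier<Optional<String>> lazyRef(String key) {
        return () -> Optional.ofNullable(values.get(key));
    }
}
```

A consumer holding lazyRef("namenode.rpc.port") sees an empty Optional until the producer calls publish, and the published value on any read after that.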

> Issues running in dynamic / managed environments
> ------------------------------------------------
>
>                 Key: HADOOP-18396
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18396
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.4.0, 3.3.9, 3.3.4
>         Environment: Running an HA configuration in Kubernetes, using Java 11.
>            Reporter: Steve Vaughan
>            Assignee: Steve Vaughan
>            Priority: Major
>
> Running in dynamic or managed environments is a challenge because we can't 
> assume that all services will have DNS entries, will be started in a specific 
> order, will maintain constant IP addresses, etc.  I'm using the following 
> assumptions to guide the changes necessary to operate in this kind of 
> environment:
>  # The configuration files are an expression of desired state
>  # If a referenced service instance is not resolvable or reachable at a 
> moment in time, it eventually will be, and it should then be able to 
> participate as if it had been there originally, without requiring manual 
> intervention
>  # IP address changes should be handled in a way that not only allows 
> distributed calls to continue to function, but avoids having to re-resolve 
> the address over and over
>  # Code that requires resolved names (Kerberos and DataNode registration) 
> should fall back to DNS reverse lookups to work around temporary issues 
> caused by caching.  Example: The DataNode registration is only performed at 
> startup, and yet the extra check that allows it to succeed in registering 
> with the NameNode isn’t performed
>  # If an HA system is supposed to only require a quorum, then we shouldn’t 
> require the full set, allowing the called service to bring the remaining 
> instances into compliance
>  # Managing a service should be independent of other services.  Example: You 
> should be able to perform a rolling restart of JournalNodes without worrying 
> about causing an issue with NameNodes as long as a quorum is present.
> A proof of these concepts would be the ability to:
>  * Start and function with fewer than the full replica count of a service, 
> as long as the required quorum or minimal count is present.  Example: 2 out 
> of 3 configured JournalNodes should still allow the NameNode to format, 
> function, roll over to the standby, etc.
>  * Introduce missing instances, which should join the existing cluster 
> without manual intervention.  Example: A 3rd JournalNode, once started, 
> should automatically be formatted and brought up to date
>  * Perform rolling restarts of individual services without negatively 
> impacting other services (causing failures, restarts, etc.).  Example: 
> Rolling restarts of JournalNodes shouldn't cause problems in NameNodes; 
> Rolling restarts of NameNodes shouldn't cause problems with DataNodes
>  * Report updated IP addresses in the logs only once (per dependent), 
> avoiding costly re-resolution
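
The address-handling points in the description (tolerate a host that is not resolvable yet, avoid re-resolving on every call, report a changed IP address once) can be sketched in Java as follows. This is an illustration under stated assumptions, not the patch for this issue: CachedAddress and all of its methods are hypothetical names, and the resolver is passed in as a function so the behaviour can be exercised without real DNS.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical helper (not a Hadoop class): caches one host's resolved address. */
final class CachedAddress {
    private final String host;
    private final Function<String, Optional<InetAddress>> resolver;
    private final List<String> changes = new ArrayList<>();
    private InetAddress current; // null until the first successful lookup

    CachedAddress(String host, Function<String, Optional<InetAddress>> resolver) {
        this.host = host;
        this.resolver = resolver;
    }

    /** Resolver backed by the JVM lookup; an unknown host means "not yet", not an error. */
    static Optional<InetAddress> dnsLookup(String host) {
        try {
            return Optional.of(InetAddress.getByName(host));
        } catch (UnknownHostException e) {
            return Optional.empty();
        }
    }

    /** Returns the cached address, resolving only while nothing is cached. */
    synchronized Optional<InetAddress> get() {
        if (current == null) {
            current = resolver.apply(host).orElse(null);
        }
        return Optional.ofNullable(current);
    }

    /** Call after a connection failure: re-resolve, keeping the old address if lookup fails. */
    synchronized Optional<InetAddress> refresh() {
        Optional<InetAddress> fresh = resolver.apply(host);
        if (fresh.isPresent() && !fresh.get().equals(current)) {
            if (current != null) {
                // Recorded once per actual change, not once per call.
                changes.add(host + " moved from " + current.getHostAddress()
                        + " to " + fresh.get().getHostAddress());
            }
            current = fresh.get();
        }
        return Optional.ofNullable(current);
    }

    /** Copy of the recorded address changes, oldest first. */
    synchronized List<String> changes() {
        return new ArrayList<>(changes);
    }
}
```

With real DNS the helper would be created as new CachedAddress(host, CachedAddress::dnsLookup). The caller decides when to call refresh(), for example after a failed connection, so a healthy connection never triggers another lookup.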



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

