[
https://issues.apache.org/jira/browse/HADOOP-14876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226300#comment-16226300
]
Anu Engineer commented on HADOOP-14876:
---------------------------------------
[~templedf] Thanks for putting in an effort to get this done. I really
appreciate all the thought that you have put into this document. I have some
minor suggestions.
* Use cases Matrix: We have nine states. It would be nice to have a matrix that
defines what changes in what release.
For example (based on InterfaceClassification.html), not suggesting that these
are the definitions, but add something that makes sense.
1. Public-Stable - Changes only in a major release.
2. Public-Evolving - Changes possible in major, minor release.
3. Public-Unstable - Only for Web-UI.
4. Limited-Stable - Changes possible in major, minor release.
5. Limited-Evolving - Changes possible in major, minor release.
6. Limited-Unstable - Changes possible in major, minor and maintenance release.
7. Private-* - Changes possible in major, minor and maintenance release.
* It would be good to define which kind of releases are possible -- major,
minor, and maintenance.
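As a rough illustration of what such a matrix could look like in
machine-checkable form, here is a minimal Python sketch. The entries simply
mirror the placeholder list above and are not proposed policy; the names
Release, CHANGE_MATRIX and may_change are made up for this example.
Public-Unstable is left out because the list describes it by scope ("Only for
Web-UI") rather than by release type.

```python
from enum import Enum


class Release(Enum):
    MAJOR = "major"
    MINOR = "minor"
    MAINTENANCE = "maintenance"


# Placeholder values copied from the list above; not actual policy.
ALL_RELEASES = frozenset(Release)
MAJOR_MINOR = frozenset({Release.MAJOR, Release.MINOR})

# (audience, stability) -> release types in which a change is allowed.
CHANGE_MATRIX = {
    ("Public", "Stable"): frozenset({Release.MAJOR}),
    ("Public", "Evolving"): MAJOR_MINOR,
    ("Limited", "Stable"): MAJOR_MINOR,
    ("Limited", "Evolving"): MAJOR_MINOR,
    ("Limited", "Unstable"): ALL_RELEASES,
    ("Private", "Stable"): ALL_RELEASES,
    ("Private", "Evolving"): ALL_RELEASES,
    ("Private", "Unstable"): ALL_RELEASES,
}


def may_change(audience, stability, release):
    """Return True if the matrix allows a change in the given release type."""
    return release in CHANGE_MATRIX[(audience, stability)]
```

With these placeholder values, may_change("Public", "Stable", Release.MINOR)
is False, while may_change("Private", "Evolving", Release.MAINTENANCE) is True.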
* Semantic compatibility
The semantics of the cluster is also defined by config files. The default
values of the settings and some new settings can change the semantics. We
should not break the compatibility in maintenance releases.
Currently, I am assuming that all Configs are public, but there are many that
do not have definitions in the default XML. We should mandate that these values
are not modified in maintenance releases.
Perhaps we should add a clause that states
"No new configuration shall be added which can change the behavior of an
existing cluster. For any new settings that are defined, care should be taken
to ensure that it does not change the behavior of existing clusters."
* "The list of client artifacts is as follows:" -- may I suggest that we add
the word "current", since someone could add a new jar without breaking
compatibility. IMHO, the guarantee should be that we will not break existing
code; if we wanted to add a new JAR, it should be possible.
* Hadoop Env Vars: "that are meaningful to Hadoop" -- This is a very loose
definition. We should list out what will not change. Otherwise, all Hadoop
variables are fair game. If that is the intention, I suggest that we state that
explicitly.
* Native Dependencies: As a non-native English language speaker, I wonder if
this statement is ambiguous.
"Changes to the minimum required versions SHOULD NOT increase between minor
releases within a major version, though updates because of security issues,
license issues, or other reasons may occur."
Could we rewrite this as:
"Hadoop will strive to keep the minimum required versions of external
dependencies stable during the lifetime of a major version. However, for
reasons such as security, licensing, or end-of-life of a component, we may be
forced to upgrade."
* Protocol Dependencies: "The components of Apache Hadoop may have
dependencies that include their own protocols, such as Zookeeper, S3, Kerberos,
etc. These protocol dependencies SHALL be treated as internal protocols and
governed by the same policy."
I don't think that we can treat S3 or Kerberos as internal protocols. I suggest
that we rewrite this as "To the extent possible, we will strive to maintain the
same policies for external protocols (S3, Kerberos, etc.) that are used by
Hadoop."
* Transports: "Fixed service port numbers MUST be kept consistent to prevent
breaking clients." Did you mean to write "default service ports" instead of
"fixed"?
* "New transport mechanisms MUST only be introduced with minor or major version
changes."
I am not sure why this constraint is in place. I am trying to understand how
introducing a new transport (assuming that older transports are stable) affects
compatibility.
* Log output: "Log messages are intended for human consumption, though
automation use cases are also supported." Not sure if this is intended, but
"automation use cases are also supported" seems to imply that log will be
parsable and stable. I am sure that is not what we want to offer. Should we
just remove the automation phrase?
* "All log output SHALL be considered Public and Evolving"
I worry this is not sustainable. Let me provide an example: say I search for a
word, such as "block", and use that in a script which greps the log and
identifies an event. Someone then adds a log statement that contains the same
word. My parser stops working, even in a maintenance release. So in my mind, we
should tag all log output as private and unstable, intended only for human
consumption. If the intent is to specify that the log format will not change,
then we should state that it is the log format that will not change.
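To make the failure mode concrete, here is a minimal Python sketch of such a
grep-style script. The log lines and the find_events helper are invented for
illustration and are not actual Hadoop output.

```python
def find_events(log_lines, keyword="block"):
    """Grep-style filter: return every line that contains the keyword."""
    return [line for line in log_lines if keyword in line]


# Hypothetical log lines, invented for this example.
before = [
    "INFO Received block blk_1 from client",
    "INFO Heartbeat sent",
]

# A later release adds an unrelated message that also contains "block".
after = before + ["WARN Thread blocked waiting for lock"]

print(len(find_events(before)))  # 1 matching line
print(len(find_events(after)))   # 2 matching lines: one is a false positive
```

Nothing about the format of the original message changed, yet the script now
reports an extra event, which is why a stability promise on the log content as
a whole is hard to keep.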
* HDFS Metadata: HDFS data nodes store data in a private directory structure.
The schema of that directory structure must remain stable to retain
compatibility.
If we have an upgrade path, I submit that changing the schema should be
possible. In fact, I think we should simply say that upgrade and rollback of
data stored in data nodes should be possible.
* Command Line Interface -- More of a question. Are we sure that the 3.0
release is entirely compliant with this spec? For example, is the slaves.txt
change covered by this? And if so, is that change fully compatible?
* Hadoop Configuration Files: Please see my comment in the semantics section.
* Directory Structure: Changing the directory structure of these
user-accessible files can break compatibility, even in cases where the original
path is preserved via symbolic links.
Do you have a case where this has happened? If not, we should allow this
change. "user-accessible" is an extensive term. Does it mean all users along
with Admins? If it is admins, all files that we ship with Hadoop will fall into
the scope of this statement. So perhaps, we should define what this means, or
say that files accessed via protocols offered by HDFS (RPC and HTTP) will
remain stable.
* Operating Systems: We should have a full list of supported versions
documented somewhere. Is there such a link? If so, can you please add a pointer
to it in this document?
> Create downstream developer docs from the compatibility guidelines
> ------------------------------------------------------------------
>
> Key: HADOOP-14876
> URL: https://issues.apache.org/jira/browse/HADOOP-14876
> Project: Hadoop Common
> Issue Type: Improvement
> Components: documentation
> Affects Versions: 3.0.0-beta1
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical
> Attachments: Compatibility.pdf, DownstreamDev.pdf,
> HADOOP-14876.001.patch, HADOOP-14876.002.patch, HADOOP-14876.003.patch,
> HADOOP-14876.004.patch
>
>