[
https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343565#comment-14343565
]
Colin Patrick McCabe commented on HADOOP-11656:
-----------------------------------------------
Thank you for filing this, [~busbey]. +1000 for fixing this... it is a huge
pain point in Hadoop deployments.
bq. Steve wrote: There's another strategy, which is pure-REST-client. Why do we
need an HDFS client using Hadoop IPC when we have webHDFS? Same for YARN? Even
a YARN app shouldn't need to pull in yarn-*.jar, though there's enough IPC
there & other things you probably would have to.
A pure REST client is slower than a pure Java client, and can't do things like
zero-copy reads, short-circuit reads, and so forth. Another way to see this is
that httpfs and webhdfs have been around for a long time, and they haven't
solved this problem for our users.
bq. the other strategy "ultra-lean client" is appealing, though we're fairly
contaminated with Guava, commons-logging, SLF4J, httpclient, commons-lang, etc.
The notion of "single client JAR" is going to be hard to pull off without
embracing Shading, and the wrongness that comes from that.
Guava is a really nice library. It's nice on the server, and it's just as nice
on the client. We had this discussion earlier when someone attempted to remove
Guava from the client... "that dog won't hunt." And even if it did, we have
Jackson, Protobuf, AmazonAWS, zookeeper, jersey, glassfish, avro, jetty, and on
and on.
We *can't* solve this problem by minimizing dependencies, because even if we
do a huge amount of code-worsening wheel-reinvention to get rid of our nice
utility libraries, we still are stuck with dependencies like Protobuf and
Jetty. The Protobuf 2.4.1 -> 2.5.0 transition caused a huge amount of pain for
users and developers. And we all know about the security implications of using
old libraries. In a larger sense, good software architecture should involve
code reuse and libraries when appropriate. Treating dependencies as
"contamination" will just result in more "not invented here" syndrome. It
doesn't scale.
bq. ps, we don't really make dependency promises. If you look at the Hadoop
compatibility document, you can see we explicitly say "no guarantees". That's
not an accident. We're just being somewhat cautious about updating things. If,
say, HBase, accumulo & Oozie all wanted a co-ordinated update, we could try.
"Not making dependency promises" is just kicking the problem out to our users.
It makes people unwilling to upgrade because they don't know if their code will
be broken by the removal or alteration of a jar they need. Case in point:
Jackson 1.8.8 -> 1.9 broke a lot of user code because it removed
{{defaultPrettyPrintingWriter}} and replaced it with a function called
{{writerWithDefaultPrettyPrinter}}. This is why some enterprise distros didn't
pick up the change.
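To make the failure mode concrete: code compiled against the old method name
fails with NoSuchMethodError once the newer jar is on the classpath. Downstream
code that has to run against both versions ends up writing shims like the
sketch below. {{OldMapper}}, {{NewMapper}} and {{prettyWriter}} are hypothetical
stand-ins invented for this illustration, not Jackson classes; only the two
method names come from the rename described above.

```java
import java.lang.reflect.Method;

class RenamedMethodDemo {

    /** Hypothetical stand-in for the older library version. */
    public static class OldMapper {
        public String defaultPrettyPrintingWriter() {
            return "old-api";
        }
    }

    /** Hypothetical stand-in for the newer version, where the method was renamed. */
    public static class NewMapper {
        public String writerWithDefaultPrettyPrinter() {
            return "new-api";
        }
    }

    /** Method names to try, newest first. */
    private static final String[] CANDIDATES = {
        "writerWithDefaultPrettyPrinter",
        "defaultPrettyPrintingWriter",
    };

    /**
     * Looks the method up by name at runtime, so the caller links against
     * neither name and survives whichever version is on the classpath.
     */
    static String prettyWriter(Object mapper) throws ReflectiveOperationException {
        for (String name : CANDIDATES) {
            try {
                Method m = mapper.getClass().getMethod(name);
                return (String) m.invoke(mapper);
            } catch (NoSuchMethodException e) {
                // This version does not have that name; try the next one.
            }
        }
        throw new NoSuchMethodException(
            "no known pretty-printer method on " + mapper.getClass().getName());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(prettyWriter(new OldMapper()));
        System.out.println(prettyWriter(new NewMapper()));
    }
}
```

Every downstream project paying this kind of tax for every renamed method is
what "no guarantees" costs in practice.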
We have tried dependency harmonization in the past. It doesn't work, because
different projects have different release schedules and different needs. Not
to mention different communities. Also, projects like HBase want to support
multiple versions of Hadoop. This means that they either have to live with
mixed versions of things like Guava, Jetty, etc. or agree to never update
dependencies.
bq. Do you propose writing your own classloader? If so, we're in trouble, based
on my experience with every single classloader I have encountered. The
consensus has gathered around OSGi not because it is any better than other
people's attempts, it is simply no worse, and with "a standard", you the
individual don't take the hit for: security problems, .class leakage, object
equality breakage, classloader leakage, etc etc. Simple example, UGI relies on
being a singleton for its identity management. Embrace classloaders and you
have >1 UGI singleton, so had better be confident that their doAs identities
worked as required.
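The singleton concern in the quote can be reproduced in a few lines. This is a
minimal sketch, not Hadoop code: {{Identity}} is a hypothetical stand-in for a
class like UGI that relies on a single static instance, and {{IsolatingLoader}}
is a toy loader invented here that defines its own copy of that one class
instead of delegating to its parent. It assumes Java 9 or later (for
{{InputStream.readAllBytes}}) and that the class bytes are readable as a
resource from the parent loader.

```java
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Field;

class SingletonPerLoaderDemo {

    /** Hypothetical stand-in for a class that relies on being a singleton. */
    public static class Identity {
        public static final Object INSTANCE = new Object();
    }

    /** Toy loader that defines its own copy of one class and delegates the rest. */
    static class IsolatingLoader extends ClassLoader {
        private final String isolatedName;

        IsolatingLoader(String isolatedName, ClassLoader parent) {
            super(parent);
            this.isolatedName = isolatedName;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            if (!name.equals(isolatedName)) {
                return super.loadClass(name, resolve); // normal parent-first path
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    // Read the same bytes the parent would use, but define them here.
                    String resource = name.replace('.', '/') + ".class";
                    try (InputStream in = getParent().getResourceAsStream(resource)) {
                        if (in == null) {
                            throw new ClassNotFoundException(name);
                        }
                        byte[] bytes = in.readAllBytes();
                        c = defineClass(name, bytes, 0, bytes.length);
                    } catch (IOException e) {
                        throw new ClassNotFoundException(name, e);
                    }
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }
    }

    /** Returns the INSTANCE field of Identity as seen through the given loader. */
    static Object instanceSeenBy(ClassLoader loader) throws ReflectiveOperationException {
        Class<?> c = Class.forName(Identity.class.getName(), true, loader);
        Field f = c.getField("INSTANCE");
        f.setAccessible(true);
        return f.get(null);
    }

    public static void main(String[] args) throws Exception {
        ClassLoader parent = SingletonPerLoaderDemo.class.getClassLoader();
        String name = Identity.class.getName();
        Object a = instanceSeenBy(new IsolatingLoader(name, parent));
        Object b = instanceSeenBy(new IsolatingLoader(name, parent));
        // Each loader ran the static initializer on its own copy of the class.
        System.out.println("same singleton across loaders: " + (a == b));
    }
}
```

Same bytes, same class name, two distinct "singletons": any isolation scheme
has to decide deliberately which classes are shared with the parent and which
are loaded per client.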
Hadoop is a big enough project that managing our own CLASSPATH is worth the effort. If
there are problems we can work through them. I am not opposed to OSGi but I
think that is a separate discussion.
> Classpath isolation for downstream clients
> ------------------------------------------
>
> Key: HADOOP-11656
> URL: https://issues.apache.org/jira/browse/HADOOP-11656
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: Sean Busbey
> Assignee: Sean Busbey
> Labels: classloading, classpath, dependencies
>
> Currently, Hadoop exposes downstream clients to a variety of third party
> libraries. As our code base grows and matures we increase the set of
> libraries we rely on. At the same time, as our user base grows we increase
> the likelihood that some downstream project will run into a conflict while
> attempting to use a different version of some library we depend on. This has
> already happened with e.g. Guava several times for HBase, Accumulo, and Spark
> (and I'm sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to
> off and they don't do anything to help dependency conflicts on the driver
> side or for folks talking to HDFS directly. This should serve as an umbrella
> for changes needed to do things thoroughly on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that
> doesn't pull in any third party dependencies
> 2) only see our public API classes (or as close to this as feasible) when
> executing user provided code, whether client side in a launcher/driver or on
> the cluster in a container or within MR.
> This provides us with a double benefit: users get less grief when they want
> to run substantially ahead or behind the versions we need and the project is
> freer to change our own dependency versions because they'll no longer be in
> our compatibility promises.
> Project specific task jiras to follow after I get some justifying use cases
> written in the comments.