[jira] [Commented] (HADOOP-11656) Classpath isolation for downstream clients

Colin Patrick McCabe (JIRA) Mon, 02 Mar 2015 10:56:07 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343565#comment-14343565
 ]

Colin Patrick McCabe commented on HADOOP-11656:
-----------------------------------------------

Thank you for filing this, [~busbey].  +1000 for fixing this... it is a huge 
pain point in Hadoop deployments.

bq. Steve wrote: There's another strategy, which is pure-REST-client. Why do we 
need an HDFS client using Hadoop IPC when we have webHDFS? Same for YARN? Even 
a YARN app shouldn't need to pull in yarn-*.jar, though there's enough IPC 
there & other things you probably would have to.

A pure REST client is slower than a pure Java client, and can't do things like 
zero-copy reads, short-circuit reads, and so forth.  Another way to see this is 
that HttpFS and WebHDFS have been around for a long time, and haven't solved 
this problem for our users.

bq. the other strategy "ultra-lean client" is appealing, though we're fairly 
contaminated with Guava, commons-logging, SLF4J, httpclient, commons-lang, etc. 
The notion of "single client JAR" is going to be hard to pull off without 
embracing Shading, and the wrongness that comes from that.

Guava is a really nice library.  It's nice on the server, and it's just as nice 
on the client.  We had this discussion earlier when someone attempted to remove 
Guava from the client... "that dog won't hunt."  And even if it did, we have 
Jackson, Protobuf, AmazonAWS, zookeeper, jersey, glassfish, avro, jetty, and on 
and on.

We *can't* solve this problem by minimizing dependencies, because even if we 
do a huge amount of code-worsening wheel-reinvention to get rid of our nice 
utility libraries, we are still stuck with dependencies like Protobuf and 
Jetty.  The Protobuf 2.4.1 -> 2.5.0 transition caused a huge amount of pain for 
users and developers.  And we all know about the security implications of using 
old libraries.  In a larger sense, good software architecture should involve 
code reuse and libraries when appropriate.  Treating dependencies as 
"contamination" will just result in more "not invented here" syndrome.  It 
doesn't scale.

bq. ps, we don't really make dependency promises. If you look at the Hadoop 
compatibility document, you can see we explicitly say "no guarantees". That's 
not an accident. We're just being somewhat cautious about updating things. If, 
say, HBase, accumulo & Oozie all wanted a co-ordinated update, we could try.

"Not making dependency promises" is just kicking the problem out to our users.  
It makes people unwilling to upgrade because they don't know if their code will 
be broken by the removal or alteration of a jar they need.  Case in point: 
Jackson 1.8.8 -> 1.9 broke a lot of user code because it removed 
{{defaultPrettyPrintingWriter}} and replaced it with a function called 
{{writerWithDefaultPrettyPrinter}}.  This is why some enterprise distros didn't 
pick up the change.

We have tried dependency harmonization in the past.  It doesn't work, because 
different projects have different release schedules and different needs.  Not 
to mention different communities.  Also, projects like HBase want to support 
multiple versions of Hadoop.  This means that they either have to live with 
mixed versions of things like Guava, Jetty, etc. or agree to never update 
dependencies.

bq. Do you propose writing your own classloader? If so, we're in trouble, based 
on my experience with every single classloader I have encountered. The 
consensus has gathered around OSGi not because it is any better than other 
people's attempts, it is simply no worse, and with "a standard", you the 
individual don't take the hit for: security problems, .class leakage, object 
equality breakage, classloader leakage, etc etc. Simple example, UGI relies on 
being a singleton for its identity management. Embrace classloaders and you 
have >1 UGI singleton, so had better be confident that their doAs identities 
worked as required.

Hadoop is a big project, and it is worth the effort to manage our own 
CLASSPATH.  If there are problems, we can work through them.  I am not opposed 
to OSGi, but I think that is a separate discussion.
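
For readers unfamiliar with the singleton concern quoted above, here is a minimal Java sketch of it. This is not Hadoop code: {{Identity}} is a hypothetical stand-in for a class such as UGI that expects to be a process-wide singleton, and {{IsolatingLoader}} is an illustrative loader made up for this example. It assumes Java 9+ and that the compiled classes are readable as resources from the loader that loaded them.

```java
import java.io.IOException;
import java.io.InputStream;

public class SingletonPerLoaderDemo {

    /** Hypothetical stand-in for a class that assumes it is a process-wide singleton. */
    public static class Identity {
        public static final Identity INSTANCE = new Identity();
        private Identity() {}
    }

    /** Illustrative loader: defines one named class itself, delegates everything else. */
    static class IsolatingLoader extends ClassLoader {
        private final String isolatedName;

        IsolatingLoader(ClassLoader parent, String isolatedName) {
            super(parent);
            this.isolatedName = isolatedName;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(isolatedName)) {
                return super.loadClass(name, resolve); // normal parent-first delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] bytes = readClassBytes(name);
                    c = defineClass(name, bytes, 0, bytes.length); // a new Class per loader
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        /** Reads the class file bytes through the parent loader's resources. */
        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String resource = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(resource)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                return in.readAllBytes();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Loads Identity through a fresh IsolatingLoader and returns that copy's INSTANCE. */
    public static Object instanceFromNewLoader() throws ReflectiveOperationException {
        ClassLoader parent = SingletonPerLoaderDemo.class.getClassLoader();
        ClassLoader loader = new IsolatingLoader(parent, Identity.class.getName());
        Class<?> isolated = Class.forName(Identity.class.getName(), true, loader);
        return isolated.getField("INSTANCE").get(null);
    }

    public static void main(String[] args) throws Exception {
        Object a = instanceFromNewLoader();
        Object b = instanceFromNewLoader();
        System.out.println("same class name:   "
                + a.getClass().getName().equals(b.getClass().getName()));
        System.out.println("same Class object: " + (a.getClass() == b.getClass()));
        System.out.println("same singleton:    " + (a == b));
    }
}
```

Each loader that defines the class gets its own {{Class}} object and therefore its own static {{INSTANCE}}, which is the ">1 UGI singleton" situation described in the quote. Any isolation scheme has to decide which classes are shared through a common parent loader and which are duplicated per loader.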

> Classpath isolation for downstream clients
> ------------------------------------------
>
>                 Key: HADOOP-11656
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11656
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>              Labels: classloading, classpath, dependencies
>
> Currently, Hadoop exposes downstream clients to a variety of third party 
> libraries. As our code base grows and matures we increase the set of 
> libraries we rely on. At the same time, as our user base grows we increase 
> the likelihood that some downstream project will run into a conflict while 
> attempting to use a different version of some library we depend on. This has 
> already happened with e.g. Guava several times for HBase, Accumulo, and Spark 
> (and I'm sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to 
> off and they don't do anything to help dependency conflicts on the driver 
> side or for folks talking to HDFS directly. This should serve as an umbrella 
> for changes needed to do things thoroughly on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that 
> doesn't pull in any third party dependencies
> 2) only see our public API classes (or as close to this as feasible) when 
> executing user provided code, whether client side in a launcher/driver or on 
> the cluster in a container or within MR.
> This provides us with a double benefit: users get less grief when they want 
> to run substantially ahead or behind the versions we need and the project is 
> freer to change our own dependency versions because they'll no longer be in 
> our compatibility promises.
> Project specific task jiras to follow after I get some justifying use cases 
> written in the comments.
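
As a rough illustration of goal 2 in the description above (user code only sees public API classes), here is a minimal Java sketch of an allow-list class loader. It is an assumption-laden toy, not the mechanism proposed in this issue: {{AllowListLoader}} and {{isVisible}} are hypothetical names, and the {{java.}} prefix stands in for whatever public API packages a real implementation would expose.

```java
import java.util.List;

public class PublicApiFilterDemo {

    /** Hypothetical loader that only exposes classes whose names match an allow-list. */
    static class AllowListLoader extends ClassLoader {
        private final List<String> allowedPrefixes;

        AllowListLoader(ClassLoader parent, List<String> allowedPrefixes) {
            super(parent);
            this.allowedPrefixes = allowedPrefixes;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            for (String prefix : allowedPrefixes) {
                if (name.startsWith(prefix)) {
                    return super.loadClass(name, resolve); // allowed: delegate to the parent
                }
            }
            // Everything else is hidden, even if the parent could load it.
            throw new ClassNotFoundException(name + " is hidden from user code");
        }
    }

    /** Returns true if the given loader can see the named class. */
    public static boolean isVisible(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /** Builds a loader that exposes only names starting with the given prefixes. */
    public static ClassLoader filteredLoader(List<String> allowedPrefixes) {
        return new AllowListLoader(PublicApiFilterDemo.class.getClassLoader(), allowedPrefixes);
    }

    public static void main(String[] args) {
        // Assumption: "java." plays the role of the public API packages.
        ClassLoader userCodeLoader = filteredLoader(List.of("java."));
        System.out.println(isVisible(userCodeLoader, "java.util.ArrayList"));
        System.out.println(isVisible(userCodeLoader, "javax.xml.parsers.DocumentBuilder"));
    }
}
```

User code loaded with such a loader as its parent would resolve only the allowed names through it, so third-party libraries on the framework's own classpath would not leak into the user's view. A real implementation would also need to handle resources and the classes that must be shared across the boundary.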

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
