[ 
https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342656#comment-14342656
 ] 

Steve Loughran commented on HADOOP-11656:
-----------------------------------------

I see the need, appreciate the idea, and know how much downstream projects will 
benefit.

But... "we are allowed to break things in the 3.x line" is not the same as "we 
should break things in the 3.x line". I need to understand what the plan is 
here, especially as "little" things like HADOOP-11293 show that what is 
considered private is in fact used in things like YARN apps downstream, as are 
HttpServer2 and AmIPFilter.

Do you propose writing your own classloader? If so, we're in trouble, based on 
my experience with *every single classloader I have encountered*. The consensus 
has gathered around OSGi not because it is any better than other people's 
attempts; it is simply no worse, and with "a standard", you the individual 
don't take the hit for security problems, .class leakage, object equality 
breakage, classloader leakage, etc. Simple example: UGI relies on being a 
singleton for its identity management. Embrace classloaders and you have >1 UGI 
singleton, so you had better be confident that their doAs identities work as 
required.
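
To make the singleton concern concrete, here is a minimal sketch of how a 
static "singleton" stops being one once the same class is defined by more than 
one classloader. Everything in it is hypothetical: Identity is only a stand-in 
for a class that keeps process-wide state in a static field (it is not UGI), 
and IsolatingLoader is a toy loader, not a proposal. It assumes the compiled 
Identity.class is readable as a classpath resource.

```java
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Field;

// Hypothetical stand-in for a class that keeps process-wide state in a static field.
class Identity {
    static final Object INSTANCE = new Object();
}

// Toy loader with no application parent: it defines classes itself from the
// class-file bytes, so each loader instance gets its own copy of Identity.
class IsolatingLoader extends ClassLoader {
    IsolatingLoader() {
        super((ClassLoader) null); // only bootstrap classes (java.*) are delegated
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        String resource = name.replace('.', '/') + ".class";
        ClassLoader source = IsolatingLoader.class.getClassLoader();
        try (InputStream in = source.getResourceAsStream(resource)) {
            if (in == null) {
                throw new ClassNotFoundException(name);
            }
            byte[] bytes = in.readAllBytes();
            return defineClass(name, bytes, 0, bytes.length);
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}

class SingletonSplitDemo {
    // Loads Identity through the given loader and returns the INSTANCE it sees.
    static Object instanceSeenBy(ClassLoader loader) {
        try {
            Class<?> c = Class.forName(Identity.class.getName(), true, loader);
            Field f = c.getDeclaredField("INSTANCE");
            f.setAccessible(true); // the copy lives in a different runtime package
            return f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Object a = instanceSeenBy(new IsolatingLoader());
        Object b = instanceSeenBy(new IsolatingLoader());
        System.out.println("same singleton? " + (a == b));
    }
}
```

When the class file can be found, the two loaders should report different 
objects, and neither matches the Identity.INSTANCE seen by the application 
classloader. Any identity state held in such a field is silently split the 
same way.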

The other strategy, "ultra-lean client", is appealing, though we're fairly 
contaminated with Guava, commons-logging, SLF4J, httpclient, commons-lang, etc. 
The notion of a "single client JAR" is going to be hard to pull off without 
embracing shading, and the wrongness that comes from that.

There's another strategy, which is pure-REST-client. Why do we need an HDFS 
client using Hadoop IPC when we have webHDFS? Same for YARN? Even a YARN app 
shouldn't need to pull in yarn-*.jar, though there's enough IPC and other 
things there that you probably would have to.
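
As a rough illustration of the pure-REST idea, the sketch below builds a 
WebHDFS LISTSTATUS request using nothing beyond the JDK (java.net.http needs 
Java 11+). The host name, port and path are made-up values, the path is 
assumed to be URL-safe already, and security (SPNEGO, delegation tokens) is 
ignored entirely.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class WebHdfsSketch {
    // Builds the WebHDFS URL for listing a directory: /webhdfs/v1/<path>?op=LISTSTATUS
    static URI listStatusUri(String host, int port, String path) {
        return URI.create("http://" + host + ":" + port + "/webhdfs/v1" + path + "?op=LISTSTATUS");
    }

    // Wraps the URL in a plain HTTP GET; no Hadoop classes are on the classpath.
    static HttpRequest listStatusRequest(String host, int port, String path) {
        return HttpRequest.newBuilder(listStatusUri(host, port, path)).GET().build();
    }

    public static void main(String[] args) {
        // Hypothetical NameNode address; nothing is sent, the request is only printed.
        HttpRequest request = listStatusRequest("namenode.example.com", 50070, "/user/alice");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Actually sending the request would be a single HttpClient call, which is the 
appeal: the client's dependency graph is the JDK and nothing else.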

I know HDFS-6200 covers trying to have a specific client JAR for HDFS, 
HADOOP-1815 for hadoop itself; I was talking with Haohui only last week on this 
topic, along with the specific topic of Jetty. (Actually I was proposing a 
Facebook "down with Guava" page but that won't solve the problem at hand)

bq. Updating your dependencies is a straight forward task.

Only if you can determine them at compile time. Even the change where we moved 
s3n:// support out of hadoop-common and into hadoop-aws, including its 
transitive dependencies, was risky enough, as it meant that anything assuming 
the s3n:// implementation was in hadoop-common & dependencies would break: a 
breakage that happens at runtime, not compile time.
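
A small sketch of why this kind of breakage slips past the compiler: when an 
implementation is looked up by name at runtime, nothing at compile time checks 
that the class is still on the classpath. The class name below is, as far as I 
recall, the s3n implementation, but treat it as an assumption; the helper and 
its name are purely illustrative.

```java
class RuntimeLookupDemo {
    // Returns true when the named class can be loaded from the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, RuntimeLookupDemo.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed implementation class name, normally read from configuration.
        String configured = args.length > 0
                ? args[0]
                : "org.apache.hadoop.fs.s3native.NativeS3FileSystem";
        // This compiles regardless of which JAR (if any) ships the class.
        if (isOnClasspath(configured)) {
            System.out.println(configured + " is available");
        } else {
            System.out.println(configured + " is missing: only discovered at runtime");
        }
    }
}
```

A downstream build that depends only on hadoop-common keeps compiling after 
the class moves to another module; the failure shows up the first time the 
lookup runs.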

Some POM-only modules, e.g. "hadoop-client-complete", are one tactic; another, 
more subtle, is to keep hadoop-common as is, but add some thinner ones you can 
pull in, e.g. "hadoop-lean-client".

Returning to the matter in hand, you call out Guava. Is it the case that Guava 
is the specific pain-point? We all hate being stuck on Guava 11, but have not 
upgraded because of downstream apps that would break if we moved off it. It may 
be that we can work together across the ASF projects & see how we can do a 
co-ordinated Guava update (hopefully with less pain than the 2013 protobuf 
update) and come up with a common strategy for dealing with Guava and other 
Google libraries whose backwards compatibility story isn't great.

Otherwise, I'm more in favour of lean clients with minimal dependencies 
(especially not Guava), with classpath isolation through OSGi an option. 
There's been a lot of historical work there, which could be restarted.

PS: we don't really make dependency promises. If you look at the [Hadoop 
compatibility 
document|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#Java_Classpath],
 you can see we explicitly say "no guarantees". That's not an accident. We're 
just being somewhat cautious about updating things. If, say, HBase, Accumulo & 
Oozie all wanted a co-ordinated update, we could try.

> Classpath isolation for downstream clients
> ------------------------------------------
>
>                 Key: HADOOP-11656
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11656
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>              Labels: classloading, classpath, dependencies
>
> Currently, Hadoop exposes downstream clients to a variety of third party 
> libraries. As our code base grows and matures we increase the set of 
> libraries we rely on. At the same time, as our user base grows we increase 
> the likelihood that some downstream project will run into a conflict while 
> attempting to use a different version of some library we depend on. This has 
> already happened with e.g. Guava several times for HBase, Accumulo, and Spark 
> (and I'm sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to 
> off and they don't do anything to help dependency conflicts on the driver 
> side or for folks talking to HDFS directly. This should serve as an umbrella 
> for changes needed to do things thoroughly on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that 
> doesn't pull in any third party dependencies
> 2) only see our public API classes (or as close to this as feasible) when 
> executing user provided code, whether client side in a launcher/driver or on 
> the cluster in a container or within MR.
> This provides us with a double benefit: users get less grief when they want 
> to run substantially ahead or behind the versions we need and the project is 
> freer to change our own dependency versions because they'll no longer be in 
> our compatibility promises.
> Project specific task jiras to follow after I get some justifying use cases 
> written in the comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
