[ https://issues.apache.org/jira/browse/HADOOP-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623500#comment-14623500 ]

Steve Scaffidi commented on HADOOP-12217:
-----------------------------------------

Oh! Also, using Murmur Hash here would not help, since the root of the problem 
is getting the hashCode of an object representing a number, which should 
(usually) involve simply using the number itself - typically a highly 
performant operation. For numeric types whose representation is larger than an 
int (32 bits), a typical solution is to XOR the number with itself shifted 
right by 32 bits, repeating until only 32 bits remain, then return that value.

Luckily, this operation is not only extremely cheap to perform, it's already 
built into Java's Double class, and documented here for those who might need 
to know how it's done:
  http://docs.oracle.com/javase/7/docs/api/java/lang/Double.html#hashCode%28%29
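
For illustration, here is a minimal sketch of that fold, matching what the 
Double.hashCode() Javadoc describes (the class and method names below are 
mine, purely for the example - not part of any Hadoop API):

  // A sketch of the XOR-fold described above, equivalent to what the
  // Double.hashCode() Javadoc specifies. Names are illustrative only.
  public final class DoubleHashSketch {
      public static int hash(double value) {
          // Get the 64-bit IEEE 754 representation, then fold the high
          // word into the low word so every bit influences the result.
          long bits = Double.doubleToLongBits(value);
          return (int) (bits ^ (bits >>> 32));
      }
  }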

This is exactly what the hashCode method in DoubleWritable currently does: for 
whole numbers between +/-MAX_INT (more or less), casting a double's bitwise 
representation to an int discards almost all of the significant bits 
(significant in the sense of representing the desired numerical value)! The 
implementation at the link above does the right thing, and provides far better 
distribution when assigning buckets in a HashMap.
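
To make the collision concrete, here is a quick demo, assuming the current 
behavior is a plain int cast of the double's bit pattern (which is what the 
issue describes):

  // Demonstrates the collision: for whole numbers of modest magnitude,
  // the low 32 bits of the IEEE 754 representation are all zero, so a
  // plain int cast maps every such value to the same hash code (0).
  public final class CollisionDemo {
      public static void main(String[] args) {
          for (double d : new double[] { 1.0, 2.0, 3.0, 1000.0, 123456.0 }) {
              long bits = Double.doubleToLongBits(d);
              int cast = (int) bits;                    // current behavior: all 0
              int fold = (int) (bits ^ (bits >>> 32));  // Double.hashCode() fold
              System.out.printf("%10.1f cast=%d fold=%d%n", d, cast, fold);
          }
      }
  }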

I would personally want to find out why fixing hashCode() in Hadoop's 
DoubleWritable breaks bucketing in Hive - I have some suspicions as to how that 
might happen, but I would need more info about exactly what you found.


> hashCode in DoubleWritable returns same value for many numbers
> --------------------------------------------------------------
>
>                 Key: HADOOP-12217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.19.0, 0.19.1, 0.20.0, 
> 0.20.1, 0.20.2, 0.20.203.0, 0.20.204.0, 0.20.205.0, 1.0.0, 1.0.1, 1.0.2, 
> 1.0.3, 1.0.4, 1.1.0, 1.1.1, 1.2.0, 0.21.0, 0.22.0, 0.23.0, 0.23.1, 0.23.3, 
> 2.0.0-alpha, 2.0.1-alpha, 2.0.2-alpha, 0.23.4, 2.0.3-alpha, 0.23.5, 0.23.6, 
> 1.1.2, 0.23.7, 2.1.0-beta, 2.0.4-alpha, 0.23.8, 1.2.1, 2.0.5-alpha, 0.23.9, 
> 0.23.10, 0.23.11, 2.1.1-beta, 2.0.6-alpha, 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.4.1, 
> 2.5.1, 2.5.2, 2.6.0, 2.7.0, 2.7.1
>            Reporter: Steve Scaffidi
>              Labels: easyfix
>
> Because DoubleWritable.hashCode() is incorrect, using DoubleWritables as the 
> keys in a HashMap results in abysmal performance, due to hash code collisions.
> I discovered this when testing the latest version of Hive and certain mapjoin 
> queries were exceedingly slow.
> Evidently, Hive has its own wrapper/subclass around Hadoop's DoubleWritable 
> that used to override hashCode() with a correct implementation, but for some 
> reason they recently removed that code, so it now uses the incorrect 
> hashCode() method inherited from Hadoop's DoubleWritable.
> It appears that this bug has been there since DoubleWritable was created 
> (wow!), so I can understand if fixing it is impractical due to the 
> possibility of breaking things downstream, but I can't think of anything 
> that *should* break, off the top of my head.
> Searching JIRA, I found several related tickets, which may be useful for some 
> historical perspective: HADOOP-3061, HADOOP-3243, HIVE-511, HIVE-1629, 
> HIVE-7041



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
