[ https://issues.apache.org/jira/browse/HADOOP-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624087#comment-14624087 ]

Gopal V commented on HADOOP-12217:
----------------------------------

bq. I would personally want to find out why fixing hashCode() in Hadoop's 
DoubleWritable breaks bucketing in Hive - I have some suspicions as to how that 
might happen, but I would need more info about exactly what you found.

Touching the hashCode() of any Writable breaks existing data distributions in 
Hive - the hash is used to assign rows to buckets to satisfy the bucketing 
(CLUSTERED BY ... INTO n BUCKETS) clauses in DDLs.

Bucket map-joins and sort-merge joins will give incorrect results after a 
change like this, because when you upgrade Hadoop, rows written before the 
upgrade end up in different buckets than rows written after it.
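
Concretely, the breakage can be sketched like so (a toy illustration, not 
Hadoop's or Hive's actual classes - `oldHash`, `newHash`, and `bucket` are 
illustrative names; the assumption, per HADOOP-12217, is that the pre-fix 
hashCode() simply narrowed the raw IEEE-754 bits to an int, while the fix 
folds the high bits in the way java.lang.Double.hashCode() does):

```java
public class BucketDemo {
    // Pre-fix style hash: keeps only the low 32 bits of the IEEE-754
    // encoding. For many doubles (e.g. small whole numbers) those low
    // 32 bits are all zero, so the values collide on hash 0.
    static int oldHash(double d) {
        return (int) Double.doubleToLongBits(d);
    }

    // Fixed style hash: fold the high 32 bits into the low 32 bits,
    // the same idea as java.lang.Double.hashCode().
    static int newHash(double d) {
        long bits = Double.doubleToLongBits(d);
        return (int) (bits ^ (bits >>> 32));
    }

    // Bucket assignment: non-negative hash modulo the bucket count.
    static int bucket(int hash, int numBuckets) {
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        // 1.0, 2.0 and 3.0 all collide under the old hash (all 0) ...
        System.out.println(oldHash(1.0) + " " + oldHash(2.0) + " " + oldHash(3.0));
        // ... so in a 7-bucket table they all sit in bucket 0,
        System.out.println(bucket(oldHash(2.0), 7));
        // while under the fixed hash 2.0 moves to a different bucket:
        // old data and new data no longer agree on bucket placement.
        System.out.println(bucket(newHash(2.0), 7));
    }
}
```

A bucketed join that assumes matching keys live in matching bucket files 
then silently misses rows, which is exactly the "incorrect results" failure 
mode described above.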

Take a look at what happened to VARCHAR, for instance: HIVE-8488.

bq. Luckily this operation is not only extremely cheap to perform,

The new optimized hashtable does *NOT* use Writable::hashCode(); it instead 
uses a post-serialization hashcode (i.e. a murmur hash of the byte[] produced 
by the BinarySortableSerDe). This is because allocating objects in the inner 
loop causes allocator churn and frequent GC pauses - it is cheaper to never 
allocate a Double/DoubleWritable at all, particularly when each one is going 
to be an L1 cache miss (Writable -> Double -> double).
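
The idea can be sketched like this (a toy illustration, not Hive's actual 
code: `hashSerializedDouble` is a hypothetical helper, and only murmur3's 
well-known `fmix64` finalizer stands in for the full murmur hash that Hive 
runs over the BinarySortableSerDe output):

```java
import java.nio.ByteBuffer;

public class SerializedHashDemo {
    // murmur3's 64-bit finalizer ("fmix64"): a cheap avalanche step
    // that spreads entropy across all bits of the input.
    static long fmix64(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    // Hash a serialized double straight out of the row buffer: no
    // DoubleWritable and no boxed Double is ever allocated in the
    // probe loop, just arithmetic on the long read from the bytes.
    static int hashSerializedDouble(byte[] row, int offset) {
        long raw = ByteBuffer.wrap(row, offset, 8).getLong();
        long mixed = fmix64(raw);
        return (int) (mixed ^ (mixed >>> 32));
    }

    public static void main(String[] args) {
        byte[] a = ByteBuffer.allocate(8).putDouble(1.0).array();
        byte[] b = ByteBuffer.allocate(8).putDouble(2.0).array();
        // Equal values serialize identically, so their hashes agree;
        // distinct values get well-spread hashes even when their raw
        // low 32 bits are identical (both zero here).
        System.out.println(hashSerializedDouble(a, 0));
        System.out.println(hashSerializedDouble(b, 0));
    }
}
```

Because the hash is computed from the serialized bytes, it is also completely 
decoupled from Writable::hashCode() - which is why the hashtable itself was 
insulated from the HADOOP-12217 fix.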

The use of murmur came in as part of the L1-cache-optimized hashtable in Hive 
0.14 (though it was committed the same month that 0.13 came out), which lets 
us pack roughly 6x the number of k-v pairs into the same amount of memory (a 
DoubleWritable object is far bigger than 8 bytes).

bq.  Once I figure out where/why Hive's behavior changed, I'll file a ticket 
there, too, if necessary, hopefully with useful patches 

Please try the same queries in Tez mode and see whether you hit the same 
issues - I suspect the core performance issues in MRv2 mode mostly have no 
recourse, because that code path can't readjust at runtime (which is what the 
L1-cache-optimized hash-join does).

> hashCode in DoubleWritable returns same value for many numbers
> --------------------------------------------------------------
>
>                 Key: HADOOP-12217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.19.0, 0.19.1, 0.20.0, 
> 0.20.1, 0.20.2, 0.20.203.0, 0.20.204.0, 0.20.205.0, 1.0.0, 1.0.1, 1.0.2, 
> 1.0.3, 1.0.4, 1.1.0, 1.1.1, 1.2.0, 0.21.0, 0.22.0, 0.23.0, 0.23.1, 0.23.3, 
> 2.0.0-alpha, 2.0.1-alpha, 2.0.2-alpha, 0.23.4, 2.0.3-alpha, 0.23.5, 0.23.6, 
> 1.1.2, 0.23.7, 2.1.0-beta, 2.0.4-alpha, 0.23.8, 1.2.1, 2.0.5-alpha, 0.23.9, 
> 0.23.10, 0.23.11, 2.1.1-beta, 2.0.6-alpha, 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.4.1, 
> 2.5.1, 2.5.2, 2.6.0, 2.7.0, 2.7.1
>            Reporter: Steve Scaffidi
>              Labels: easyfix
>         Attachments: HADOOP-12217.1.patch
>
>
> Because DoubleWritable.hashCode() is incorrect, using DoubleWritables as the 
> keys in a HashMap results in abysmal performance, due to hash code collisions.
> I discovered this when testing the latest version of Hive and certain mapjoin 
> queries were exceedingly slow.
> Evidently, Hive has its own wrapper/subclass around Hadoop's DoubleWritable 
> that used to override hashCode() with a correct implementation, but for some 
> reason they recently removed that code, so it now uses the incorrect 
> hashCode() method inherited from Hadoop's DoubleWritable.
> It appears that this bug has been there since DoubleWritable was created 
> (wow!), so I can understand if fixing it is impractical due to the 
> possibility of breaking things downstream, but I can't think of anything 
> that *should* break, off the top of my head.
> Searching JIRA, I found several related tickets, which may be useful for some 
> historical perspective: HADOOP-3061, HADOOP-3243, HIVE-511, HIVE-1629, 
> HIVE-7041



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
