[
https://issues.apache.org/jira/browse/HADOOP-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624087#comment-14624087
]
Gopal V commented on HADOOP-12217:
----------------------------------
bq. I would personally want to find out why fixing hashCode() in Hadoop's
DoubleWritable breaks bucketing in Hive - I have some suspicions as to how that
might happen, but I would need more info about exactly what you found.
Touching the hashCode() of any Writable breaks existing data distributions in
Hive - the hash is used to route rows into buckets and satisfy the bucketing
(CLUSTERED BY ... INTO N BUCKETS) declared in DDLs.
Bucket map-joins and sorted-merge joins will give incorrect results after a
change like this, because old and new data end up in different buckets once
you upgrade Hadoop.
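The breakage can be illustrated with a small sketch. The mask-and-modulo
routing below is an assumption modeled on Hive's bucketing convention (not
Hive's actual code), and the two hash codes are the pre- and post-fix
DoubleWritable hashes:

```java
public class BucketDemo {
    // Hedged sketch of Hive-style bucket routing: mask off the sign bit,
    // then take the remainder by the declared bucket count. The exact
    // formula is illustrative, not copied from Hive's source.
    static int bucketFor(int hashCode, int numBuckets) {
        return (hashCode & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int buckets = 32;
        double key = 0.1;
        long bits = Double.doubleToLongBits(key);
        int oldHash = (int) bits;                   // old: truncate to low 32 bits
        int newHash = (int) (bits ^ (bits >>> 32)); // java.lang.Double-style mix
        // The same key routes to different buckets under the two hash codes,
        // so pre- and post-upgrade data would no longer line up.
        System.out.println("old bucket: " + bucketFor(oldHash, buckets));
        System.out.println("new bucket: " + bucketFor(newHash, buckets));
    }
}
```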
Take a look at what happened to Varchar, for instance - HIVE-8488.
bq. Luckily this operation is not only extremely cheap to perform,
The new optimized hashtable does *NOT* use Writable::hashCode(); instead it
uses a post-serialization hash code (i.e. a murmur hash of the byte[] produced
by BinarySortableSerDe). This is because allocating objects in the inner loop
causes allocator churn and frequent GC pauses - it is cheaper to never
allocate a Double/DoubleWritable at all, particularly when each one is going
to be an L1 cache miss (Writable -> Double -> double).
The murmur hash came in as part of the L1-cache-optimized hashtable in Hive
0.14 (though it was committed the same month 0.13 came out), which lets us
pack about 6x as many k-v pairs into the same amount of memory (a
DoubleWritable is far bigger than 8 bytes).
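The idea can be sketched as follows. The byte-folding and 64-bit finalizer
below are illustrative stand-ins (the real implementation murmur-hashes the
BinarySortableSerDe output); the point is that the serialized key bytes are
hashed in place, with no Double/DoubleWritable ever allocated:

```java
import java.nio.ByteBuffer;

public class SerializedHashDemo {
    // Illustrative post-serialization hash: fold the key bytes into a long,
    // then scramble with murmur3's fmix64 finalizer. This is a stand-in for
    // Hive's actual murmur hash, chosen only to show the shape of the idea.
    static long hashBytes(byte[] key) {
        long h = 0;
        for (byte b : key) {
            h = h * 31 + (b & 0xFF);    // fold bytes into a long
        }
        h ^= h >>> 33;                   // murmur3 fmix64 finalizer
        h *= 0xFF51AFD7ED558CCDL;
        h ^= h >>> 33;
        h *= 0xC4CEB9FE1A85EC53L;
        h ^= h >>> 33;
        return h;                        // low 32 bits would index the table
    }

    public static void main(String[] args) {
        // The 8 key bytes are written into a reusable buffer and hashed
        // directly - no boxing, no allocator churn in the inner loop.
        byte[] buf = new byte[8];
        ByteBuffer.wrap(buf).putDouble(1.0);
        System.out.println(hashBytes(buf));
        ByteBuffer.wrap(buf).putDouble(2.0);
        System.out.println(hashBytes(buf));
    }
}
```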
bq. Once I figure out where/why Hive's behavior changed, I'll file a ticket
there, too, if necessary, hopefully with useful patches
Please try the same queries in Tez mode and see whether you hit the same
issues - I suspect the core performance issues in MRv2 mode mostly have no
recourse, because MRv2 cannot readjust at runtime (which is what the
L1-cache-optimized hash-join does).
> hashCode in DoubleWritable returns same value for many numbers
> --------------------------------------------------------------
>
> Key: HADOOP-12217
> URL: https://issues.apache.org/jira/browse/HADOOP-12217
> Project: Hadoop Common
> Issue Type: Bug
> Components: io
> Affects Versions: 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.19.0, 0.19.1, 0.20.0,
> 0.20.1, 0.20.2, 0.20.203.0, 0.20.204.0, 0.20.205.0, 1.0.0, 1.0.1, 1.0.2,
> 1.0.3, 1.0.4, 1.1.0, 1.1.1, 1.2.0, 0.21.0, 0.22.0, 0.23.0, 0.23.1, 0.23.3,
> 2.0.0-alpha, 2.0.1-alpha, 2.0.2-alpha, 0.23.4, 2.0.3-alpha, 0.23.5, 0.23.6,
> 1.1.2, 0.23.7, 2.1.0-beta, 2.0.4-alpha, 0.23.8, 1.2.1, 2.0.5-alpha, 0.23.9,
> 0.23.10, 0.23.11, 2.1.1-beta, 2.0.6-alpha, 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.4.1,
> 2.5.1, 2.5.2, 2.6.0, 2.7.0, 2.7.1
> Reporter: Steve Scaffidi
> Labels: easyfix
> Attachments: HADOOP-12217.1.patch
>
>
> Because DoubleWritable.hashCode() is incorrect, using DoubleWritables as the
> keys in a HashMap results in abysmal performance, due to hash code collisions.
> I discovered this when testing the latest version of Hive and certain mapjoin
> queries were exceedingly slow.
> Evidently, Hive has its own wrapper/subclass around Hadoop's DoubleWritable
> that used to override hashCode() with a correct implementation, but for some
> reason that code was recently removed, so Hive now uses the incorrect
> hashCode() method inherited from Hadoop's DoubleWritable.
> It appears that this bug has been there since DoubleWritable was created
> (wow!), so I can understand if fixing it is impractical due to the
> possibility of breaking things downstream, but I can't think of anything
> that *should* break, off the top of my head.
> Searching JIRA, I found several related tickets, which may be useful for some
> historical perspective: HADOOP-3061, HADOOP-3243, HIVE-511, HIVE-1629,
> HIVE-7041
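For context, the collision the report describes can be reproduced directly:
the old hashCode() truncates the IEEE-754 bit pattern to its low 32 bits,
which are all zero for every small whole number. The fixedHash below is the
java.lang.Double-style mix, shown as one standard remedy (not necessarily
what the attached patch does):

```java
public class DoubleHashDemo {
    // Old Hadoop DoubleWritable.hashCode(): keep only the low 32 bits of
    // the IEEE-754 representation. For small whole numbers those bits are
    // all zero, so 1.0, 2.0, 3.0, ... all collide on hash 0.
    static int oldHash(double v) {
        return (int) Double.doubleToLongBits(v);
    }

    // One standard fix (the scheme java.lang.Double.hashCode uses): XOR the
    // high and low halves so the exponent and high mantissa bits contribute.
    static int fixedHash(double v) {
        long bits = Double.doubleToLongBits(v);
        return (int) (bits ^ (bits >>> 32));
    }

    public static void main(String[] args) {
        for (double v = 1.0; v <= 5.0; v += 1.0) {
            System.out.println(v + " old=" + oldHash(v) + " fixed=" + fixedHash(v));
        }
    }
}
```

Every `old=` value in the loop prints 0, which is exactly why a HashMap keyed
on DoubleWritable degenerates into one long collision chain.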
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)