HashPartitioner is not a safe partitioner for non-prime number of reducers,
particularly bad for 2^n, which seems to be a common use
------------------------------------------------------------------------------------------------------------------------------------
Key: PIG-2116
URL: https://issues.apache.org/jira/browse/PIG-2116
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Woody Anderson
The implementation of hashCode should not be assumed to be good. In particular,
the hashCode implementations of String and List (used by Tuple) are very bad
modulo 2^n.
We propose adding an additional perturbation of the int before doing the "%
reducers" bucketing.
HashMap.java uses such a perturbation to keep String.hashCode from causing
massive bucket collisions, but that perturbation is targeted explicitly at a
2^n number of buckets, and Pig's reducer count is not a power of two in general.
We propose possibly using the final mixing step from murmur3.
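As a sketch of that proposal (the fmix32 constants below are murmur3's published 32-bit finalizer; the helper names and the bucket() wrapper are illustrative, not Pig's actual partitioner code):

```python
MASK = 0xffffffff  # simulate unsigned 32-bit arithmetic

def fmix32(h):
    # murmur3's 32-bit finalization mix: xor-shift/multiply rounds
    # that avalanche every input bit across the whole output
    h ^= h >> 16
    h = (h * 0x85ebca6b) & MASK
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & MASK
    h ^= h >> 16
    return h

def bucket(key_hash, reducers):
    # perturb the raw hashCode before bucketing, so weak hashes
    # (String, Tuple) don't feed their structure straight into the modulus
    return fmix32(key_hash & MASK) % reducers
```

Unlike HashMap's supplemental hash, which only has to scramble the low-order bits for its power-of-two tables, a full finalizer like this aims at any modulus.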
here is some discussion of this issue for context:
This has some amusing implications: this hash is terrible for
2,4,8,16,31, and 32 reducers, so even in normal situations that's pretty
bad, especially if pig happens to pick 31 reducers because it has
104-106 mappers * 0.3.
31 is congruent to -1 mod 2^k for all 2 <= k <= 5, so in that case the hash is
effectively:
t[0]*(-1)^(n-1) + t[1]*(-1)^(n-2) + ... + t[n-2]*(-1) + t[n-1]
= (for odd n) t[0] - t[1] + t[2] - t[3] + t[4] - ...
So for example the string "mississippim" hashes to 0 (mod 32), as every
character at an even position is cancelled out by an equal character at an odd
position elsewhere.
H = 0
for c in "mississippim":
    H = H*31 + ord(c)
    print "%c: H=%d (mod 32)" % (c, H%32)
m: H=13 (mod 32)
i: H=28 (mod 32)
s: H=23 (mod 32)
s: H=28 (mod 32)
i: H=13 (mod 32)
s: H=6 (mod 32)
s: H=13 (mod 32)
i: H=28 (mod 32)
p: H=20 (mod 32)
p: H=28 (mod 32)
i: H=13 (mod 32)
m: H=0 (mod 32)
Similarly, with exactly 31 reducers the hash function cancels out entirely:
31 is 0 mod 31, so everything but the last item is multiplied by a positive
power of 31 (i.e. by 0), and the result is simply the value of the last item.
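That 31-reducer observation can be checked directly; a small Python sketch (helper name hypothetical), using strings short enough that 32-bit truncation never fires (overflow would fold 2^32 ≡ 4 (mod 31) back into the result):

```python
def java_string_hash(s):
    # String.hashCode(): h = h*31 + c, truncated to 32 bits
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) & 0xffffffff
    return h

# every earlier character is multiplied by a power of 31, which is
# 0 mod 31, so with 31 reducers only the last character matters:
for s in ["abc", "xyzc", "qc"]:
    print(s, java_string_hash(s) % 31)  # all land in bucket 99 % 31 == 6
```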
A simple fix is to add a post-hash mixing step that lets every bit of the
internal state nontrivially affect every bit of the hash output, ideally
flipping each output bit with probability 1/2. That way the modulo doesn't
distribute through the whole function back to the input, and the internal
state above whatever modulus still has some effect.
H = 0
for c in "mississippim":
    H = H*31 + ord(c)
    # these &0xffffffff ops are to simulate unsigned 32-bit math in python
    H = H&0xffffffff
    Hout = (H + (H<<3))&0xffffffff
    Hout = Hout ^ (Hout>>11)
    Hout = (Hout + (Hout<<15))&0xffffffff
    print "%c: H=%08x === %d (mod 32)" % (c, Hout, Hout%32)
m: H=01ea83d5 === 21 (mod 32)
i: H=3d39fa73 === 19 (mod 32)
s: H=6c78d8d4 === 20 (mod 32)
s: H=3c76f555 === 21 (mod 32)
i: H=0abb25ff === 31 (mod 32)
s: H=40df81c9 === 9 (mod 32)
s: H=cfc8a427 === 7 (mod 32)
i: H=cea62c2b === 11 (mod 32)
p: H=4594d493 === 19 (mod 32)
p: H=f14b432a === 10 (mod 32)
i: H=169be0b0 === 16 (mod 32)
m: H=7d57b59c === 28 (mod 32)
The mixing step only needs to be done once at the end. The one I
inserted was stolen from Bob Jenkins' hash site, which is required
reading for anyone who decides to implement their own hashing.
Or you could use a real (good, fast, tested) hash function like murmur3.
-Andy
On Thu, Jun 02, 2011 at 03:37:56PM -0700, Woody Anderson wrote:
> This caught me off guard the other day, so i figured i'd pass it along:
>
> the hashCode implementation of Tuple and String have very specific expansions
> which do not provide a lot of hashCode variance mod 2^k when the elements are
> all equal.
>
> string:
> t[0]*31^(n-1) + t[1]*31^(n-2) + ... + t[n-1]
> tuple:
> ..(((31 + t[0])*31 + t[1])*31 + t[2])*31 + t[3]..
>
> this expansion modulo powers of 2 is degenerate if the t[i] are all equal.
> eg. you group by (n0, n1) to do some work, and there is an unusually high
> number of tuples where n0 == n1; the value of n0/n1 then makes no difference,
> since (31 + x)*31 + x = 961 + 32x, which is 1 mod 16 for every x.
> the same goes if you're grouping by strings and have a lot of "a", "aa",
> "aaaa", "b", "bb", "bbb", etc. type data.
> this results in all the data ending up in a single reducer/part file, which
> is either a waste or going to kill your job.
> so, if you use 2^k reducers then that's a terrible group-by. and it's not
> going to be good (in general) for any non-prime.
>
> under 'normal' circumstances you probably won't notice this being a factor. I
> didn't notice until I used String.hashCode as part of a group-by to both
> group by my string and produce a semi-randomized output ordering (a sherpa
> requirement); this completely blew up when simply grouping by the string
> hadn't.
>
> so, if you have highly varied data elements, this is less of an issue,
> though a prime number of reducers will usually generalize better, and you
> won't suddenly wonder about the bad dispersal you're getting.
> -w
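The degenerate cases Woody describes are easy to reproduce; a Python sketch (helper names hypothetical) of String.hashCode and the List-style expansion Tuple inherits:

```python
MASK = 0xffffffff  # simulate unsigned 32-bit arithmetic

def java_string_hash(s):
    # String.hashCode(): h = h*31 + c
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) & MASK
    return h

def java_list_hash(items):
    # AbstractList.hashCode(): h starts at 1, then h = h*31 + e
    h = 1
    for e in items:
        h = (h * 31 + e) & MASK
    return h

# 31 is -1 mod 16, so a run of one repeated character telescopes:
# even-length runs all land in bucket 0 of 16
for s in ["aa", "aaaa", "bb", "bbbbbb"]:
    print(s, java_string_hash(s) % 16)  # all 0

# and a pair of equal values always lands in bucket 1 of 16:
# (31 + x)*31 + x = 961 + 32x, which is 1 mod 16 for any x
for x in (7, 100, 12345):
    print(x, java_list_hash([x, x]) % 16)  # all 1
```

With 16 reducers all of this data funnels into one or two part files; any mixing step that avalanches across the full 32 bits breaks the pattern.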
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira