Jim Hughes created FLINK-39399:
----------------------------------
Summary: Integer overflow in HyperLogLogPlusPlus.query() causes
APPROX_COUNT_DISTINCT undercount at high cardinality
Key: FLINK-39399
URL: https://issues.apache.org/jira/browse/FLINK-39399
Project: Flink
Issue Type: Bug
Components: Table SQL / Runtime
Reporter: Jim Hughes
APPROX_COUNT_DISTINCT produce incorrect (undercounted) results when the true
cardinality exceeds ~100 million distinct values.
The root cause is an integer overflow in HyperLogLogPlusPlus.query(). In
HyperLogLogPlusPlus.java line 4950:
{code:java}
zInverse += 1.0 / (1 << mIdx);
{code}
mIdx is the per-register value representing the position of the leading 1-bit
in a hash. At high cardinality, some registers accumulate values >= 32. Because
1 is an int, Java's << operator masks the shift count modulo 32, so 1 << 35
silently evaluates to 1 << 3 = 8 instead of the correct 34359738368. This
corrupts the zInverse accumulator and produces a dramatically wrong cardinality
estimate.
*Fix:*
Change 1 << mIdx to 1L << mIdx so the shift uses long arithmetic (modulo 64,
which is sufficient since register values max out around 51 for a 64-bit hash).
{code:java}
zInverse += 1.0 / (1L << mIdx);
{code}
*How to reproduce:*
Construct an HLL buffer where every register holds a value >= 32 (e.g., 35) and
call query(). With the bug, the estimate is ~95K. With the fix, the estimate
correctly reflects the large register values (~4e14). Alternatively, run
APPROX_COUNT_DISTINCT over ~120M+ distinct values and observe the estimate is
orders of magnitude too low.
(Diagnosed and generated with Claude code.)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)