Jim Hughes created FLINK-39399:
----------------------------------

             Summary: Integer overflow in HyperLogLogPlusPlus.query() causes 
APPROX_COUNT_DISTINCT undercount at high cardinality
                 Key: FLINK-39399
                 URL: https://issues.apache.org/jira/browse/FLINK-39399
             Project: Flink
          Issue Type: Bug
          Components: Table SQL / Runtime
            Reporter: Jim Hughes


APPROX_COUNT_DISTINCT produce incorrect (undercounted) results when the true 
cardinality exceeds ~100 million distinct values.

The root cause is an integer overflow in HyperLogLogPlusPlus.query(). In 
HyperLogLogPlusPlus.java line 4950:

{code:java}
zInverse += 1.0 / (1 << mIdx);
{code}

mIdx is the per-register value representing the position of the leading 1-bit 
in a hash. At high cardinality, some registers accumulate values >= 32. Because 
1 is an int, Java's << operator masks the shift count modulo 32, so 1 << 35 
silently evaluates to 1 << 3 = 8 instead of the correct 34359738368. This 
corrupts the zInverse accumulator and produces a dramatically wrong cardinality 
estimate.

*Fix:*

Change 1 << mIdx to 1L << mIdx so the shift uses long arithmetic (modulo 64, 
which is sufficient since register values max out around 51 for a 64-bit hash).

{code:java}
zInverse += 1.0 / (1L << mIdx);
{code}

*How to reproduce:*

Construct an HLL buffer where every register holds a value >= 32 (e.g., 35) and 
call query(). With the bug, the estimate is ~95K. With the fix, the estimate 
correctly reflects the large register values (~4e14). Alternatively, run 
APPROX_COUNT_DISTINCT over ~120M+ distinct values and observe the estimate is 
orders of magnitude too low.

(Diagnosed and generated with Claude code.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to