Re: [PR] bench: add correlated-proxy case to the predicate_eval suite [datafusion]

via GitHub Wed, 17 Jun 2026 00:23:36 -0700


kosiew commented on code in PR #22919:
URL: https://github.com/apache/datafusion/pull/22919#discussion_r3426308958



##########
benchmarks/sql_benchmarks/predicate_eval/load/corrproxy.sql:
##########
@@ -0,0 +1,30 @@
+-- Correlated-proxy dataset: a cheap integer predicate that is a perfect proxy
+-- for three string predicates, plus one independent string predicate.
+--
+--   c0    = 1 for ~30% of rows (cheap proxy)
+--   s1    contains 'aaa', 'ccc' and 'ddd' exactly where c0 = 1  (correlated)
+--   s2    contains 'bbb' for an independent ~30% of rows        (independent)
+--
+-- Marginally, the four regex predicates are indistinguishable: similar cost,
+-- the same ~30% selectivity. Their *conditional* selectivities behind the
+-- proxy differ completely: after `c0 = 1`, the three s1 regexes keep every
+-- survivor (each re-tests the proxy's condition) while the s2 regex still
+-- discards ~70%. Only joint statistics can see that; an independence
+-- assumption prices all four regexes identically in every position.
+--
+-- PRED_FILL sets the filler width around each marker (a non-matching
+-- `regexp_like` must scan the whole value), and PRED_ROWS sizes the table.
+CREATE TABLE t AS
+SELECT
+  CASE WHEN (value * 7) % 100 < 30 THEN 1 ELSE 0 END AS c0,
+  repeat('q', ${PRED_FILL:-30})
+    || CASE WHEN (value * 7) % 100 < 30 THEN 'aaa' ELSE 'zzz' END
+    || repeat('q', ${PRED_FILL:-30})
+    || CASE WHEN (value * 7) % 100 < 30 THEN 'ccc' ELSE 'zzz' END
+    || repeat('q', ${PRED_FILL:-30})
+    || CASE WHEN (value * 7) % 100 < 30 THEN 'ddd' ELSE 'zzz' END

Review Comment:
   Small readability suggestion: the proxy condition `(value * 7) % 100 < 30` 
is repeated four times here.
   
   It might be worth factoring it into a CTE or subquery and reusing a `proxy` 
boolean or int expression. That would make the intended perfect-proxy invariant 
more explicit and reduce the chance of drift if the threshold changes later.



##########
benchmarks/sql_benchmarks/predicate_eval/load/corrproxy.sql:
##########
@@ -0,0 +1,30 @@
+-- Correlated-proxy dataset: a cheap integer predicate that is a perfect proxy
+-- for three string predicates, plus one independent string predicate.
+--
+--   c0    = 1 for ~30% of rows (cheap proxy)
+--   s1    contains 'aaa', 'ccc' and 'ddd' exactly where c0 = 1  (correlated)
+--   s2    contains 'bbb' for an independent ~30% of rows        (independent)
+--
+-- Marginally, the four regex predicates are indistinguishable: similar cost,
+-- the same ~30% selectivity. Their *conditional* selectivities behind the
+-- proxy differ completely: after `c0 = 1`, the three s1 regexes keep every
+-- survivor (each re-tests the proxy's condition) while the s2 regex still
+-- discards ~70%. Only joint statistics can see that; an independence
+-- assumption prices all four regexes identically in every position.
+--
+-- PRED_FILL sets the filler width around each marker (a non-matching
+-- `regexp_like` must scan the whole value), and PRED_ROWS sizes the table.
+CREATE TABLE t AS
+SELECT
+  CASE WHEN (value * 7) % 100 < 30 THEN 1 ELSE 0 END AS c0,
+  repeat('q', ${PRED_FILL:-30})
+    || CASE WHEN (value * 7) % 100 < 30 THEN 'aaa' ELSE 'zzz' END
+    || repeat('q', ${PRED_FILL:-30})
+    || CASE WHEN (value * 7) % 100 < 30 THEN 'ccc' ELSE 'zzz' END
+    || repeat('q', ${PRED_FILL:-30})
+    || CASE WHEN (value * 7) % 100 < 30 THEN 'ddd' ELSE 'zzz' END
+    || repeat('q', ${PRED_FILL:-30}) AS s1,
+  repeat('q', ${PRED_FILL:-30})
+    || CASE WHEN (value * 13) % 100 < 30 THEN 'bbb' ELSE 'zzz' END
+    || repeat('q', ${PRED_FILL:-30}) AS s2
+FROM generate_series(1, ${PRED_ROWS:-1000000});

Review Comment:
   I think this needs a tweak before the benchmark can serve as the correlation 
headroom case.
   
   The key invariant here is that the four regex predicates should be 
marginally indistinguishable, so only joint or conditional stats can prefer 
`s2` after `c0`. Right now the generated data breaks that invariant: `s1` is 
roughly twice as wide as `s2`, the three `s1` markers occur at different 
offsets, and `s2` only has one marker after the first filler.
   
   That means a marginal cost estimator, or even runtime timing, could identify 
`regexp_like(s2, 'b.b')` as cheaper than most or all of the `s1` regexes 
without using joint stats. In that case q73 would no longer isolate 
correlation-aware ordering.
   
   Could we make the regex predicates have comparable marginal evaluation cost 
before using this as the correlation headroom case? For example, the strings 
could be equal width with equal marker offsets per predicate, or `s2` and the 
`s1` markers could be padded and positioned so a marginal-only system cannot 
distinguish them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] bench: add correlated-proxy case to the predicate_eval suite [datafusion]

Reply via email to