kosiew commented on code in PR #22919:
URL: https://github.com/apache/datafusion/pull/22919#discussion_r3426459967


##########
benchmarks/sql_benchmarks/predicate_eval/load/corrproxy.sql:
##########
@@ -0,0 +1,30 @@
+-- Correlated-proxy dataset: a cheap integer predicate that is a perfect proxy
+-- for three string predicates, plus one independent string predicate.
+--
+--   c0    = 1 for ~30% of rows (cheap proxy)
+--   s1    contains 'aaa', 'ccc' and 'ddd' exactly where c0 = 1  (correlated)
+--   s2    contains 'bbb' for an independent ~30% of rows        (independent)
+--
+-- Marginally, the four regex predicates are indistinguishable: similar cost,
+-- the same ~30% selectivity. Their *conditional* selectivities behind the
+-- proxy differ completely: after `c0 = 1`, the three s1 regexes keep every
+-- survivor (each re-tests the proxy's condition) while the s2 regex still
+-- discards ~70%. Only joint statistics can see that; an independence
+-- assumption prices all four regexes identically in every position.
+--
+-- PRED_FILL sets the filler width around each marker (a non-matching
+-- `regexp_like` must scan the whole value), and PRED_ROWS sizes the table.
+CREATE TABLE t AS
+SELECT
+  CASE WHEN (value * 7) % 100 < 30 THEN 1 ELSE 0 END AS c0,
+  repeat('q', ${PRED_FILL:-30})
+    || CASE WHEN (value * 7) % 100 < 30 THEN 'aaa' ELSE 'zzz' END
+    || repeat('q', ${PRED_FILL:-30})
+    || CASE WHEN (value * 7) % 100 < 30 THEN 'ccc' ELSE 'zzz' END
+    || repeat('q', ${PRED_FILL:-30})
+    || CASE WHEN (value * 7) % 100 < 30 THEN 'ddd' ELSE 'zzz' END

Review Comment:
   Small readability suggestion: the proxy condition `(value * 7) % 100 < 30` 
is repeated four times here.
   
   It might be worth factoring it into a CTE or subquery and reusing a `proxy` 
boolean or int expression. That would make the intended perfect-proxy invariant 
more explicit and reduce the chance of drift if the threshold changes later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to