kosiew commented on code in PR #19420:
URL: https://github.com/apache/datafusion/pull/19420#discussion_r3158920954


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########


Review Comment:
   It looks like `with_new_children` does not preserve `preselection_threshold`.
   
   Right now it keeps `fail_on_overflow`, but it drops the custom threshold and 
resets it back to the default (0.2). Since optimizer rules frequently rebuild 
expression trees using this method, any user configured threshold will be lost 
after the first rewrite.
   
   Could you add `.with_preselection_threshold(self.preselection_threshold)` to 
the builder chain so the value is carried over?
   
   
   The `preselection_threshold` is not being serialized to protobuf in 
`datafusion/proto/src/physical_plan/to_proto.rs`
   
   At the moment, `PhysicalBinaryExprNode` has no field for it, and 
`to_proto.rs` does not emit it. This means the value is silently lost during 
plan serialization, which affects distributed execution, Flight SQL, and any 
persistence or caching.
   
   This should be added to the proto definition and wired through both 
serialization and deserialization. Without that, users may see inconsistent or 
incorrect behavior after a round trip.



##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -49,31 +49,44 @@ use kernels::{
     concat_elements_utf8view, regex_match_dyn, regex_match_dyn_scalar,
 };
 
+/// Default threshold for pre-selection optimization in AND operations.
+/// When the ratio of true values in the left-hand side is below this 
threshold,
+/// the RecordBatch will be filtered before evaluating the right-hand side.
+pub const DEFAULT_PRESELECTION_THRESHOLD: f32 = 0.2;
+
 /// Binary expression
-#[derive(Debug, Clone, Eq)]
+#[derive(Debug, Clone)]
 pub struct BinaryExpr {
     left: Arc<dyn PhysicalExpr>,
     op: Operator,
     right: Arc<dyn PhysicalExpr>,
     /// Specifies whether an error is returned on overflow or not
     fail_on_overflow: bool,
+    /// Threshold ratio (0.0 to 1.0) for pre-selection optimization in AND 
operations.
+    /// When the ratio of true values in the LHS is <= this threshold, 
pre-selection is applied.
+    /// Set to 0.0 to disable pre-selection, or 1.0 to always enable it for 
AND operations.
+    preselection_threshold: f32,
 }
 
-// Manually derive PartialEq and Hash to work around 
https://github.com/rust-lang/rust/issues/78808
+// Manually derive PartialEq, Eq and Hash to work around 
https://github.com/rust-lang/rust/issues/78808
+// and because f32 doesn't implement Eq
 impl PartialEq for BinaryExpr {
     fn eq(&self, other: &Self) -> bool {
         self.left.eq(&other.left)
             && self.op.eq(&other.op)
             && self.right.eq(&other.right)
             && self.fail_on_overflow.eq(&other.fail_on_overflow)
+            && self.preselection_threshold == other.preselection_threshold
     }
 }
+impl Eq for BinaryExpr {}

Review Comment:
   The `Eq` implementation can behave incorrectly for NaN threshold values.
   
   Right now it compares floats directly, but `NaN == NaN` is false, which 
breaks the reflexivity requirement of `Eq`. Meanwhile, the `Hash` 
implementation uses `to_bits`, which treats NaN as equal to itself, so the two 
are inconsistent.
   
   A couple of options here would be to reject NaN in 
`with_preselection_threshold`, or to switch equality to use `to_bits()` so it 
aligns with the hash behavior.



##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -89,16 +102,32 @@ impl BinaryExpr {
             op,
             right,
             fail_on_overflow: false,
+            preselection_threshold: DEFAULT_PRESELECTION_THRESHOLD,
         }
     }
 
     /// Create new binary expression with explicit fail_on_overflow value
     pub fn with_fail_on_overflow(self, fail_on_overflow: bool) -> Self {
         Self {
-            left: self.left,
-            op: self.op,
-            right: self.right,
             fail_on_overflow,
+            ..self
+        }
+    }
+
+    /// Set the pre-selection threshold for AND operations.
+    ///
+    /// When evaluating `lhs AND rhs`, if the ratio of true values in `lhs`
+    /// is less than or equal to this threshold, the RecordBatch will be
+    /// filtered before evaluating `rhs`, which can improve performance
+    /// when `rhs` is expensive to evaluate.
+    ///
+    /// - Set to `0.0` to disable pre-selection optimization
+    /// - Set to `1.0` to always apply pre-selection for AND operations
+    /// - Default is [`DEFAULT_PRESELECTION_THRESHOLD`] (0.2)
+    pub fn with_preselection_threshold(self, threshold: f32) -> Self {

Review Comment:
   There is currently no input validation in `with_preselection_threshold`.
   
   The docs say the value should be between 0.0 and 1.0, but the method accepts 
anything, including NaN, infinity, or negative values. Some of these lead to 
surprising behavior.
   
   It would be helpful to at least add a debug assertion, or even return a 
`Result` and validate the input.



##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -5091,47 +5143,150 @@ mod tests {
 
         // All nulls shouldn't short-circuit for AND or OR
         assert!(matches!(
-            check_short_circuit(&null_value, &Operator::And),
+            check_short_circuit(
+                &null_value,
+                &Operator::And,
+                DEFAULT_PRESELECTION_THRESHOLD
+            ),
             ShortCircuitStrategy::None
         ));
         assert!(matches!(
-            check_short_circuit(&null_value, &Operator::Or),
+            check_short_circuit(
+                &null_value,
+                &Operator::Or,
+                DEFAULT_PRESELECTION_THRESHOLD
+            ),
             ShortCircuitStrategy::None
         ));
 
         // Test with scalar values
         // Scalar true
         let scalar_true = 
ColumnarValue::Scalar(ScalarValue::Boolean(Some(true)));
         assert!(matches!(
-            check_short_circuit(&scalar_true, &Operator::Or),
+            check_short_circuit(
+                &scalar_true,
+                &Operator::Or,
+                DEFAULT_PRESELECTION_THRESHOLD
+            ),
             ShortCircuitStrategy::ReturnLeft
         )); // Should short-circuit OR
         assert!(matches!(
-            check_short_circuit(&scalar_true, &Operator::And),
+            check_short_circuit(
+                &scalar_true,
+                &Operator::And,
+                DEFAULT_PRESELECTION_THRESHOLD
+            ),
             ShortCircuitStrategy::ReturnRight
         )); // Should return the RHS for AND
 
         // Scalar false
         let scalar_false = 
ColumnarValue::Scalar(ScalarValue::Boolean(Some(false)));
         assert!(matches!(
-            check_short_circuit(&scalar_false, &Operator::And),
+            check_short_circuit(
+                &scalar_false,
+                &Operator::And,
+                DEFAULT_PRESELECTION_THRESHOLD
+            ),
             ShortCircuitStrategy::ReturnLeft
         )); // Should short-circuit AND
         assert!(matches!(
-            check_short_circuit(&scalar_false, &Operator::Or),
+            check_short_circuit(
+                &scalar_false,
+                &Operator::Or,
+                DEFAULT_PRESELECTION_THRESHOLD
+            ),
             ShortCircuitStrategy::ReturnRight
         )); // Should return the RHS for OR
 
         // Scalar null
         let scalar_null = ColumnarValue::Scalar(ScalarValue::Boolean(None));
         assert!(matches!(
-            check_short_circuit(&scalar_null, &Operator::And),
+            check_short_circuit(
+                &scalar_null,
+                &Operator::And,
+                DEFAULT_PRESELECTION_THRESHOLD
+            ),
+            ShortCircuitStrategy::None
+        ));
+        assert!(matches!(
+            check_short_circuit(
+                &scalar_null,
+                &Operator::Or,
+                DEFAULT_PRESELECTION_THRESHOLD
+            ),
+            ShortCircuitStrategy::None
+        ));
+    }
+
+    #[test]
+    fn test_preselection_threshold() {

Review Comment:
   The test for `preselection_threshold` does not exercise the full `evaluate` 
path.
   
   It would be great to add a test that builds a `BinaryExpr` with a custom 
threshold, runs `evaluate` on a batch with mixed values, and verifies that the 
RHS is only evaluated on the expected rows. This would also help catch issues 
like the `with_new_children` behavior.



##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -49,31 +49,44 @@ use kernels::{
     concat_elements_utf8view, regex_match_dyn, regex_match_dyn_scalar,
 };
 
+/// Default threshold for pre-selection optimization in AND operations.
+/// When the ratio of true values in the left-hand side is below this 
threshold,
+/// the RecordBatch will be filtered before evaluating the right-hand side.
+pub const DEFAULT_PRESELECTION_THRESHOLD: f32 = 0.2;

Review Comment:
   Now that `DEFAULT_PRESELECTION_THRESHOLD` is public, it might be worth 
adding a small note in the docs that this value is not guaranteed to remain 
stable across releases.
   
   That helps avoid downstream code accidentally depending on the exact value.



##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -49,31 +49,44 @@ use kernels::{
     concat_elements_utf8view, regex_match_dyn, regex_match_dyn_scalar,
 };
 
+/// Default threshold for pre-selection optimization in AND operations.
+/// When the ratio of true values in the left-hand side is below this 
threshold,

Review Comment:
   There is no obvious way for users to configure this via `SessionConfig` or a 
query level hint, which you already noted is an architectural limitation.
   
   Given that, the current builder approach makes sense. It might help to 
expand the docs with a short usage example showing how callers are expected to 
apply this, for example by transforming expressions on the physical plan.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to