kosiew commented on code in PR #19420:
URL: https://github.com/apache/datafusion/pull/19420#discussion_r3158920954
##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
Review Comment:
It looks like `with_new_children` does not preserve `preselection_threshold`.
Right now it keeps `fail_on_overflow`, but it drops the custom threshold and
resets it back to the default (0.2). Since optimizer rules frequently rebuild
expression trees using this method, any user configured threshold will be lost
after the first rewrite.
Could you add `.with_preselection_threshold(self.preselection_threshold)` to
the builder chain so the value is carried over?
The `preselection_threshold` is not being serialized to protobuf in
`datafusion/proto/src/physical_plan/to_proto.rs`
At the moment, `PhysicalBinaryExprNode` has no field for it, and
`to_proto.rs` does not emit it. This means the value is silently lost during
plan serialization, which affects distributed execution, Flight SQL, and any
persistence or caching.
This should be added to the proto definition and wired through both
serialization and deserialization. Without that, users may see inconsistent or
incorrect behavior after a round trip.
##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -49,31 +49,44 @@ use kernels::{
concat_elements_utf8view, regex_match_dyn, regex_match_dyn_scalar,
};
+/// Default threshold for pre-selection optimization in AND operations.
+/// When the ratio of true values in the left-hand side is below this
threshold,
+/// the RecordBatch will be filtered before evaluating the right-hand side.
+pub const DEFAULT_PRESELECTION_THRESHOLD: f32 = 0.2;
+
/// Binary expression
-#[derive(Debug, Clone, Eq)]
+#[derive(Debug, Clone)]
pub struct BinaryExpr {
left: Arc<dyn PhysicalExpr>,
op: Operator,
right: Arc<dyn PhysicalExpr>,
/// Specifies whether an error is returned on overflow or not
fail_on_overflow: bool,
+ /// Threshold ratio (0.0 to 1.0) for pre-selection optimization in AND
operations.
+ /// When the ratio of true values in the LHS is <= this threshold,
pre-selection is applied.
+ /// Set to 0.0 to disable pre-selection, or 1.0 to always enable it for
AND operations.
+ preselection_threshold: f32,
}
-// Manually derive PartialEq and Hash to work around
https://github.com/rust-lang/rust/issues/78808
+// Manually derive PartialEq, Eq and Hash to work around
https://github.com/rust-lang/rust/issues/78808
+// and because f32 doesn't implement Eq
impl PartialEq for BinaryExpr {
fn eq(&self, other: &Self) -> bool {
self.left.eq(&other.left)
&& self.op.eq(&other.op)
&& self.right.eq(&other.right)
&& self.fail_on_overflow.eq(&other.fail_on_overflow)
+ && self.preselection_threshold == other.preselection_threshold
}
}
+impl Eq for BinaryExpr {}
Review Comment:
The `Eq` implementation can behave incorrectly for NaN threshold values.
Right now it compares floats directly, but `NaN == NaN` is false, which
breaks the reflexivity requirement of `Eq`. Meanwhile, the `Hash`
implementation uses `to_bits`, which treats NaN as equal to itself, so the two
are inconsistent.
A couple of options here would be to reject NaN in
`with_preselection_threshold`, or to switch equality to use `to_bits()` so it
aligns with the hash behavior.
##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -89,16 +102,32 @@ impl BinaryExpr {
op,
right,
fail_on_overflow: false,
+ preselection_threshold: DEFAULT_PRESELECTION_THRESHOLD,
}
}
/// Create new binary expression with explicit fail_on_overflow value
pub fn with_fail_on_overflow(self, fail_on_overflow: bool) -> Self {
Self {
- left: self.left,
- op: self.op,
- right: self.right,
fail_on_overflow,
+ ..self
+ }
+ }
+
+ /// Set the pre-selection threshold for AND operations.
+ ///
+ /// When evaluating `lhs AND rhs`, if the ratio of true values in `lhs`
+ /// is less than or equal to this threshold, the RecordBatch will be
+ /// filtered before evaluating `rhs`, which can improve performance
+ /// when `rhs` is expensive to evaluate.
+ ///
+ /// - Set to `0.0` to disable pre-selection optimization
+ /// - Set to `1.0` to always apply pre-selection for AND operations
+ /// - Default is [`DEFAULT_PRESELECTION_THRESHOLD`] (0.2)
+ pub fn with_preselection_threshold(self, threshold: f32) -> Self {
Review Comment:
There is currently no input validation in `with_preselection_threshold`.
The docs say the value should be between 0.0 and 1.0, but the method accepts
anything, including NaN, infinity, or negative values. Some of these lead to
surprising behavior.
It would be helpful to at least add a debug assertion, or even return a
`Result` and validate the input.
##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -5091,47 +5143,150 @@ mod tests {
// All nulls shouldn't short-circuit for AND or OR
assert!(matches!(
- check_short_circuit(&null_value, &Operator::And),
+ check_short_circuit(
+ &null_value,
+ &Operator::And,
+ DEFAULT_PRESELECTION_THRESHOLD
+ ),
ShortCircuitStrategy::None
));
assert!(matches!(
- check_short_circuit(&null_value, &Operator::Or),
+ check_short_circuit(
+ &null_value,
+ &Operator::Or,
+ DEFAULT_PRESELECTION_THRESHOLD
+ ),
ShortCircuitStrategy::None
));
// Test with scalar values
// Scalar true
let scalar_true =
ColumnarValue::Scalar(ScalarValue::Boolean(Some(true)));
assert!(matches!(
- check_short_circuit(&scalar_true, &Operator::Or),
+ check_short_circuit(
+ &scalar_true,
+ &Operator::Or,
+ DEFAULT_PRESELECTION_THRESHOLD
+ ),
ShortCircuitStrategy::ReturnLeft
)); // Should short-circuit OR
assert!(matches!(
- check_short_circuit(&scalar_true, &Operator::And),
+ check_short_circuit(
+ &scalar_true,
+ &Operator::And,
+ DEFAULT_PRESELECTION_THRESHOLD
+ ),
ShortCircuitStrategy::ReturnRight
)); // Should return the RHS for AND
// Scalar false
let scalar_false =
ColumnarValue::Scalar(ScalarValue::Boolean(Some(false)));
assert!(matches!(
- check_short_circuit(&scalar_false, &Operator::And),
+ check_short_circuit(
+ &scalar_false,
+ &Operator::And,
+ DEFAULT_PRESELECTION_THRESHOLD
+ ),
ShortCircuitStrategy::ReturnLeft
)); // Should short-circuit AND
assert!(matches!(
- check_short_circuit(&scalar_false, &Operator::Or),
+ check_short_circuit(
+ &scalar_false,
+ &Operator::Or,
+ DEFAULT_PRESELECTION_THRESHOLD
+ ),
ShortCircuitStrategy::ReturnRight
)); // Should return the RHS for OR
// Scalar null
let scalar_null = ColumnarValue::Scalar(ScalarValue::Boolean(None));
assert!(matches!(
- check_short_circuit(&scalar_null, &Operator::And),
+ check_short_circuit(
+ &scalar_null,
+ &Operator::And,
+ DEFAULT_PRESELECTION_THRESHOLD
+ ),
+ ShortCircuitStrategy::None
+ ));
+ assert!(matches!(
+ check_short_circuit(
+ &scalar_null,
+ &Operator::Or,
+ DEFAULT_PRESELECTION_THRESHOLD
+ ),
+ ShortCircuitStrategy::None
+ ));
+ }
+
+ #[test]
+ fn test_preselection_threshold() {
Review Comment:
The test for `preselection_threshold` does not exercise the full `evaluate`
path.
It would be great to add a test that builds a `BinaryExpr` with a custom
threshold, runs `evaluate` on a batch with mixed values, and verifies that the
RHS is only evaluated on the expected rows. This would also help catch issues
like the `with_new_children` behavior.
##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -49,31 +49,44 @@ use kernels::{
concat_elements_utf8view, regex_match_dyn, regex_match_dyn_scalar,
};
+/// Default threshold for pre-selection optimization in AND operations.
+/// When the ratio of true values in the left-hand side is below this
threshold,
+/// the RecordBatch will be filtered before evaluating the right-hand side.
+pub const DEFAULT_PRESELECTION_THRESHOLD: f32 = 0.2;
Review Comment:
Now that `DEFAULT_PRESELECTION_THRESHOLD` is public, it might be worth
adding a small note in the docs that this value is not guaranteed to remain
stable across releases.
That helps avoid downstream code accidentally depending on the exact value.
##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -49,31 +49,44 @@ use kernels::{
concat_elements_utf8view, regex_match_dyn, regex_match_dyn_scalar,
};
+/// Default threshold for pre-selection optimization in AND operations.
+/// When the ratio of true values in the left-hand side is below this
threshold,
Review Comment:
There is no obvious way for users to configure this via `SessionConfig` or a
query level hint, which you already noted is an architectural limitation.
Given that, the current builder approach makes sense. It might help to
expand the docs with a short usage example showing how callers are expected to
apply this, for example by transforming expressions on the physical plan.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]