kosiew commented on code in PR #22628:
URL: https://github.com/apache/datafusion/pull/22628#discussion_r3360204942
##########
datafusion/physical-expr-common/src/datum.rs:
##########
@@ -84,7 +90,147 @@ pub fn apply_cmp(
}
};
- apply(lhs, rhs, |l, r| Ok(Arc::new(f(l, r)?)))
+ let lhs = normalize_neg_zero(lhs);
+ let rhs = normalize_neg_zero(rhs);
+ apply(&lhs, &rhs, |l, r| Ok(Arc::new(f(l, r)?)))
+ }
+}
+
+/// Replace `-0.0` with `+0.0` on float inputs so that comparison operators
+/// follow IEEE 754 default semantics (where `-0.0 == +0.0`).
+///
+/// arrow-rs' comparison kernels intentionally use totalOrder semantics, which
+/// treats `-0.0` as strictly less than `+0.0` and not equal to it
+///
+/// See [`normalize_neg_zero_array`] / [`normalize_neg_zero_scalar`] for the
+/// per-variant behavior, including dictionary- and run-end-encoded arrays.
+pub fn normalize_neg_zero(value: &ColumnarValue) -> ColumnarValue {
+ match value {
+ ColumnarValue::Array(array) => {
+ ColumnarValue::Array(normalize_neg_zero_array(array))
+ }
+ ColumnarValue::Scalar(scalar) => {
+ ColumnarValue::Scalar(normalize_neg_zero_scalar(scalar))
+ }
+ }
+}
+
+/// Array variant of [`normalize_neg_zero`]. Returns the input unchanged for
+/// arrays that don't contain `-0.0` (no allocation) and for arrays whose
+/// (transitive) value type is not floating-point.
+///
+/// Dictionary- and run-end-encoded arrays are peeled to their inner values
+/// and rebuilt with normalized values when needed; the keys/run-ends are
+/// preserved.
+pub fn normalize_neg_zero_array(array: &ArrayRef) -> ArrayRef {
+ if !data_type_contains_float(array.data_type()) {
+ return Arc::clone(array);
+ }
+
+ match array.data_type() {
+ DataType::Float16 => {
Review Comment:
Small readability suggestion: the Float16, Float32, and Float64 arms all
repeat the same pattern of scanning for negative zero and then
unary-normalizing zero. A small helper or macro for primitive floats might make
this invariant easier to maintain and reduce the chance that one arm gets
updated differently later.
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -370,7 +377,8 @@ impl PhysicalExpr for InListExpr {
if lhs_supports_arrow_eq
&& supports_arrow_eq(array.data_type())
{
- Ok(arrow_eq(&value, &array)?)
+ let rhs = normalize_neg_zero_array(&array);
+ Ok(arrow_eq(&value_normalized, &rhs)?)
} else {
let cmp = make_comparator(
Review Comment:
I think the fallback `IN` path still has an issue here. It builds
`make_comparator` from the original `value` and `array`, so types that skip
Arrow `eq`, such as the `RunEndEncoded` case mentioned in the comment, can
still end up using Arrow totalOrder semantics and treat `-0.0` as different
from `+0.0`.
Since this PR adds REE and dictionary-aware normalization helpers, could we
use `value_normalized` plus a normalized RHS in both fallback comparator
branches too? It would also be great to add a regression test for a REE float
`IN` input, assuming that type is supported in this path.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]