parthchandra commented on code in PR #3892:
URL: https://github.com/apache/datafusion-comet/pull/3892#discussion_r3067242152
##########
spark/src/test/resources/sql-tests/expressions/aggregate/corr.sql:
##########
@@ -29,3 +28,13 @@ SELECT corr(x, y) FROM test_corr
query tolerance=1e-6
SELECT grp, corr(x, y) FROM test_corr GROUP BY grp ORDER BY grp
+
+-- Test permutations of NULL and NaN
+statement
+CREATE TABLE test_corr_nan(x double, y double, grp string) USING parquet
+
+statement
+INSERT INTO test_corr_nan VALUES (cast('NaN' as double), cast('NaN' as
double), 'both_nan'), (cast('NaN' as double), 1.0, 'nan_val'), (1.0, cast('NaN'
as double), 'val_nan'), (NULL, cast('NaN' as double), 'null_nan'), (cast('NaN'
as double), NULL, 'nan_null'), (NULL, NULL, 'both_null'), (NULL, 1.0,
'null_val'), (1.0, NULL, 'val_null')
Review Comment:
Maybe add a group with mixed nan and valid rows ( eg ` [(NaN, NaN), (1.0,
2.0), (3.0, 4.0)]` )
##########
spark/src/main/scala/org/apache/comet/serde/aggregates.scala:
##########
@@ -584,20 +584,18 @@ object CometStddevPop extends
CometAggregateExpressionSerde[StddevPop] with Come
}
object CometCorr extends CometAggregateExpressionSerde[Corr] {
-
- override def getSupportLevel(expr: Corr): SupportLevel =
Review Comment:
Claude flagged some edge cases we can document -
```
▎ 1. Legacy mode: When spark.sql.legacy.statisticalAggregate=true,
nullOnDivideByZero is false and Spark returns NaN for the n=1 case. With this
workaround, Comet would return null instead (because the NaN row gets skipped →
n=0). Should we add a getSupportLevel guard that returns Incompatible when
corr.nullOnDivideByZero is false? Or at least document this?
▎ 2. Mixed groups: For a group containing (NaN, NaN) alongside valid pairs
like (1.0, 2.0), Spark returns NaN (NaN contaminates the accumulator), while
this workaround would skip the NaN row and compute a valid correlation over the
remaining rows. Is that a known limitation we're OK with?
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]