parthchandra commented on code in PR #3892:
URL: https://github.com/apache/datafusion-comet/pull/3892#discussion_r3067242152


##########
spark/src/test/resources/sql-tests/expressions/aggregate/corr.sql:
##########
@@ -29,3 +28,13 @@ SELECT corr(x, y) FROM test_corr
 
 query tolerance=1e-6
 SELECT grp, corr(x, y) FROM test_corr GROUP BY grp ORDER BY grp
+
+-- Test permutations of NULL and NaN
+statement
+CREATE TABLE test_corr_nan(x double, y double, grp string) USING parquet
+
+statement
+INSERT INTO test_corr_nan VALUES (cast('NaN' as double), cast('NaN' as 
double), 'both_nan'), (cast('NaN' as double), 1.0, 'nan_val'), (1.0, cast('NaN' 
as double), 'val_nan'), (NULL, cast('NaN' as double), 'null_nan'), (cast('NaN' 
as double), NULL, 'nan_null'), (NULL, NULL, 'both_null'), (NULL, 1.0, 
'null_val'), (1.0, NULL, 'val_null')

Review Comment:
   
   Maybe add a group with mixed nan and valid rows ( eg ` [(NaN, NaN), (1.0, 
2.0), (3.0, 4.0)]` )



##########
spark/src/main/scala/org/apache/comet/serde/aggregates.scala:
##########
@@ -584,20 +584,18 @@ object CometStddevPop extends 
CometAggregateExpressionSerde[StddevPop] with Come
 }
 
 object CometCorr extends CometAggregateExpressionSerde[Corr] {
-
-  override def getSupportLevel(expr: Corr): SupportLevel =

Review Comment:
   Claude flagged some edge cases we can document -
   ```
    ▎ 1. Legacy mode: When spark.sql.legacy.statisticalAggregate=true, 
nullOnDivideByZero is false and Spark returns NaN for the n=1 case. With this 
workaround, Comet would return null instead (because the NaN row gets skipped → 
n=0). Should we add a getSupportLevel guard that returns Incompatible when
     corr.nullOnDivideByZero is false? Or at least document this?
    ▎ 2. Mixed groups: For a group containing (NaN, NaN) alongside valid pairs 
like (1.0, 2.0), Spark returns NaN (NaN contaminates the accumulator), while 
this workaround would skip the NaN row and compute a valid correlation over the 
remaining rows. Is that a known limitation we're OK with?
    ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to