andygrove opened a new issue, #2611:
URL: https://github.com/apache/datafusion-comet/issues/2611

   ### What is the problem the feature request solves?
   
   ## Background
   
   The main goal of Comet is to accelerate Spark jobs and produce the same 
results that Spark would have produced.
   
   In some cases, there will always be known differences, and in those cases we 
should disable Comet by default but allow users to opt in and make sure that we 
document how Comet differs from Spark. Once example is regular expressions. 
Rust and Java have different regular expression engines and there will be 
differences in behavior in some cases.
   
   ## Limitations of Current Testing Approaches
   
   ### Too much use of makeParquetFileAllPrimitiveTypes
   
   Early versions of Comet only accelerated Parquet reads. `CometTestBase` has 
a method `makeParquetFileAllPrimitiveTypes` that is very specific to testing 
Parquet (using different combinations of logical and physical type, for 
example) but does not generate values that would be edge cases for testing 
expressions. For example, it does not generate negative floating point zero or 
min/max values for each numeric type. Over time, many tests for expressions 
continued to use `makeParquetFileAllPrimitiveTypes`, resulting in limited 
testing for edge cases.
   
   ### Hand Written Tests
   
   We write tests manually, with numerous testing styles. These tests often do 
not attempt to cover edge cases but just test for basic use cases.
   
   ### Limited Testing for Exceptions
   
   We generally evaluate expressions against multiple rows at a time. If one 
value is invalid then an exception will be thrown, but we are not testing that 
all invalid values would have been caught, or that the valid values would have 
returned the correct results. For expressions that are fallible, we should be 
testing one value at a time for correct handling of invalid values (in addition 
to testing multiple rows for valid cases).
   
   ### No code coverage or test coverage reporting
   
   How do we know if we have tests for all expressions? We would currently have 
to manually audit the code base.
   
   ## Proposed Improvements
   
   Now that we have mostly moved expression serde to the new framework, we have 
a list of supported expressions:
   
   ```scala
     val exprSerdeMap: Map[Class[_ <: Expression], CometExpressionSerde[_]] =
       mathExpressions ++ hashExpressions ++ stringExpressions ++
         conditionalExpressions ++ mapExpressions ++ predicateExpressions ++
         structExpressions ++ bitwiseExpressions ++ miscExpressions ++ 
arrayExpressions ++
         temporalExpressions ++ conversionExpressions
   ```
   
   I would like to propose that we take some of the ideas from 
[CometFuzz](https://github.com/apache/datafusion-comet/tree/main/fuzz-testing) 
and the current tests that extend `CometFuzzTestBase` and generate tests for 
each supported expression.
   
   We may need to add new methods to the `CometExpressionSerde` trait to help 
with testing. For example, we may need to add information about expression 
signatures and the types of inputs they can accept so that we can generate 
tests for all cases.
   
   There are multiple approaches to how we could implement this, and we will 
need to try some different ideas out, so I haven't added a specific proposal 
here. I have created a Google Doc where we can collaborate. We can also discuss 
in the comments in this issue.
   
   
https://docs.google.com/document/d/1hZ7osyfALVdjxL-3UzzhdTclM1j972bYSK7M6CuEFwg/edit?usp=sharing
 
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to