parthchandra opened a new pull request, #3922: URL: https://github.com/apache/datafusion-comet/pull/3922
## Which issue does this PR close?

Closes https://github.com/apache/datafusion-comet/issues/325.

## Rationale for this change

Makes casting from string to decimal Spark-compatible.

## What changes are included in this PR?

Spark-compatible handling of strings containing:

**Fullwidth digits (U+FF10–U+FF19)**

Spark treats fullwidth digits as equivalent to their ASCII counterparts: `"１２３.４５"` parses as `123.45`. We previously used `.is_ascii_digit()`, which rejected these as non-ASCII bytes. This PR adds a `normalize_fullwidth_digits()` function that scans the UTF-8 byte stream for the 3-byte fullwidth-digit pattern `[0xEF, 0xBC, 0x90+n]` and replaces each occurrence with the corresponding ASCII byte `0x30+n`. A pure-ASCII fast path skips the allocation in the common case.

**Null bytes (`\u0000`)**

Spark's `UTF8String` trims null bytes from both ends before parsing: `"123\u0000"` and `"\u0000123"` both parse as `123`, while null bytes in the middle produce `NULL`. We now trim `0x00` from both ends; middle-position null bytes already fall through to `NULL` via the existing `is_ascii_digit()` check.

**Negative scale (`spark.sql.legacy.allowNegativeScaleOfDecimal`)**

When this legacy config is enabled, Spark allows `DECIMAL(p, s)` where `s < 0`, rounding values to the nearest `10^|s|`. This case is already handled correctly; a new test covers it.

This PR also marks the cast as compatible and updates the compatibility guide.

## How are these changes tested?

Unit tests.
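The fullwidth-digit normalization described above can be sketched as follows. `normalize_fullwidth_digits` is the name used in this PR, but the signature shown here (borrowing via `Cow<str>` for the ASCII fast path) is an assumption for illustration, not the actual Comet implementation:

```rust
use std::borrow::Cow;

/// Replace fullwidth digits (U+FF10..=U+FF19, encoded in UTF-8 as the bytes
/// [0xEF, 0xBC, 0x90 + n]) with their ASCII equivalents (0x30 + n).
/// NOTE: sketch only; the real signature in Comet may differ.
fn normalize_fullwidth_digits(s: &str) -> Cow<'_, str> {
    // Fast path: pure-ASCII input needs no allocation.
    if s.is_ascii() {
        return Cow::Borrowed(s);
    }
    let bytes = s.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        // Match the 3-byte fullwidth-digit pattern.
        if i + 2 < bytes.len()
            && bytes[i] == 0xEF
            && bytes[i + 1] == 0xBC
            && (0x90..=0x99).contains(&bytes[i + 2])
        {
            out.push(bytes[i + 2] - 0x90 + b'0');
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    // Replacing a whole 3-byte sequence with one ASCII byte keeps the
    // output valid UTF-8.
    Cow::Owned(String::from_utf8(out).expect("valid UTF-8"))
}

fn main() {
    assert_eq!(normalize_fullwidth_digits("１２３.４５"), "123.45");
    // ASCII input is returned unchanged via the fast path.
    assert_eq!(normalize_fullwidth_digits("123.45"), "123.45");
    println!("{}", normalize_fullwidth_digits("１２３.４５"));
}
```

Working on raw bytes rather than decoded `char`s keeps the scan allocation-free for ASCII input and avoids a full UTF-8 decode for mixed input.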

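The null-byte trimming behavior can be sketched with the standard library; `trim_nulls` is a hypothetical helper for illustration, not code from this PR:

```rust
/// Strip U+0000 from both ends, mirroring Spark's UTF8String behavior
/// before numeric parsing. Interior null bytes are left in place and
/// later fail the digit check, yielding NULL.
/// NOTE: illustrative helper; not the actual Comet function.
fn trim_nulls(s: &str) -> &str {
    s.trim_matches('\u{0}')
}

fn main() {
    // Leading/trailing nulls are removed, so "123\u{0}" parses as 123.
    assert_eq!(trim_nulls("\u{0}123\u{0}"), "123");
    // A null in the middle survives trimming and produces NULL downstream.
    assert_eq!(trim_nulls("12\u{0}3"), "12\u{0}3");
    println!("ok");
}
```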