parthchandra opened a new pull request, #3922:
URL: https://github.com/apache/datafusion-comet/pull/3922

   ## Which issue does this PR close?
   
   Closes https://github.com/apache/datafusion-comet/issues/325.
   
   ## Rationale for this change
   Makes string-to-decimal casts Spark-compatible.
   
   
   ## What changes are included in this PR?
   
   Spark-compatible implementations for strings containing:

   **Fullwidth digits (U+FF10–U+FF19)**

   Spark treats fullwidth digits as numeric equivalents: `"１２３.４５"` parses as `123.45`. We previously used `.is_ascii_digit()`, which rejected these as non-ASCII bytes.

   This adds a `normalize_fullwidth_digits()` function that scans the UTF-8 byte stream for the 3-byte fullwidth digit pattern `[0xEF, 0xBC, 0x90+n]` and replaces each occurrence with the corresponding ASCII byte `0x30+n`. A pure-ASCII fast path skips the allocation for the common case.

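   A minimal sketch of this normalization (the actual function in the PR may differ in signature and details):

   ```rust
   use std::borrow::Cow;

   // Sketch: map fullwidth digits U+FF10..U+FF19 (UTF-8: [0xEF, 0xBC, 0x90 + n])
   // to the corresponding ASCII digits 0x30 + n, leaving all other bytes intact.
   fn normalize_fullwidth_digits(s: &str) -> Cow<'_, str> {
       // Fast path: pure-ASCII input needs no allocation.
       if s.is_ascii() {
           return Cow::Borrowed(s);
       }
       let bytes = s.as_bytes();
       let mut out = Vec::with_capacity(bytes.len());
       let mut i = 0;
       while i < bytes.len() {
           if i + 2 < bytes.len()
               && bytes[i] == 0xEF
               && bytes[i + 1] == 0xBC
               && (0x90..=0x99).contains(&bytes[i + 2])
           {
               // Replace the 3-byte fullwidth digit with its ASCII digit.
               out.push(bytes[i + 2] - 0x90 + b'0');
               i += 3;
           } else {
               out.push(bytes[i]);
               i += 1;
           }
       }
       // Replacing whole UTF-8 sequences with ASCII keeps the buffer valid UTF-8;
       // use the checked conversion here since this is only a sketch.
       Cow::Owned(String::from_utf8(out).expect("valid UTF-8"))
   }
   ```

   The `Cow` return type makes the fast path explicit: ASCII input borrows the original string, and only inputs containing non-ASCII bytes pay for a new allocation.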
   **Null bytes (`\u0000`)**

   Spark's `UTF8String` trims null bytes from both ends before parsing: `"123\u0000"` and `"\u0000123"` both parse as `123`. Null bytes in the middle produce `NULL`.

   We now trim `0x00` from both ends. Middle-position null bytes already fall through to `NULL` via the existing `is_ascii_digit()` check.

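   The trimming step can be sketched in one line (helper name hypothetical; the PR's actual code may inline this differently). Only leading and trailing NUL bytes are removed, so interior ones still reach the digit check and yield `NULL`:

   ```rust
   // Sketch: strip \u{0000} from both ends only, mirroring the behavior
   // described above. Interior NULs are deliberately left in place.
   fn trim_nulls(s: &str) -> &str {
       s.trim_matches('\u{0000}')
   }
   ```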
   **Negative scale (`spark.sql.legacy.allowNegativeScaleOfDecimal`)**

   When this legacy config is enabled, Spark allows `DECIMAL(p, s)` where `s < 0`, rounding values to the nearest `10^|s|`. This is already handled correctly; this PR adds a new test.
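   For illustration, the negative-scale rounding behaves roughly like the sketch below (function name hypothetical; this assumes half-up rounding, which is what Spark's decimal casts use):

   ```rust
   // Sketch: round an integer value to the nearest 10^|s| for a negative
   // scale s, using half-up rounding.
   fn round_to_negative_scale(value: i128, scale: i32) -> i128 {
       assert!(scale < 0);
       let pow = 10_i128.pow((-scale) as u32);
       // Bias by half the step so truncating division rounds half away from zero.
       let half = pow / 2;
       let adjusted = if value >= 0 { value + half } else { value - half };
       (adjusted / pow) * pow
   }
   ```

   For example, with `DECIMAL(p, -1)` a value of `1234` rounds to `1230` and `1235` rounds to `1240`.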
   
   Also, this PR marks the cast as compatible and updates the compatibility 
guide.
   
   ## How are these changes tested?
   
   Unit tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

