andygrove opened a new issue, #21510:
URL: https://github.com/apache/datafusion/issues/21510

   ### Describe the bug
   
   The `datafusion-spark` implementation of `substring` does not match Apache 
Spark behavior when a negative start position reaches back past the beginning 
of the string (its absolute value exceeds the string length). 
`datafusion-spark` clamps the start to position 1 and returns the full 
requested `length` characters, while Spark shortens the result by the number of 
positions the start falls before position 1.
   
   This was discovered by running a PySpark validation script against the 
`.slt` test files (see #17045, #21508).
   
   ### To Reproduce
   
   The `.slt` test at 
`datafusion/sqllogictest/test_files/spark/string/substring.slt` line 138 
contains:
   
   ```sql
   SELECT substring('Spark SQL', -300, 3);
   ```
   
   The test expects `Spa`, but Apache Spark returns an empty string.
   
   ### Expected behavior
   
   `substring` should match Spark's semantics for negative start positions:
   
   | Expression | Spark result | datafusion-spark result |
   |---|---|---|
   | `substring('Spark SQL', -9, 3)` | `Spa` | `Spa` ✓ |
   | `substring('Spark SQL', -10, 3)` | `Sp` | (likely `Spa`) |
   | `substring('Spark SQL', -11, 3)` | `S` | (likely `Spa`) |
   | `substring('Spark SQL', -12, 3)` | `` (empty) | (likely `Spa`) |
   | `substring('Spark SQL', -300, 3)` | `` (empty) | `Spa` ✗ |
   
   Spark's behavior: for negative `start`, the effective position is `len(str) 
+ start + 1`. When this position is less than 1, the result is shortened by the 
number of positions it falls before position 1. When the effective position 
plus `length` is at most 1, the result is empty.
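
The rule above can be modeled in a few lines of Python. This is only an illustrative sketch derived from the table in this issue, not Spark's or DataFusion's actual code; the function name `spark_substring` is hypothetical, and the handling of `pos = 0` and negative `length` is an assumption.

```python
def spark_substring(s: str, pos: int, length: int) -> str:
    """Illustrative model of Spark-style substring (hypothetical helper).

    `pos` is 1-based; a negative `pos` counts back from the end of `s`.
    """
    n = len(s)
    # Convert to a 0-based start index. For a negative pos this may itself
    # be negative, meaning the window begins before the string does.
    if pos > 0:
        start = pos - 1
    elif pos < 0:
        start = n + pos
    else:
        start = 0  # assumption: pos = 0 is treated like the first character
    end = start + length  # exclusive end of the requested window

    # Keep only the part of the window that overlaps the string, so any
    # overshoot before the first character shortens the result.
    lo = max(start, 0)
    hi = min(end, n)
    return s[lo:hi] if hi > lo else ""


for pos in (-9, -10, -11, -12, -300):
    print(pos, repr(spark_substring("Spark SQL", pos, 3)))
```

The key point is that the window is computed first and clipped afterwards. Clamping `start` to the first character before adding `length` is what produces `Spa` for every row.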
   
   ### Additional context
   
   The same bug affects `substr` (alias for `substring`). The corresponding 
`.slt` test at line 189 also has wrong expected values for the same reason.
   
   The `.slt` expected values at lines 138 and 189 will need to be updated 
along with the implementation fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

