andygrove opened a new issue, #21515:
URL: https://github.com/apache/datafusion/issues/21515

   ### Describe the bug
   
   The `datafusion-spark` `format_string` implementation produces different 
results from Apache Spark for `%t` timestamp format specifiers (`%tH`, `%tM`, 
`%tS`, `%tY`, etc.).
   
   Spark passes the raw timestamp microseconds to Java's `Formatter`, which 
interprets them as milliseconds. This is a known Spark quirk, but 
`datafusion-spark` should match Spark's behavior for compatibility.
   
   ### To Reproduce
   
   **PySpark (Spark behavior):**
   ```sql
   SELECT format_string('Hour: %tH', TIMESTAMP '2023-12-25 14:30:45');
   -- Hour: 10
   
   SELECT format_string('Year: %tY', TIMESTAMP '2023-12-25 14:30:45');
   -- Year: 55952
   
   SELECT format_string('Second: %tS', TIMESTAMP '2023-12-25 14:30:45');
   -- Second: 00
   ```
   
   Spark stores timestamps as microseconds since epoch, but Java's `Formatter` 
interprets the value as milliseconds, so every `%t` field is computed from an 
instant 1000x too far from the epoch. The results are surprising but 
consistent.
   
   **DataFusion-spark (current behavior):**
   
   The `.slt` tests at `string/format_string.slt` expect "correct" results such 
as `Hour: 14` and `Year: 2023`, which suggests that the DataFusion-spark 
implementation formats the actual timestamp rather than matching Spark's 
behavior.
   
   ### Expected behavior
   
   `datafusion-spark` should match Spark's output for `%t` format specifiers, 
even though Spark's behavior is arguably a bug. Spark compatibility is the goal.
   
   | Expression | Spark | datafusion-spark should match |
   |---|---|---|
   | `format_string('Hour: %tH', TIMESTAMP '2023-12-25 14:30:45')` | `Hour: 10` | `Hour: 10` |
   | `format_string('Year: %tY', TIMESTAMP '2023-12-25 14:30:45')` | `Year: 55952` | `Year: 55952` |
   | `format_string('Second: %tS', TIMESTAMP '2023-12-25 14:30:45')` | `Second: 00` | `Second: 00` |
   
   ### Additional context
   
   **Root cause in Spark:** Spark stores timestamps internally as microseconds 
since epoch. When `format_string` encounters a `%t` specifier, it passes the 
raw microsecond long value to Java's `java.util.Formatter`, which interprets it 
as milliseconds since epoch. This 1000x mismatch produces wrong-looking but 
consistent results.
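
The arithmetic behind the mismatch can be checked with a short sketch. This is plain Python, not Spark or DataFusion code. The helper names (`to_epoch_micros`, `year_from_epoch_days`, `misread_micros_as_millis`) are made up for illustration, and the timestamp literal is assumed to be UTC. The year is derived with integer calendar arithmetic because Python's `datetime` cannot represent years beyond 9999.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_micros(ts):
    """Microseconds since the Unix epoch (the unit the issue says Spark stores)."""
    return (ts - EPOCH) // timedelta(microseconds=1)


def year_from_epoch_days(days):
    """Proleptic Gregorian year for a day count since 1970-01-01.

    Integer-only days-to-civil conversion, so it also works for dates
    far beyond datetime.MAXYEAR.
    """
    z = days + 719_468                      # shift epoch to 0000-03-01
    era = z // 146_097                      # 400-year cycles
    doe = z - era * 146_097                 # day of era
    yoe = (doe - doe // 1_460 + doe // 36_524 - doe // 146_096) // 365
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)  # day of March-based year
    month_index = (5 * doy + 2) // 153      # 0 = March, ..., 11 = February
    year = yoe + era * 400
    return year + 1 if month_index >= 10 else year   # Jan/Feb belong to next year


def misread_micros_as_millis(micros):
    """Fields obtained when a microsecond count is treated as milliseconds."""
    seconds = micros // 1_000               # the 1000x mismatch happens here
    return {
        "year": year_from_epoch_days(seconds // 86_400),
        "second": seconds % 60,
    }


ts = datetime(2023, 12, 25, 14, 30, 45, tzinfo=timezone.utc)  # assumed UTC
micros = to_epoch_micros(ts)

print(year_from_epoch_days(micros // 1_000_000 // 86_400))  # correct unit
print(misread_micros_as_millis(micros))                     # misread unit
```

Under these assumptions the misread value lands in year 55952 with a seconds field of 0, which agrees with the `%tY` and `%tS` outputs reported above. The hour is left out on purpose: it depends on the time zone used when formatting, so a UTC-only calculation would not necessarily reproduce `Hour: 10`.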
   
   The `.slt` tests in `string/format_string.slt` (lines 356-579) have ~33 
expected values that need to be updated to match Spark's actual output. The 
expected values appear to have been generated by DataFusion's implementation 
rather than verified against Spark.
   
   This was discovered by running a PySpark validation script against the 
`.slt` test files (see #17045, #21508).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]