[ https://issues.apache.org/jira/browse/SPARK-49858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937865#comment-17937865 ]

Daniel Himmelstein commented on SPARK-49858:
--------------------------------------------

[~lamtran] I like the test cases you propose in the pull request:

 
{code:java}
test("SPARK-49858: inferring the schema with default timestampFormat") {
  checkType(Map("inferTimestamp" -> "true"), """{"a": "23456"}""", StringType)
  checkType(Map("inferTimestamp" -> "true"), """{"a": "23456-01"}""", StringType)
  checkType(Map("inferTimestamp" -> "true"), """{"a": "2025-01-01"}""", StringType)
  checkType(
    Map("inferTimestamp" -> "true"),
    """{"a": "2025-01-01T00:00:00.123+01:00"}""",
    TimestampType)
}{code}
 

Inferring a bare digit string like {{"23456"}} to be a timestamp by interpreting it as a year is definitely wrong. No user will want that, and anyone who did could specify the schema for that column themselves.
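The underlying failure mode is easy to see outside Spark: Python datetimes only go up to year 9999, so a value inferred as a far-future year can never be materialized on the Python side. A minimal stdlib sketch (the epoch arithmetic below is my own rough approximation of the out-of-range value, not Spark's internal representation):

```python
import datetime

# Python datetimes support years 1..9999 only.
print(datetime.MAXYEAR)  # 9999

# Rough seconds-since-epoch for a date around year 23455, far past MAXYEAR.
seconds = (23455 - 1970) * 365 * 86_400

try:
    # Mirrors what TimestampType.fromInternal attempts; must fail for this value.
    datetime.datetime.fromtimestamp(seconds)
except (ValueError, OverflowError, OSError) as exc:
    print("conversion failed:", type(exc).__name__)
```

So once the schema inference decides the column is a timestamp, the collect step is guaranteed to blow up; the only real fix is to not infer it in the first place.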

Thanks for taking the time to look into this issue. I don't know the best solution, since it also relates to whether date fields are inferred (I see a {{preferDate}} CSV option).
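For what it's worth, the expectations in the proposed test line up with what the documented default format would accept. A hedged sketch, approximating the documented default {{timestampFormat}} ({{yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]}}) with a regex of my own; this is illustrative only, not Spark's actual parser:

```python
import re

# My own approximation of the documented default timestampFormat:
# yyyy-MM-dd'T'HH:mm:ss with optional .SSS millis and optional XXX offset.
DEFAULT_TS = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d{3})?([+-]\d{2}:\d{2}|Z)?$"
)

for s in ["23456", "23456-01", "2025-01-01", "2025-01-01T00:00:00.123+01:00"]:
    print(s, bool(DEFAULT_TS.match(s)))
# 23456 False
# 23456-01 False
# 2025-01-01 False
# 2025-01-01T00:00:00.123+01:00 True
```

Under that reading, only the full ISO-style string should infer as {{TimestampType}}, which is exactly what the test asserts.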

 

> Pyspark JSON reader incorrectly considers a string of digits a timestamp and 
> fails
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-49858
>                 URL: https://issues.apache.org/jira/browse/SPARK-49858
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Daniel Himmelstein
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2025-03-22-17-20-24-495.png, 
> image-2025-03-22-17-23-17-473.png
>
>
> With pyspark 3.5.0, reading the following JSON will fail:
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName("timestamp_test").getOrCreate()
> data = spark.sparkContext.parallelize(['{"field" : "23456"}'])
> df = (
>     spark.read.option("inferTimestamp", True)
>     # .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")
>     .json(path=data)
> )
> df.printSchema()
> df.collect()
> {code}
> The printSchema command shows that the field is parsed as a timestamp, 
> causing the following error:
> {code:java}
> File ~/miniforge3/envs/facets/lib/python3.11/site-packages/pyspark/sql/types.py:282, in TimestampType.fromInternal(self, ts)
>     279 def fromInternal(self, ts: int) -> datetime.datetime:
>     280     if ts is not None:
>     281         # using int to avoid precision loss in float
> --> 282         return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
> ValueError: year 23455 is out of range
> {code}
> If we uncomment the timestampFormat option, the command succeeds.
> I believe there are two issues:
>  # a string of digits with length > 4 is inferred to be a timestamp, and
>  # setting {{timestampFormat}} to the default given in [the documentation|https://spark.apache.org/docs/3.5.0/sql-data-sources-json.html] fixes the problem, which suggests the documented default is not the actual default.
> This might be related to SPARK-45424.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
