[
https://issues.apache.org/jira/browse/SPARK-49858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937865#comment-17937865
]
Daniel Himmelstein commented on SPARK-49858:
--------------------------------------------
[~lamtran] I like the test cases you propose in the pull request:
{code:java}
test("SPARK-49858: inferring the schema with default timestampFormat") {
  checkType(Map("inferTimestamp" -> "true"), """{"a": "23456"}""", StringType)
  checkType(Map("inferTimestamp" -> "true"), """{"a": "23456-01"}""", StringType)
  checkType(Map("inferTimestamp" -> "true"), """{"a": "2025-01-01"}""", StringType)
  checkType(
    Map("inferTimestamp" -> "true"),
    """{"a": "2025-01-01T00:00:00.123+01:00"}""",
    TimestampType)
}{code}
Inferring a digit-only string like "23456" as a timestamp by interpreting it as a year is definitely wrong. No user will want that, and if they did, they could specify the schema for that column themselves.

I appreciate your time looking into this issue. I don't know the best solution, since it also relates to whether date fields are inferred (I see a preferDate CSV option).
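The traceback quoted below can be reproduced without Spark. A minimal pure-Python sketch, assuming the string is inferred as the year 23456 and stored as microseconds since the epoch (the arithmetic here is a rough illustration, not Spark's internal conversion):

{code:python}
import datetime

# "23456" inferred as year 23456 is far beyond datetime's supported
# range (datetime.MAXYEAR is 9999). Approximate the microseconds-since-
# epoch value Spark would hold for that year (illustrative only):
micros_per_year = 365 * 24 * 3600 * 1_000_000
ts = (23456 - 1970) * micros_per_year

# Mirror TimestampType.fromInternal: convert microseconds back with
# fromtimestamp; this fails once the year exceeds datetime's range.
try:
    datetime.datetime.fromtimestamp(ts // 1_000_000)
    error = None
except (ValueError, OverflowError, OSError) as exc:
    error = exc

print(type(error).__name__, error)  # on CPython/Linux: a "year ... is out of range" ValueError
{code}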
> Pyspark JSON reader incorrectly considers a string of digits a timestamp and
> fails
> ----------------------------------------------------------------------------------
>
> Key: SPARK-49858
> URL: https://issues.apache.org/jira/browse/SPARK-49858
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Daniel Himmelstein
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2025-03-22-17-20-24-495.png,
> image-2025-03-22-17-23-17-473.png
>
>
> With pyspark 3.5.0, reading the following JSON fails:
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName("timestamp_test").getOrCreate()
> data = spark.sparkContext.parallelize(['{"field" : "23456"}'])
> df = (
>     spark.read.option("inferTimestamp", True)
>     # .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")
>     .json(path=data)
> )
> df.printSchema()
> df.collect()
> {code}
> The printSchema command shows that the field is parsed as a timestamp,
> causing the following error:
> {code:python}
> File ~/miniforge3/envs/facets/lib/python3.11/site-packages/pyspark/sql/types.py:282, in TimestampType.fromInternal(self, ts)
>     279 def fromInternal(self, ts: int) -> datetime.datetime:
>     280     if ts is not None:
>     281         # using int to avoid precision loss in float
> --> 282         return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
> ValueError: year 23455 is out of range
> {code}
> If we uncomment the timestampFormat option, the command succeeds.
> I believe there are two issues:
> # a string of digits longer than 4 characters is inferred to be a timestamp
> # setting timestampFormat to the default given in [the documentation|https://spark.apache.org/docs/3.5.0/sql-data-sources-json.html] fixes the problem, which means the documented default is not the actual default.
> This might be related to SPARK-45424.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)