Re: [PR] fix(utils): Suppress pandas date parsing warnings in normalize_dttm_col [superset]

via GitHub Mon, 08 Sep 2025 08:51:17 -0700


korbit-ai[bot] commented on code in PR #35042:
URL: https://github.com/apache/superset/pull/35042#discussion_r2330659035



##########
superset/utils/pandas.py:
##########
@@ -0,0 +1,69 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Pandas utilities for data processing."""
+
+import pandas as pd
+
+
+def detect_datetime_format(series: pd.Series, sample_size: int = 100) -> str | None:
+    """
+    Detect the datetime format from a sample of the series.
+
+    :param series: The pandas Series to analyze
+    :param sample_size: Number of rows to sample for format detection
+    :return: Detected format string or None if no consistent format found
+    """
+    # Most common formats first for performance
+    common_formats = [
+        "%Y-%m-%d %H:%M:%S",
+        "%Y-%m-%d",
+        "%Y-%m-%dT%H:%M:%S",
+        "%Y-%m-%dT%H:%M:%SZ",
+        "%Y-%m-%dT%H:%M:%S.%f",
+        "%Y-%m-%dT%H:%M:%S.%fZ",
+        "%m/%d/%Y",
+        "%d/%m/%Y",
+        "%Y/%m/%d",
+        "%m/%d/%Y %H:%M:%S",
+        "%d/%m/%Y %H:%M:%S",
+        "%m-%d-%Y",
+        "%d-%m-%Y",
+        "%Y%m%d",
+    ]

Review Comment:
   ### Hardcoded datetime formats violate Open-Closed Principle <sub>![category Design](https://img.shields.io/badge/Design-0d9488)</sub>
   
   <details>
     <summary>Tell me more</summary>
   
   ###### What is the issue?
   The datetime format patterns are hardcoded within the function, making it 
difficult to extend or modify the supported formats without changing the 
function code.
   
   
   ###### Why this matters
   This violates the Open-Closed Principle (part of SOLID) and reduces 
flexibility. If new datetime formats need to be supported, the function must be 
modified rather than configured.
   
   ###### Suggested change ∙ *Feature Preview*
   Move the formats to a configuration that can be passed as a parameter or 
loaded from settings:
   ```python
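   def detect_datetime_format(
       series: pd.Series,
       sample_size: int = 100,
       formats: list[str] | None = None,
   ) -> str | None:
       # DEFAULT_DATETIME_FORMATS: assumed module-level constant holding the
       # list that is currently hardcoded in the function body
       common_formats = formats or DEFAULT_DATETIME_FORMATS
       # rest of the function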
   ```
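
A minimal sketch of how the configurable version could look end to end. `DEFAULT_DATETIME_FORMATS` is a hypothetical module-level constant (shortened here), and the detection loop is an assumption, since the rest of the function is not shown in this hunk; it simply returns the first candidate that parses every sampled value.

```python
from __future__ import annotations

from collections.abc import Sequence

import pandas as pd

# Hypothetical module-level default (shortened); the PR keeps a longer list
# inside the function body.
DEFAULT_DATETIME_FORMATS: tuple[str, ...] = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%m/%d/%Y",
    "%d/%m/%Y",
)


def detect_datetime_format(
    series: pd.Series,
    sample_size: int = 100,
    formats: Sequence[str] | None = None,
) -> str | None:
    """Return the first candidate format that parses every sampled value."""
    candidates = DEFAULT_DATETIME_FORMATS if formats is None else formats

    # Same sampling as in the PR: first N non-null rows.
    sample = series.dropna().head(sample_size)
    if sample.empty:
        return None

    # Assumed detection loop: a format is accepted only if strict parsing
    # of the whole sample succeeds.
    for fmt in candidates:
        try:
            pd.to_datetime(sample, format=fmt, errors="raise")
        except (ValueError, TypeError):
            continue
        return fmt
    return None
```

Callers can then extend or reorder the candidates, for example `detect_datetime_format(series, formats=["%Y-%m-%d", "%d/%m/%Y"])`, without touching the function body.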
   
   
   </details>
   



##########
superset/utils/core.py:
##########
@@ -1858,6 +1860,62 @@ def get_legacy_time_column(
         )
 
 
+def _process_datetime_column(
+    df: pd.DataFrame,
+    col: DateColumn,
+) -> None:
+    """Process a single datetime column with format detection."""
+    if col.timestamp_format in ("epoch_s", "epoch_ms"):
+        dttm_series = df[col.col_label]
+        if is_numeric_dtype(dttm_series):
+            # Column is formatted as a numeric value
+            unit = col.timestamp_format.replace("epoch_", "")
+            df[col.col_label] = pd.to_datetime(
+                dttm_series,
+                utc=False,
+                unit=unit,
+                origin="unix",
+                errors="coerce",
+                exact=False,
+            )
+        else:
+            # Column has already been formatted as a timestamp.
+            try:
+                df[col.col_label] = dttm_series.apply(
+                    lambda x: pd.Timestamp(x) if pd.notna(x) else pd.NaT
+                )

Review Comment:
   ### Inefficient Row-by-Row Timestamp Processing <sub>![category Performance](https://img.shields.io/badge/Performance-4f46e5)</sub>
   
   <details>
     <summary>Tell me more</summary>
   
   ###### What is the issue?
   Using pandas `apply` with a lambda function to convert timestamps is 
inefficient. The `apply` operation performs row-by-row processing which is much 
slower than vectorized operations.
   
   
   ###### Why this matters
   Row-by-row processing in pandas is significantly slower than vectorized 
operations. This can cause serious performance issues when processing large 
dataframes.
   
   ###### Suggested change ∙ *Feature Preview*
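   Use the vectorized `pd.to_datetime()` operation directly instead of `apply`.
   Note that `errors='coerce'` turns unparseable values into `NaT` instead of
   raising, so the surrounding `try` block would no longer see an exception: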
   ```python
   df[col.col_label] = pd.to_datetime(dttm_series, errors='coerce')
   ```
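
A small sketch comparing the two approaches on toy data. The series contents are made up for illustration; only `pd.to_datetime`, `pd.Timestamp`, and `Series.apply` are real pandas calls. It also shows the behavioral difference that comes with `errors="coerce"`.

```python
import pandas as pd

# Toy column that already holds Timestamp objects plus a missing value.
dttm_series = pd.Series(
    [pd.Timestamp("2024-01-15 10:30:00"), None, pd.Timestamp("2024-02-20")],
    dtype=object,
)

# Row-by-row conversion, as in the PR.
row_by_row = dttm_series.apply(
    lambda x: pd.Timestamp(x) if pd.notna(x) else pd.NaT
)

# Vectorized conversion, as suggested.
vectorized = pd.to_datetime(dttm_series, errors="coerce")

# Behavioral difference: an unparseable value raises in the apply version
# but is silently turned into NaT with errors="coerce".
bad = pd.Series(["2024-01-15", "garbage"], dtype=object)
coerced = pd.to_datetime(bad, errors="coerce")
try:
    bad.apply(lambda x: pd.Timestamp(x) if pd.notna(x) else pd.NaT)
    apply_raised = False
except ValueError:
    apply_raised = True
```

On valid input both versions yield the same timestamps, so the choice mostly comes down to speed versus whether invalid values should raise.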
   
   
   </details>
   



##########
superset/utils/pandas.py:
##########
@@ -0,0 +1,69 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Pandas utilities for data processing."""
+
+import pandas as pd
+
+
+def detect_datetime_format(series: pd.Series, sample_size: int = 100) -> str | None:
+    """
+    Detect the datetime format from a sample of the series.
+
+    :param series: The pandas Series to analyze
+    :param sample_size: Number of rows to sample for format detection
+    :return: Detected format string or None if no consistent format found
+    """
+    # Most common formats first for performance
+    common_formats = [
+        "%Y-%m-%d %H:%M:%S",
+        "%Y-%m-%d",
+        "%Y-%m-%dT%H:%M:%S",
+        "%Y-%m-%dT%H:%M:%SZ",
+        "%Y-%m-%dT%H:%M:%S.%f",
+        "%Y-%m-%dT%H:%M:%S.%fZ",
+        "%m/%d/%Y",
+        "%d/%m/%Y",
+        "%Y/%m/%d",
+        "%m/%d/%Y %H:%M:%S",
+        "%d/%m/%Y %H:%M:%S",
+        "%m-%d-%Y",
+        "%d-%m-%Y",
+        "%Y%m%d",
+    ]
+
+    # Get non-null sample
+    sample = series.dropna().head(sample_size)

Review Comment:
   ### Sample-Based Format Detection May Miss Variations <sub>![category Functionality](https://img.shields.io/badge/Functionality-0284c7)</sub>
   
   <details>
     <summary>Tell me more</summary>
   
   ###### What is the issue?
   The function may pick the wrong datetime format because it only analyzes 
the first `sample_size` non-null rows, so it can miss format variations later 
in the series.
   
   
   ###### Why this matters
   If the datetime format changes after the sampled rows, the function will 
return a format that doesn't work for the entire dataset, leading to parsing 
errors when the format is later used.
   
   ###### Suggested change ∙ *Feature Preview*
   Use random sampling instead of head() to get a more representative sample:
   ```python
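   # Get non-null sample; compare against the non-null count so that
   # sample() is never asked for more rows than are available
   non_null = series.dropna()
   if len(non_null) > sample_size:
       sample = non_null.sample(n=sample_size, random_state=42)
   else:
       sample = non_null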
   ```
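
The sampling step can be sketched as a standalone helper. `sample_non_null` and its `seed` parameter are hypothetical names, not part of the PR. The one detail worth guarding is that the size check uses the non-null count: `Series.sample` raises a `ValueError` when asked for more rows than exist (without replacement), which can happen when a long series is mostly null.

```python
import pandas as pd


def sample_non_null(
    series: pd.Series, sample_size: int = 100, seed: int = 42
) -> pd.Series:
    """Return up to sample_size non-null values, chosen at random."""
    non_null = series.dropna()
    if len(non_null) > sample_size:
        # Fixed seed keeps format detection reproducible between runs.
        return non_null.sample(n=sample_size, random_state=seed)
    return non_null


# 128 rows, but only 28 of them are non-null: checking len(series) instead
# of len(non_null) would request 100 rows from a population of 28.
values = [None] * 100 + [f"2024-01-{day:02d}" for day in range(1, 29)]
mostly_null = pd.Series(values, dtype=object)
small_sample = sample_non_null(mostly_null, sample_size=100)

# A larger series is cut down to exactly sample_size rows.
large = pd.Series([f"row-{i}" for i in range(500)])
large_sample = sample_non_null(large, sample_size=100)
```

Random sampling reduces the bias toward the first rows but still cannot guarantee that a rare format is seen, so the parse step that uses the detected format would still need to handle mismatches.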
   
   
   </details>
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

