korbit-ai[bot] commented on code in PR #35042:
URL: https://github.com/apache/superset/pull/35042#discussion_r2330659035

##########
superset/utils/pandas.py:
##########
@@ -0,0 +1,69 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Pandas utilities for data processing."""
+
+import pandas as pd
+
+
+def detect_datetime_format(series: pd.Series, sample_size: int = 100) -> str | None:
+    """
+    Detect the datetime format from a sample of the series.
+
+    :param series: The pandas Series to analyze
+    :param sample_size: Number of rows to sample for format detection
+    :return: Detected format string or None if no consistent format found
+    """
+    # Most common formats first for performance
+    common_formats = [
+        "%Y-%m-%d %H:%M:%S",
+        "%Y-%m-%d",
+        "%Y-%m-%dT%H:%M:%S",
+        "%Y-%m-%dT%H:%M:%SZ",
+        "%Y-%m-%dT%H:%M:%S.%f",
+        "%Y-%m-%dT%H:%M:%S.%fZ",
+        "%m/%d/%Y",
+        "%d/%m/%Y",
+        "%Y/%m/%d",
+        "%m/%d/%Y %H:%M:%S",
+        "%d/%m/%Y %H:%M:%S",
+        "%m-%d-%Y",
+        "%d-%m-%Y",
+        "%Y%m%d",
+    ]

Review Comment:
### Hardcoded datetime formats violate Open-Closed Principle

<details>
<summary>Tell me more</summary>

###### What is the issue?
The datetime format patterns are hardcoded within the function, making it difficult to extend or modify the supported formats without changing the function code.
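
To make the limitation concrete, here is a minimal sketch, assuming a simplified stand-in for the detection loop (the hunk above does not show the loop itself). `DEFAULT_DATETIME_FORMATS`, `detect_format`, and the shortened format list are hypothetical names used only for illustration; they are not taken from the PR.

```python
import pandas as pd

# Hypothetical module-level default: a shortened stand-in for the PR's list.
DEFAULT_DATETIME_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%m/%d/%Y"]


def detect_format(series, formats=None, sample_size=100):
    """Return the first format that parses every sampled value, else None."""
    # Fall back to the defaults only when the caller passes nothing.
    candidates = DEFAULT_DATETIME_FORMATS if formats is None else formats
    sample = series.dropna().head(sample_size)
    if sample.empty:
        return None
    for fmt in candidates:
        parsed = pd.to_datetime(sample, format=fmt, errors="coerce")
        if parsed.notna().all():
            return fmt
    return None


# Dot-separated dates are not covered by the default list.
dotted = pd.Series(["31.12.2024", "01.01.2025"])
print(detect_format(dotted))  # None

# With a caller-supplied list, the extra format is supported without
# touching the function body.
extended = [*DEFAULT_DATETIME_FORMATS, "%d.%m.%Y"]
print(detect_format(dotted, formats=extended))  # %d.%m.%Y
```

With a hardcoded list, the first call is the only behaviour available; the `formats` parameter is what lets a caller cover the second case.
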

###### Why this matters
This violates the Open-Closed Principle (part of SOLID) and reduces flexibility. If new datetime formats need to be supported, the function must be modified rather than configured.

###### Suggested change ∙ *Feature Preview*
Move the formats to a configuration that can be passed as a parameter or loaded from settings:

```python
def detect_datetime_format(
    series: pd.Series,
    sample_size: int = 100,
    formats: list[str] | None = None
) -> str | None:
    common_formats = formats or DEFAULT_DATETIME_FORMATS
    # rest of the function
```

</details>

<sub>
💬 Looking for more details? Reply to this comment to chat with Korbit.
</sub>


##########
superset/utils/core.py:
##########
@@ -1858,6 +1860,62 @@ def get_legacy_time_column(
     )
 
 
+def _process_datetime_column(
+    df: pd.DataFrame,
+    col: DateColumn,
+) -> None:
+    """Process a single datetime column with format detection."""
+    if col.timestamp_format in ("epoch_s", "epoch_ms"):
+        dttm_series = df[col.col_label]
+        if is_numeric_dtype(dttm_series):
+            # Column is formatted as a numeric value
+            unit = col.timestamp_format.replace("epoch_", "")
+            df[col.col_label] = pd.to_datetime(
+                dttm_series,
+                utc=False,
+                unit=unit,
+                origin="unix",
+                errors="coerce",
+                exact=False,
+            )
+        else:
+            # Column has already been formatted as a timestamp.
+            try:
+                df[col.col_label] = dttm_series.apply(
+                    lambda x: pd.Timestamp(x) if pd.notna(x) else pd.NaT
+                )

Review Comment:
### Inefficient Row-by-Row Timestamp Processing

<details>
<summary>Tell me more</summary>

###### What is the issue?
Using pandas `apply` with a lambda function to convert timestamps is inefficient. The `apply` operation performs row-by-row processing, which is much slower than vectorized operations.

###### Why this matters
Row-by-row processing in pandas is significantly slower than vectorized operations. This can cause serious performance issues when processing large dataframes.
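
As a rough illustration, the sketch below converts the same column both ways and checks that the results agree. The sample values are hypothetical, and the snippet only compares results; it does not measure timing.

```python
import pandas as pd

# Hypothetical sample data: values already formatted as timestamps, plus a null.
dttm_series = pd.Series(["2024-01-01 00:00:00", None, "2024-01-03 12:30:00"])

# Row-by-row conversion, mirroring the lambda in the hunk under review.
row_by_row = dttm_series.apply(
    lambda x: pd.Timestamp(x) if pd.notna(x) else pd.NaT
)

# Vectorized conversion of the whole Series in a single call.
vectorized = pd.to_datetime(dttm_series)

# Both approaches agree on which rows are null and on the parsed values.
same_nulls = bool((row_by_row.isna() == vectorized.isna()).all())
same_values = list(row_by_row.dropna()) == list(vectorized.dropna())
print(same_nulls, same_values)

# One behavioural difference: errors="coerce" maps unparseable values to NaT,
# whereas pd.Timestamp raises on them.
coerced = pd.to_datetime(pd.Series(["2024-01-01", "not a date"]), errors="coerce")
print(coerced.isna().tolist())
```

The last two lines matter when swapping one approach for the other: a vectorized call with `errors="coerce"` silently yields `NaT` where the row-by-row version would raise.
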

###### Suggested change ∙ *Feature Preview*
Use the vectorized `pd.to_datetime()` operation directly instead of `apply`:

```python
df[col.col_label] = pd.to_datetime(dttm_series, errors='coerce')
```

Note that `errors='coerce'` converts unparseable values to `NaT` instead of raising, so the surrounding `try` block would no longer see those failures.

</details>

<sub>
💬 Looking for more details? Reply to this comment to chat with Korbit.
</sub>


##########
superset/utils/pandas.py:
##########
@@ -0,0 +1,69 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Pandas utilities for data processing."""
+
+import pandas as pd
+
+
+def detect_datetime_format(series: pd.Series, sample_size: int = 100) -> str | None:
+    """
+    Detect the datetime format from a sample of the series.
+
+    :param series: The pandas Series to analyze
+    :param sample_size: Number of rows to sample for format detection
+    :return: Detected format string or None if no consistent format found
+    """
+    # Most common formats first for performance
+    common_formats = [
+        "%Y-%m-%d %H:%M:%S",
+        "%Y-%m-%d",
+        "%Y-%m-%dT%H:%M:%S",
+        "%Y-%m-%dT%H:%M:%SZ",
+        "%Y-%m-%dT%H:%M:%S.%f",
+        "%Y-%m-%dT%H:%M:%S.%fZ",
+        "%m/%d/%Y",
+        "%d/%m/%Y",
+        "%Y/%m/%d",
+        "%m/%d/%Y %H:%M:%S",
+        "%d/%m/%Y %H:%M:%S",
+        "%m-%d-%Y",
+        "%d-%m-%Y",
+        "%Y%m%d",
+    ]
+
+    # Get non-null sample
+    sample = series.dropna().head(sample_size)

Review Comment:
### Sample-Based Format Detection May Miss Variations

<details>
<summary>Tell me more</summary>

###### What is the issue?
The function may incorrectly identify a datetime format by only analyzing the first N rows, potentially missing format variations later in the series.

###### Why this matters
If the datetime format changes after the sampled rows, the function will return a format that doesn't work for the entire dataset, leading to parsing errors when the format is later used.
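
The failure mode can be reproduced with a small sketch. The data, the `all_match` helper, and the chosen format are hypothetical and exist only to show the difference between `head()` and a random sample.

```python
import pandas as pd

# Hypothetical data: the first 100 rows are ISO dates, the next 900 are day-first.
series = pd.Series(["2024-01-15"] * 100 + ["15/01/2024"] * 900)
sample_size = 100
fmt = "%Y-%m-%d"


def all_match(sample, fmt):
    """Return True if every value in the sample parses with the given format."""
    parsed = pd.to_datetime(sample, format=fmt, errors="coerce")
    return bool(parsed.notna().all())


non_null = series.dropna()

# head() only sees the leading ISO rows, so the format looks consistent.
head_sample = non_null.head(sample_size)

# A random sample is drawn from the whole column and picks up the later rows.
random_sample = non_null.sample(n=min(sample_size, len(non_null)), random_state=42)

print(all_match(head_sample, fmt))
print(all_match(random_sample, fmt))
print(all_match(non_null, fmt))
```

Random sampling lowers the chance of this mismatch but does not remove it: a format that appears in only a handful of rows can still be absent from the sample.
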

###### Suggested change ∙ *Feature Preview*
Use random sampling instead of `head()` to get a more representative sample. The length check is applied after dropping nulls so that `sample()` is never asked for more rows than remain:

```python
# Get non-null sample
non_null = series.dropna()
if len(non_null) > sample_size:
    sample = non_null.sample(n=sample_size, random_state=42)
else:
    sample = non_null
```

</details>

<sub>
💬 Looking for more details? Reply to this comment to chat with Korbit.
</sub>


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
