aminghadersohi opened a new pull request, #35042:
URL: https://github.com/apache/superset/pull/35042
<!---
PR Title: fix(utils): Suppress pandas date parsing warnings in
normalize_dttm_col
-->
### SUMMARY
<!--- Describe the change below, including rationale and design decisions -->
This PR fixes high-volume pandas warnings that appear in production logs
when parsing datetime columns without explicit formats. The warning `"Could not
infer format, so each element will be parsed individually, falling back to
dateutil"` was flooding our monitoring systems and masking other important
issues.
**Root Cause:**
When `normalize_dttm_col()` processes a datetime column without an explicit
format, pandas attempts to infer one. When inference fails (because of mixed
formats or ambiguous data), pandas falls back to element-by-element parsing
with dateutil and emits the warning on every such call.
**Solution:**
1. **Format Detection**: Added a `detect_datetime_format()` function that
samples 100 rows to detect common date formats (ISO, US, EU, etc.)
2. **Vectorized Parsing**: When a format is detected, it is passed explicitly
for ~5x faster vectorized parsing
3. **Warning Suppression**: When formats are mixed or ambiguous, the warning is
suppressed and the existing fallback parsing is kept
4. **Code Refactoring**: Extracted the logic into a `_process_datetime_column()`
helper to reduce complexity
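The format-detection step can be illustrated with a minimal, stdlib-only sketch. It is not the PR's implementation: the function name `detect_format_sketch`, the `CANDIDATE_FORMATS` list, and its ordering are assumptions made for illustration.

```python
from datetime import datetime
from typing import Iterable, Optional

# Hypothetical candidate list; the formats checked in the PR may differ.
CANDIDATE_FORMATS = (
    "%Y-%m-%d",           # ISO date
    "%Y-%m-%d %H:%M:%S",  # ISO date and time
    "%m/%d/%Y",           # US style
    "%d/%m/%Y",           # EU style
)


def _matches(value: str, fmt: str) -> bool:
    """Return True if `value` parses cleanly with `fmt`."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


def detect_format_sketch(
    values: Iterable[object], sample_size: int = 100
) -> Optional[str]:
    """Return the first candidate format that fits every sampled value, else None."""
    sample = []
    for value in values:
        if value is None:
            continue  # skip missing values instead of sampling them
        sample.append(str(value))
        if len(sample) >= sample_size:
            break
    if not sample:
        return None
    for fmt in CANDIDATE_FORMATS:
        if all(_matches(value, fmt) for value in sample):
            return fmt
    return None  # mixed or unrecognized formats: caller falls back
```

A detected format could then be passed to `pd.to_datetime(series, format=fmt)`, while `None` signals the fallback path. Note that a sample such as `01/02/2023` fits both the US and EU patterns, so in this sketch the order of the candidate list decides.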
**Performance Impact:**
- Consistent date formats: ~5x faster due to vectorized parsing
- Mixed formats: Same speed but no warning spam
- Detection overhead: Negligible (only samples 100 rows)
This approach mirrors pandas 2.0+, where `to_datetime` infers a single format
and applies it to the whole column, and it follows common practice for
datetime parsing at scale.
### BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
<!--- Skip this if not applicable -->
**Before (Datadog logs):**
```
WARNING | superset.utils.core:1698 | UserWarning: Could not infer format, so
each element will be parsed individually, falling back to `dateutil`. To ensure
parsing is consistent and as-expected, please specify a format.
[Repeated hundreds of times per hour]
```
**After:**
```
[No warnings - clean logs]
```
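Clean logs of this kind depend on filtering one specific warning rather than silencing everything. A minimal sketch using the standard `warnings` module follows; the helper name `call_without_inference_warning` and the stand-in `noisy_parser` are hypothetical and are not taken from the PR.

```python
import warnings
from typing import Any, Callable


def call_without_inference_warning(
    parser: Callable[..., Any], *args: Any, **kwargs: Any
) -> Any:
    """Call `parser`, hiding only the 'Could not infer format' UserWarning."""
    with warnings.catch_warnings():
        # `message` is a regex matched against the start of the warning text.
        warnings.filterwarnings(
            "ignore",
            message="Could not infer format",
            category=UserWarning,
        )
        return parser(*args, **kwargs)


def noisy_parser(values):
    """Stand-in for a parser that emits the inference warning (assumption)."""
    warnings.warn(
        "Could not infer format, so each element will be parsed "
        "individually, falling back to `dateutil`.",
        UserWarning,
    )
    return list(values)
```

In real code the wrapped callable would be `pd.to_datetime`. The filter is scoped to the `with` block, so other warnings, and the same warning raised elsewhere, still reach the logs.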
**Performance Comparison (10k rows):**
```
Before: 4.9ms (element-by-element parsing)
After: 0.9ms (vectorized parsing with format detection)
Speedup: 5.4x
```
### TESTING INSTRUCTIONS
<!--- Required! What steps can be taken to manually verify the changes? -->
1. **Run the comprehensive test suite:**
```bash
pytest tests/unit_tests/utils/test_date_parsing.py -v
```
This includes:
- Format detection tests
- Warning suppression verification
- Performance comparisons
- Edge case handling
2. **Manual testing with sample data:**
```python
import pandas as pd
from superset.utils.core import normalize_dttm_col, DateColumn

# Test with consistent format (should be fast, no warnings)
df = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-03"]
})
normalize_dttm_col(df, (DateColumn(col_label="date"),))

# Test with mixed formats (should suppress warnings)
df = pd.DataFrame({
    "date": ["2023-01-01", "01/02/2023", "March 3, 2023"]
})
normalize_dttm_col(df, (DateColumn(col_label="date"),))
```
3. **Verify in a running Superset instance:**
- Create a chart with datetime columns
- Check logs for absence of "Could not infer format" warnings
- Verify dates are parsed correctly
4. **Check existing functionality:**
- Epoch timestamps still work: `DateColumn(col_label="date", timestamp_format="epoch_s")`
- Explicit formats still work: `DateColumn(col_label="date", timestamp_format="%Y-%m-%d")`
- Timezone offsets are still applied correctly
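
As a reference point for these regression checks, the two `timestamp_format` styles listed above can be contrasted in a small stdlib sketch. The helper `parse_with_format_sketch` is hypothetical and does not reproduce Superset's code; it only shows what each setting is expected to mean for a single value.

```python
from datetime import datetime, timezone


def parse_with_format_sketch(value: object, timestamp_format: str) -> datetime:
    """Parse one value as epoch seconds or with a strftime-style pattern."""
    if timestamp_format == "epoch_s":
        # Interpret the value as seconds since 1970-01-01 UTC and
        # return a naive datetime expressed in UTC.
        aware = datetime.fromtimestamp(float(value), tz=timezone.utc)
        return aware.replace(tzinfo=None)
    # Otherwise treat timestamp_format as an explicit strftime pattern.
    return datetime.strptime(str(value), timestamp_format)
```

Comparing the output of `normalize_dttm_col` against values computed this way is one possible way to confirm that explicit formats bypass the new detection path unchanged.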
### ADDITIONAL INFORMATION
<!--- Check any relevant boxes with "x" -->
<!--- HINT: Include "Fixes #nnn" if you are fixing an existing issue -->
- [ ] Has associated issue:
- [ ] Required feature flags:
- [ ] Changes UI
- [ ] Includes DB Migration (follow approval process in
[SIP-59](https://github.com/apache/superset/issues/13351))
- [ ] Migration is atomic, supports rollback & is backwards-compatible
- [ ] Confirm DB migration upgrade and downgrade tested
- [ ] Runtime estimates and downtime expectations provided
- [ ] Introduces new feature or API
- [ ] Removes existing feature or API
**Notes:**
- No breaking changes - all existing functionality preserved
- Follows an approach similar to pandas 2.0+ built-in behavior
- Aligns with Superset's SIP-15A proposal for datetime format inference
- All pre-commit hooks pass (mypy, ruff, pylint)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]