theodop opened a new issue, #45630: URL: https://github.com/apache/arrow/issues/45630
### Describe the bug, including details regarding any error messages, version, and platform.

We are reading data from a Parquet file (as part of a Delta Lake table) using `Dataset.to_batches()`. The data includes a column stored in the Parquet "timestamp" format. We read with the default settings and map the rows to objects, like so:

```python
result_list = []
for batch in dataset.to_batches(
    columns=["StartDate", "ChildObjectName", "Generation"]
):
    row_list = zip(*batch.columns)
    for row in row_list:
        result_list.append(
            dict(
                start_date=row[0].as_py(),
                child_object_name=row[1].as_py(),
                generation=row[2].as_py(),
            )
        )
```

However, this was taking over 20 seconds per batch at a batch size of roughly 100,000 rows. Eventually we narrowed it down to `row[0].as_py()`, and substituted `datetime.fromtimestamp(row[0].value / 1_000_000)`, which brought this to under a second per batch (a fuller sketch of that substitution is appended at the end of this issue). My hunch is that the conversion is either checking for the presence of pandas on every call, or that this is a side effect of Windows not having a timezone named "UTC" defined (these are UTC timestamps) and Python falling back to the pytz database each time.

### Component(s)

Python, Parquet
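For reference, a minimal sketch of the loop with the workaround applied, under the same assumptions as the snippet above: `dataset` is the same PyArrow dataset, and the `StartDate` column has microsecond resolution (hence dividing the raw `.value` integer by 1,000,000). Whether the naive `datetime` returned here matches `as_py()` exactly depends on the column's timezone metadata; passing `tz=timezone.utc` would give timezone-aware UTC values instead.

```python
from datetime import datetime

result_list = []
for batch in dataset.to_batches(
    columns=["StartDate", "ChildObjectName", "Generation"]
):
    for row in zip(*batch.columns):
        # Bypass TimestampScalar.as_py() and convert the raw stored integer
        # directly; .value is the timestamp in the column's unit
        # (assumed microseconds here).
        start_date = datetime.fromtimestamp(row[0].value / 1_000_000)
        result_list.append(
            dict(
                start_date=start_date,
                child_object_name=row[1].as_py(),
                generation=row[2].as_py(),
            )
        )
```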