theodop opened a new issue, #45630: URL: https://github.com/apache/arrow/issues/45630
### Describe the bug, including details regarding any error messages, version, and platform.

We are reading data from a Parquet file (as part of a Delta Lake table) using `Dataset.to_batches()`. The data includes a column stored in the Parquet "timestamp" format. We read with the default settings and map the rows to objects, like so:

```python
result_list = []
for batch in dataset.to_batches(
    columns=["StartDate", "ChildObjectName", "Generation"]
):
    row_list = zip(*batch.columns)
    for row in row_list:
        result_list.append(
            dict(
                start_date=row[0].as_py(),
                child_object_name=row[1].as_py(),
                generation=row[2].as_py(),
            )
        )
```

However, this was taking over 20 seconds per batch at a batch size of roughly 100,000 rows. Eventually we narrowed it down to `row[0].as_py()`, and substituted `datetime.fromtimestamp(row[0].value / 1_000_000)`, which brought this to under a second per batch (a fuller sketch of that substitution is appended at the end of this issue). My hunch is that the conversion is either checking for the presence of pandas on every call, or that this is a side effect of Windows not having a timezone named "UTC" defined (these are UTC timestamps) and Python falling back to the pytz database each time.

### Component(s)

Python, Parquet
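For reference, a minimal sketch of the loop with the workaround applied, under the same assumptions as the snippet above: `dataset` is the same PyArrow dataset, and the `StartDate` column has microsecond resolution (hence dividing the raw `.value` integer by 1,000,000). Whether the naive `datetime` returned here matches `as_py()` exactly depends on the column's timezone metadata; passing `tz=timezone.utc` would give timezone-aware UTC values instead.

```python
from datetime import datetime

result_list = []
for batch in dataset.to_batches(
    columns=["StartDate", "ChildObjectName", "Generation"]
):
    for row in zip(*batch.columns):
        # Bypass TimestampScalar.as_py() and convert the raw stored integer
        # directly; .value is the timestamp in the column's unit
        # (assumed microseconds here).
        start_date = datetime.fromtimestamp(row[0].value / 1_000_000)
        result_list.append(
            dict(
                start_date=start_date,
                child_object_name=row[1].as_py(),
                generation=row[2].as_py(),
            )
        )
```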