[I] iter_batches on a parquet file with zero row groups fails in pyarrow>=18 [arrow]

via GitHub Thu, 12 Dec 2024 02:35:38 -0800


cedriccuypers opened a new issue, #45009:
URL: https://github.com/apache/arrow/issues/45009


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   We noticed a bug in pyarrow while iterating in batches over parquet files, 
some of which had zero row groups (the files are output of the AWS S3 
Inventory service).
   
   Code snippet to reproduce the issue:
   
   ```
   import fastparquet
   import pandas as pd
   import pyarrow
   import pyarrow.parquet as pq
   
   print(f"Using pyarrow version {pyarrow.__version__}")
   
   df = pd.DataFrame({
       "a": pd.Series(dtype="int"),
       "b": pd.Series(dtype="str"),
       "c": pd.Series(dtype="float"),
   })
   
   empty_parquet_file_path = "my_empty_parquet_file.parquet"
   
   fastparquet.write(empty_parquet_file_path, df, row_group_offsets=[])
   
   assert pq.read_metadata(empty_parquet_file_path).num_row_groups == 0
   
   parquet_file = pq.ParquetFile(empty_parquet_file_path)
   
   for batch in parquet_file.iter_batches():
       print(batch)
   ```
   
   The following error is raised when using pyarrow 18.0.0 or 18.1.0. 
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/_parquet.pyx", line 1634, in iter_batches
     File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
   OSError: The file only has 0 row groups, requested metadata for row group: -1
   ```
   
   In pyarrow 17 there is no error: calling iter_batches on an empty parquet 
file doesn't seem to produce any batches, which is the behaviour I would 
expect.
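
Until this is resolved, one possible workaround is to check the row-group count before iterating. The following is only a minimal sketch of that idea: the helper name `iter_batches_skipping_empty` is hypothetical and not part of pyarrow, and it assumes the object passed in behaves like `pq.ParquetFile` (it exposes `metadata.num_row_groups` and `iter_batches`).

```python
def iter_batches_skipping_empty(parquet_file, **kwargs):
    """Yield record batches from a ParquetFile-like object.

    Hypothetical helper (not a pyarrow API): a file with zero row
    groups yields nothing, matching the pyarrow 17 behaviour described
    above, instead of reaching iter_batches and raising OSError.
    """
    # Guard: with no row groups there is nothing to read, so stop early
    # without ever calling iter_batches.
    if parquet_file.metadata.num_row_groups == 0:
        return
    # Otherwise forward any keyword arguments (e.g. batch_size, columns).
    yield from parquet_file.iter_batches(**kwargs)


# Usage with the reproduction above (requires pyarrow):
#
#   import pyarrow.parquet as pq
#   parquet_file = pq.ParquetFile("my_empty_parquet_file.parquet")
#   for batch in iter_batches_skipping_empty(parquet_file):
#       print(batch)
```

This only sidesteps the error on the caller's side; it does not address the regression itself.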
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
