cedriccuypers opened a new issue, #45009:
URL: https://github.com/apache/arrow/issues/45009
### Describe the bug, including details regarding any error messages, version, and platform.
We noticed a bug in pyarrow while iterating in batches over parquet files,
some of which had zero row groups (they were produced by the AWS S3
Inventory service).
Code snippet to reproduce the issue:
```
import fastparquet
import pandas as pd
import pyarrow
import pyarrow.parquet as pq
print(f"Using pyarrow version {pyarrow.__version__}")
df = pd.DataFrame({
    "a": pd.Series(dtype="int"),
    "b": pd.Series(dtype="str"),
    "c": pd.Series(dtype="float"),
})
empty_parquet_file_path = "my_empty_parquet_file.parquet"
fastparquet.write(empty_parquet_file_path, df, row_group_offsets=[])
assert pq.read_metadata(empty_parquet_file_path).num_row_groups == 0
parquet_file = pq.ParquetFile(empty_parquet_file_path)
for batch in parquet_file.iter_batches():
    print(batch)
```
The following error is raised when using pyarrow 18.0.0 or 18.1.0.
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_parquet.pyx", line 1634, in iter_batches
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: The file only has 0 row groups, requested metadata for row group: -1
```
In pyarrow 17, there is no issue, and an empty parquet file doesn't seem to
produce any batches when calling iter_batches, which is the behaviour I would
expect.
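Until this is fixed, a possible workaround is to check the row group count before iterating. The sketch below is only an illustration: `iter_batches_skipping_empty` is a hypothetical helper name, not part of pyarrow, and it assumes the object passed in behaves like `pq.ParquetFile` (it exposes `metadata.num_row_groups` and `iter_batches`).

```python
def iter_batches_skipping_empty(parquet_file, **kwargs):
    """Yield record batches, or nothing if the file has zero row groups.

    `parquet_file` is assumed to be a pyarrow.parquet.ParquetFile, or any
    object exposing `metadata.num_row_groups` and `iter_batches(**kwargs)`.
    This helper is hypothetical and only guards the call; it does not
    change pyarrow's behaviour.
    """
    if parquet_file.metadata.num_row_groups == 0:
        # Mirror the pyarrow 17 behaviour described above: no batches.
        return
    yield from parquet_file.iter_batches(**kwargs)
```

With the reproduction above, the loop would become `for batch in iter_batches_skipping_empty(parquet_file):`, which should simply not execute its body for the empty file.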
### Component(s)
Python