tmontes opened a new issue, #44352:
URL: https://github.com/apache/arrow/issues/44352
### Describe the bug, including details regarding any error messages, version, and platform.
Hi Arrow team, thanks for sharing such a powerful and fundamental data-handling library! :)
I'm failing to read a hive-partitioned Parquet dataset whose partition columns have a leading underscore in their names, using the latest Pandas 2.2.3 + PyArrow 17.0.0 combination.
I admit I might be doing something wrong, but I found nothing to guide me after browsing the docs, searching the web, and even asking a few LLMs (!!!)... The fact is that other tools, like **duckdb**, which I also use often, have no issue reading the same dataset.
REPRODUCTION:
```
import pathlib
import tempfile
import pandas as pd
import pyarrow.dataset as ds
YEAR_COLUMN = '_year'
FILE_COLUMN = '_file'
with tempfile.TemporaryDirectory() as td:
    dataset_path = pathlib.Path(td) / 'dataset'

    # create parquet dataset partitioned by YEAR_COLUMN / FILE_COLUMN
    pd.DataFrame([
        {'data': 0, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 1, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 2, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 4, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 5, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 6, YEAR_COLUMN: 2021, FILE_COLUMN: 'b'},
        {'data': 7, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
        {'data': 8, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
    ]).to_parquet(
        dataset_path,
        partition_cols=[YEAR_COLUMN, FILE_COLUMN],
        index=False,
    )

    # get dataset row count for a given FILE_COLUMN value: 'a' in this case
    dataset = ds.dataset(
        dataset_path,
        partitioning=ds.partitioning(flavor='hive'),
    )
    row_count_for_file_a = sum(
        batch.num_rows
        for batch in dataset.to_batches(
            columns=[YEAR_COLUMN],
            filter=(ds.field(FILE_COLUMN) == 'a'),
        )
    )
    assert row_count_for_file_a == 2
```
FAILURE:
```
$ python x.py
Traceback (most recent call last):
  File ".../x.py", line 39, in <module>
    for batch in dataset.to_batches(
                 ^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 475, in pyarrow._dataset.Dataset.to_batches
  File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 3557, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 3475, in pyarrow._dataset.Scanner._make_scan_options
  File "pyarrow/_dataset.pyx", line 3409, in pyarrow._dataset._populate_builder
  File "pyarrow/_compute.pyx", line 2724, in pyarrow._compute._bind
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(_file) in
```
MORE:
* Removing the leading underscore from only one of the partitioning columns still fails.
* The code works only when neither partitioning column has a leading underscore.
| | `YEAR_COLUMN='_year'` | `YEAR_COLUMN='year'` |
|----|-----------------------|-------------------------|
| **`FILE_COLUMN='_file'`** | No match for FieldRef.Name(_file) | No match for FieldRef.Name(_file) |
| **`FILE_COLUMN='file'`** | No match for FieldRef.Name(file) | works |
LASTLY:
* A consistent observation: `pd.read_parquet` on the same dataset returns an empty dataframe, I suspect for precisely the same underlying reason.
QUESTION:
* I found no docs stating that leading underscores in hive-partition column
names are invalid: maybe I missed them.
* Could this be a bug? Or am I coding it wrong?
Thanks for the Arrow project and any insight/assistance on this.
### Component(s)
Python