tmontes opened a new issue, #44352: URL: https://github.com/apache/arrow/issues/44352
### Describe the bug, including details regarding any error messages, version, and platform.

Hi Arrow team, thanks for sharing such a powerful and fundamental data-handling lib! :)

I'm failing to read a hive-partitioned Parquet dataset when the partition columns have a leading underscore in their names, using the latest Pandas 2.2.3 + PyArrow 17.0.0 combination. I admit I might be doing something wrong, but found nothing to guide me after browsing the docs, searching the web, and even asking a few LLMs (!!!)... The fact is that other tools, like **duckdb**, which I also use often, have no issue reading the same dataset.

REPRODUCTION:

```python
import pathlib
import tempfile

import pandas as pd
import pyarrow.dataset as ds

YEAR_COLUMN = '_year'
FILE_COLUMN = '_file'

with tempfile.TemporaryDirectory() as td:
    dataset_path = pathlib.Path(td) / 'dataset'

    # create parquet dataset partitioned by YEAR_COLUMN / FILE_COLUMN
    pd.DataFrame([
        {'data': 0, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 1, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 2, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 4, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 5, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 6, YEAR_COLUMN: 2021, FILE_COLUMN: 'b'},
        {'data': 7, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
        {'data': 8, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
    ]).to_parquet(
        dataset_path,
        partition_cols=[YEAR_COLUMN, FILE_COLUMN],
        index=False,
    )

    # get dataset row count for a given FILE_COLUMN value: 'a' in this case
    dataset = ds.dataset(
        dataset_path,
        partitioning=ds.partitioning(flavor='hive'),
    )
    row_count_for_file_a = sum(
        batch.num_rows
        for batch in dataset.to_batches(
            columns=[YEAR_COLUMN],
            filter=(ds.field(FILE_COLUMN) == 'a'),
        )
    )
    assert row_count_for_file_a == 2
```

FAILURE:

```
$ python x.py
Traceback (most recent call last):
  File ".../x.py", line 39, in <module>
    for batch in dataset.to_batches(
                 ^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 475, in pyarrow._dataset.Dataset.to_batches
  File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 3557, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 3475, in pyarrow._dataset.Scanner._make_scan_options
  File "pyarrow/_dataset.pyx", line 3409, in pyarrow._dataset._populate_builder
  File "pyarrow/_compute.pyx", line 2724, in pyarrow._compute._bind
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(_file) in
```

MORE:

* Removing the leading underscore from only one of the partitioning columns still fails.
* Only with no leading underscores in any of the partitioning columns does the code work.

|                           | `YEAR_COLUMN='_year'`                | `YEAR_COLUMN='year'`                 |
|---------------------------|--------------------------------------|--------------------------------------|
| **`FILE_COLUMN='_file'`** | No match for FieldRef.Name(_file)    | No match for FieldRef.Name(_file)    |
| **`FILE_COLUMN='file'`**  | No match for FieldRef.Name(file)     | works                                |

LASTLY:

* A consistent observation is that a Pandas `pd.read_parquet` of said dataset returns an empty dataframe, I suspect for precisely the same underlying reasons.

QUESTION:

* I found no docs stating that leading underscores in hive-partition column names are invalid: maybe I missed them.
* Could this be a bug? Or am I coding it wrong?

Thanks for the Arrow project and any insight/assistance on this.

### Component(s)

Python