tmontes opened a new issue, #44352:
URL: https://github.com/apache/arrow/issues/44352

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hi Arrow team, thanks for sharing such a powerful and fundamental data 
handling lib! :)
   
   
   I'm failing to read a hive-partitioned Parquet dataset when the partition columns have a leading underscore in their names, using the latest Pandas 2.2.3 + PyArrow 17.0.0 combination.
   
   I admit I might be doing something wrong, but I found nothing to guide me after browsing the docs, searching the web, and even asking a few LLMs (!!!)... The fact is that other tools, like **duckdb**, which I also use often, have no issue reading the same dataset.
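
   For comparison, this is roughly the **duckdb** check I mean (a minimal sketch; `dataset_path` is assumed to point at the dataset directory created by the reproduction below):

   ```
   import duckdb

   # Sketch: read the same hive-partitioned dataset with duckdb.
   # With hive_partitioning=true, duckdb reads it without complaint here,
   # partition columns (leading underscores included) and all.
   df = duckdb.sql(
       f"SELECT * FROM read_parquet('{dataset_path}/**/*.parquet', hive_partitioning=true)"
   ).df()
   print(df.columns.tolist())
   ```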
   
   REPRODUCTION:
   
   ```
   import pathlib
   import tempfile
   
   import pandas as pd
   import pyarrow.dataset as ds
   
   
   YEAR_COLUMN = '_year'
   FILE_COLUMN = '_file'
   
   
   with tempfile.TemporaryDirectory() as td:
   
       dataset_path = pathlib.Path(td) / 'dataset'
   
       # create parquet dataset partitioned by YEAR_COLUMN / FILE_COLUMN
       pd.DataFrame([
           {'data': 0, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
           {'data': 1, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
           {'data': 2, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
           {'data': 4, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
           {'data': 5, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
           {'data': 6, YEAR_COLUMN: 2021, FILE_COLUMN: 'b'},
           {'data': 7, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
           {'data': 8, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
       ]).to_parquet(
           dataset_path,
           partition_cols=[YEAR_COLUMN, FILE_COLUMN],
           index=False,
       )
   
       # get dataset row_count for a given FILE_COLUMN value: 'a' in this case
       dataset = ds.dataset(
           dataset_path,
           partitioning=ds.partitioning(flavor='hive')
       )
       row_count_for_file_a = sum(
           batch.num_rows
           for batch in dataset.to_batches(
               columns=[YEAR_COLUMN],
               filter=(ds.field(FILE_COLUMN) == 'a')
           )
       )
       assert row_count_for_file_a == 2
   ```
   
   FAILURE:
   
   ```
   $ python x.py
   Traceback (most recent call last):
     File ".../x.py", line 39, in <module>
       for batch in dataset.to_batches(
                    ^^^^^^^^^^^^^^^^^^^
     File "pyarrow/_dataset.pyx", line 475, in 
pyarrow._dataset.Dataset.to_batches
     File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.scanner
     File "pyarrow/_dataset.pyx", line 3557, in 
pyarrow._dataset.Scanner.from_dataset
     File "pyarrow/_dataset.pyx", line 3475, in 
pyarrow._dataset.Scanner._make_scan_options
     File "pyarrow/_dataset.pyx", line 3409, in 
pyarrow._dataset._populate_builder
     File "pyarrow/_compute.pyx", line 2724, in pyarrow._compute._bind
     File "pyarrow/error.pxi", line 155, in 
pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(_file) in
   ```
   
   MORE:
   
   * Removing the leading underscore from just one of the two partitioning columns still fails.
   * The code only works when neither partitioning column name has a leading underscore, as summarized in the table and the sweep sketch below.
   
   
   |  | `YEAR_COLUMN='_year'` | `YEAR_COLUMN='year'` |
   |----|----|----|
   | **`FILE_COLUMN='_file'`** | No match for FieldRef.Name(_file) | No match for FieldRef.Name(_file) |
   | **`FILE_COLUMN='file'`** | No match for FieldRef.Name(file) | works |
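
   The table just reflects re-running the reproduction with each name combination; a sweep along the lines of the sketch below (the `try_names` helper is only illustrative, not part of the original script) produces the same four outcomes:

   ```
   import pathlib
   import tempfile

   import pandas as pd
   import pyarrow.dataset as ds


   def try_names(year_col, file_col):
       # Build a small hive-partitioned dataset with the given column names,
       # then attempt the filtered scan; report 'works' or the error message.
       with tempfile.TemporaryDirectory() as td:
           dataset_path = pathlib.Path(td) / 'dataset'
           pd.DataFrame([
               {'data': 0, year_col: 2020, file_col: 'a'},
               {'data': 1, year_col: 2021, file_col: 'b'},
           ]).to_parquet(dataset_path, partition_cols=[year_col, file_col], index=False)
           dataset = ds.dataset(dataset_path, partitioning=ds.partitioning(flavor='hive'))
           try:
               sum(
                   batch.num_rows
                   for batch in dataset.to_batches(filter=(ds.field(file_col) == 'a'))
               )
               return 'works'
           except Exception as exc:
               return str(exc)


   for year_col in ('_year', 'year'):
       for file_col in ('_file', 'file'):
           print(f"{year_col!r} / {file_col!r}: {try_names(year_col, file_col)}")
   ```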
   
   LASTLY:
   
   * A consistent observation is that a Pandas `pd.read_parquet` of the same dataset returns an empty dataframe, which I suspect is due to the same underlying cause (snippet below).
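
   Concretely (a sketch, run inside the same `with` block as the reproduction, so `dataset_path` still exists):

   ```
   import pandas as pd

   # Sketch: read the whole dataset back with pandas.
   df = pd.read_parquet(dataset_path)
   print(len(df))  # 0 here, i.e. an empty dataframe
   ```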
   
   QUESTION:
   
   * I found no docs stating that leading underscores in hive-partition column names are invalid; maybe I missed them.
   * Could this be a bug? Or am I coding it wrong?
   
   Thanks for the Arrow project and any insight/assistance on this.
   
   
   ### Component(s)
   
   Python

