[I] Integer dictionary bitwidth preservation breaks multi-file read behaviour in pyarrow 20 [arrow]

via GitHub Wed, 28 May 2025 07:44:57 -0700


cjrh opened a new issue, #46629:
URL: https://github.com/apache/arrow/issues/46629


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   This issue was first posed as a question in #30302. I repeat the text here 
for convenience.
   
   # Overview
   
   I have a directory of parquet files. For a specific categorical (dictionary-encoded) column, some parquet files use int8 dictionary indices and some use int16. In pyarrow 19.0.1, reading the directory as a dataset succeeds. With pyarrow 20, it fails with the error below when loading data from the dataset directory.
   
   # Reader code (python)
   
   Either
   
   ```python
   import pandas as pd

   # `path` is the directory containing the parquet files
   df = pd.read_parquet(
       path,
       engine="pyarrow",
   )
   ```
   
   or
   
   ```python
   import pyarrow.dataset as ds

   # Alias the module as `ds` so the dataset object does not shadow it
   dataset = ds.dataset(path, format="parquet")
   table = dataset.to_table()
   df = table.to_pandas()
   ```
   
   # Traceback
   
     ...
     File "/app/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1475, in read
       table = self._dataset.to_table(
               ^^^^^^^^^^^^^^^^^^^^^^^
     File "pyarrow/_dataset.pyx", line 589, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 3941, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Integer value 731 not in range: -128 to 127
   
   # Parquet Metadata
   
   This shows the dictionary info of the parquet files in the directory:
   
   >>> import pyarrow.dataset as ds
   >>> import pyarrow.parquet as pq
   >>> dataset = ds.dataset(path, format="parquet")
   >>> for file_path in dataset.files:
   ...     sch = pq.read_schema(file_path)
   ...     print(file_path, sch.field('ExpStartDate').type)
   ... 
   dataframes.parq/00eac90ef2f504223a74498405e060a48.parquet dictionary<values=string, indices=int8, ordered=0>
   dataframes.parq/0641c30f725cd448bafc335d36cd01f6b.parquet dictionary<values=string, indices=int16, ordered=0>
   dataframes.parq/0cb2799478dd54c738efe76fdc1875326.parquet dictionary<values=string, indices=int8, ordered=0>
   dataframes.parq/0cff477be69be4ee093d98728d4f84452.parquet dictionary<values=string, indices=int16, ordered=0>
   dataframes.parq/0d103de6323904e93aecf24589c12a370.parquet dictionary<values=string, indices=int8, ordered=0>
   
   Is my issue related to the change in #30302? Is there a way to restore the previous behaviour of upcasting to int32 on read? If not, what is the preferred workaround? Forcing all my writes to use int32 would be quite tedious, especially when migrating huge volumes of historical data. For now we remain on pyarrow 19.0.1, but at some point we would like to upgrade.
   
   ### Component(s)
   
   C++, Parquet, Python


