brokenjacobs opened a new issue, #43574: URL: https://github.com/apache/arrow/issues/43574
### Describe the bug, including details regarding any error messages, version, and platform.

In pyarrow 17.0.0, accessing a parquet file with `parquet.read_table` throws an incompatible-types exception:

```
>>> pa.parquet.read_table('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
    dataset = ParquetDataset(
              ^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unable to merge: Field source_id has incompatible types: string vs dictionary<values=int32, indices=int32, ordered=0>
```

But accessing the same file via `pyarrow.dataset` works:

```
>>> import pyarrow.dataset as ds
>>> df = ds.dataset('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet').to_table().to_pandas()
>>> df
      source_id site_id              readout_time   voltage
0          9319    SJER 2023-01-02 00:00:00+00:00  0.000159
1          9319    SJER 2023-01-02 00:00:01+00:00  0.000159
2          9319    SJER 2023-01-02 00:00:02+00:00  0.000160
3          9319    SJER 2023-01-02 00:00:03+00:00  0.000159
4          9319    SJER 2023-01-02 00:00:04+00:00  0.000157
...         ...     ...                       ...       ...
86395      9319    SJER 2023-01-02 23:59:55+00:00  0.000049
86396      9319    SJER 2023-01-02 23:59:56+00:00  0.000048
86397      9319    SJER 2023-01-02 23:59:57+00:00  0.000049
86398      9319    SJER 2023-01-02 23:59:58+00:00  0.000048
86399      9319    SJER 2023-01-02 23:59:59+00:00  0.000048

[86400 rows x 4 columns]
```

When I revert to pyarrow 16.1.0, both methods work:

```
>>> t = pa.parquet.read_table('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')
>>> t.to_pandas()
      source_id site_id              readout_time   voltage
0          9319    SJER 2023-01-02 00:00:00+00:00  0.000159
1          9319    SJER 2023-01-02 00:00:01+00:00  0.000159
2          9319    SJER 2023-01-02 00:00:02+00:00  0.000160
3          9319    SJER 2023-01-02 00:00:03+00:00  0.000159
4          9319    SJER 2023-01-02 00:00:04+00:00  0.000157
...         ...     ...                       ...       ...
86395      9319    SJER 2023-01-02 23:59:55+00:00  0.000049
86396      9319    SJER 2023-01-02 23:59:56+00:00  0.000048
86397      9319    SJER 2023-01-02 23:59:57+00:00  0.000049
86398      9319    SJER 2023-01-02 23:59:58+00:00  0.000048
86399      9319    SJER 2023-01-02 23:59:59+00:00  0.000048

[86400 rows x 4 columns]
```
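In case it helps triage: my guess (unconfirmed) is that `read_table` is now applying hive partition discovery to the path, so the `source_id=9319` directory segment gets inferred as a dictionary<int32> partition field that clashes with the string `source_id` column stored inside the file. Below is a minimal local sketch of that scenario, assuming the guess is right; all paths and values here are made up, and no GCS is involved:

```
import pathlib
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical local layout mimicking the bucket: a hive-style
# source_id=9319 directory whose file *also* contains source_id
# as a string column.
root = pathlib.Path(tempfile.mkdtemp())
part_dir = root / 'ms=2023-01' / 'source_id=9319'
part_dir.mkdir(parents=True)

pq.write_table(
    pa.table({'source_id': ['9319'], 'voltage': [0.000159]}),
    str(part_dir / 'li191r_9319_2023-01-02.parquet'),
)

# If the guess is right, this raises the same ArrowTypeError on
# 17.0.0 and reads fine on 16.1.0.
print(pq.read_table(str(part_dir / 'li191r_9319_2023-01-02.parquet')))
```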
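If that diagnosis holds, passing `partitioning=None` to `pyarrow.parquet.read_table` (the keyword is forwarded to the dataset machinery) should skip partition inference entirely and might serve as a stopgap. Again an untested sketch, with the bucket name redacted:

```
import pyarrow.parquet as pq

# partitioning=None disables hive partition discovery, so the
# ms=.../source_id=... path segments are not turned into partition
# fields. Untested against GCS.
t = pq.read_table(
    'gs://<bucket>/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet',
    partitioning=None,
)
```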
I've also tried using the fs implementation to list the bucket in 17.0.0, and that works fine; I have no idea what is wrong here:

```
>>> from pyarrow import fs
>>> gcs = fs.GcsFileSystem()
>>> file_list = gcs.get_file_info(fs.FileSelector('***t/v1/li191r/ms=2023-01/source_id=9319/', recursive=False))
>>> file_list
[<FileInfo for '***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-01.parquet': type=FileType.File, size=418556>,
 <FileInfo for '***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet': type=FileType.File, size=401198>,
 (and so on)]
```

### Component(s)

Python