brokenjacobs opened a new issue, #43574:
URL: https://github.com/apache/arrow/issues/43574
### Describe the bug, including details regarding any error messages, version, and platform.
In pyarrow 17.0.0, when accessing a parquet file with `pyarrow.parquet.read_table`, an incompatible-types exception is thrown:
```
>>> pa.parquet.read_table('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
    dataset = ParquetDataset(
              ^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unable to merge: Field source_id has incompatible types: string vs dictionary<values=int32, indices=int32, ordered=0>
```
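One thing I noticed: the object path contains a hive-style `source_id=9319` segment, while the file itself also stores a `source_id` column (it shows up in the working outputs below). My guess is that `read_table`'s default `partitioning="hive"` discovery turns that path segment into a dictionary-typed partition field that conflicts with the file's string column. If that guess is right, disabling partition discovery should be a workaround (untested sketch):

```python
import pyarrow.parquet as pq

# Untested guess: partitioning defaults to "hive" in read_table, so the
# source_id=9319 path segment may be inferred as a dictionary<int32>
# partition field that conflicts with the string source_id column stored
# in the file. Turning discovery off should avoid the schema merge.
t = pq.read_table(
    'gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet',
    partitioning=None,
)
```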
Accessing the same file via the dataset API works, though:
```
>>> import pyarrow.dataset as ds
>>> df = ds.dataset('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet').to_table().to_pandas()
>>> df
source_id site_id readout_time voltage
0 9319 SJER 2023-01-02 00:00:00+00:00 0.000159
1 9319 SJER 2023-01-02 00:00:01+00:00 0.000159
2 9319 SJER 2023-01-02 00:00:02+00:00 0.000160
3 9319 SJER 2023-01-02 00:00:03+00:00 0.000159
4 9319 SJER 2023-01-02 00:00:04+00:00 0.000157
... ... ... ... ...
86395 9319 SJER 2023-01-02 23:59:55+00:00 0.000049
86396 9319 SJER 2023-01-02 23:59:56+00:00 0.000048
86397 9319 SJER 2023-01-02 23:59:57+00:00 0.000049
86398 9319 SJER 2023-01-02 23:59:58+00:00 0.000048
86399 9319 SJER 2023-01-02 23:59:59+00:00 0.000048
[86400 rows x 4 columns]
>>>
```
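For what it's worth, `ds.dataset` does no partition discovery unless you pass `partitioning=` explicitly, while `pq.read_table` defaults to `partitioning="hive"`. If that is the relevant difference, I'd expect the dataset API to fail the same way once hive discovery is requested (untested sketch):

```python
import pyarrow.dataset as ds

# Untested: with hive discovery turned on, the source_id=9319 path segment
# should be inferred as a partition field, which I'd expect to trigger the
# same "Unable to merge ... incompatible types" error as read_table.
ds.dataset(
    'gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet',
    partitioning='hive',
).to_table()
```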
When I revert to pyarrow 16.1.0, both methods work:
```
>>> t = pa.parquet.read_table('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')
>>> t.to_pandas()
source_id site_id readout_time voltage
0 9319 SJER 2023-01-02 00:00:00+00:00 0.000159
1 9319 SJER 2023-01-02 00:00:01+00:00 0.000159
2 9319 SJER 2023-01-02 00:00:02+00:00 0.000160
3 9319 SJER 2023-01-02 00:00:03+00:00 0.000159
4 9319 SJER 2023-01-02 00:00:04+00:00 0.000157
... ... ... ... ...
86395 9319 SJER 2023-01-02 23:59:55+00:00 0.000049
86396 9319 SJER 2023-01-02 23:59:56+00:00 0.000048
86397 9319 SJER 2023-01-02 23:59:57+00:00 0.000049
86398 9319 SJER 2023-01-02 23:59:58+00:00 0.000048
86399 9319 SJER 2023-01-02 23:59:59+00:00 0.000048
[86400 rows x 4 columns]
```
I've tried using the fs implementation to list the bucket in 17.0.0, and that works fine, so I have no idea what is wrong here:
```
>>> from pyarrow import fs
>>> gcs = fs.GcsFileSystem()
>>> file_list = gcs.get_file_info(fs.FileSelector('***t/v1/li191r/ms=2023-01/source_id=9319/', recursive=False))
>>> file_list
[<FileInfo for '***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-01.parquet': type=FileType.File, size=418556>, <FileInfo for '***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet': type=FileType.File, size=401198>, (and so on) ]
```
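In case it helps triage, here is a minimal local sketch of what I believe my layout looks like, with no GCS involved (the single-row data is made up; only the string column type and the hive-style directory name match my files):

```python
import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq

# Recreate the shape of my layout: a hive-style source_id=9319 directory
# containing a file that *also* stores source_id as a string column.
root = tempfile.mkdtemp()
part_dir = os.path.join(root, 'source_id=9319')
os.makedirs(part_dir)
pq.write_table(
    pa.table({'source_id': ['9319'], 'voltage': [0.000159]}),
    os.path.join(part_dir, 'data.parquet'),
)

# On 17.0.0 I would expect this to raise the same ArrowTypeError;
# on 16.1.0 it should read fine.
pq.read_table(os.path.join(part_dir, 'data.parquet'))
```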
### Component(s)
Python