brokenjacobs opened a new issue, #43574:
URL: https://github.com/apache/arrow/issues/43574
### Describe the bug, including details regarding any error messages, version, and platform.
In pyarrow 17.0.0, when accessing a parquet file with `pyarrow.parquet.read_table`, an incompatible-types exception is thrown:
```
>>> pa.parquet.read_table('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
    dataset = ParquetDataset(
              ^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unable to merge: Field source_id has incompatible types: string vs dictionary<values=int32, indices=int32, ordered=0>
```
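One thing I noticed: the object path contains a hive-style `source_id=9319` segment, while the file itself also stores a `source_id` column (it shows up in the working outputs below). My guess is that `read_table`'s default `partitioning="hive"` discovery turns that path segment into a dictionary-typed partition field that conflicts with the file's string column. If that guess is right, disabling partition discovery should be a workaround (untested sketch):

```python
import pyarrow.parquet as pq

# Untested guess: partitioning defaults to "hive" in read_table, so the
# source_id=9319 path segment may be inferred as a dictionary<int32>
# partition field that conflicts with the string source_id column stored
# in the file. Turning discovery off should avoid the schema merge.
t = pq.read_table(
    'gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet',
    partitioning=None,
)
```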
Accessing the same file via the dataset API works, though:
```
>>> import pyarrow.dataset as ds
>>> df = ds.dataset('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet').to_table().to_pandas()
>>> df
source_id site_id readout_time voltage
0 9319 SJER 2023-01-02 00:00:00+00:00 0.000159
1 9319 SJER 2023-01-02 00:00:01+00:00 0.000159
2 9319 SJER 2023-01-02 00:00:02+00:00 0.000160
3 9319 SJER 2023-01-02 00:00:03+00:00 0.000159
4 9319 SJER 2023-01-02 00:00:04+00:00 0.000157
... ... ... ... ...
86395 9319 SJER 2023-01-02 23:59:55+00:00 0.000049
86396 9319 SJER 2023-01-02 23:59:56+00:00 0.000048
86397 9319 SJER 2023-01-02 23:59:57+00:00 0.000049
86398 9319 SJER 2023-01-02 23:59:58+00:00 0.000048
86399 9319 SJER 2023-01-02 23:59:59+00:00 0.000048
[86400 rows x 4 columns]
>>>
```
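For what it's worth, `ds.dataset` does no partition discovery unless you pass `partitioning=` explicitly, while `pq.read_table` defaults to `partitioning="hive"`. If that is the relevant difference, I'd expect the dataset API to fail the same way once hive discovery is requested (untested sketch):

```python
import pyarrow.dataset as ds

# Untested: with hive discovery turned on, the source_id=9319 path segment
# should be inferred as a partition field, which I'd expect to trigger the
# same "Unable to merge ... incompatible types" error as read_table.
ds.dataset(
    'gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet',
    partitioning='hive',
).to_table()
```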
When I revert to pyarrow 16.1.0, both methods work:
```
>>> t = pa.parquet.read_table('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')
>>> t.to_pandas()
source_id site_id readout_time voltage
0 9319 SJER 2023-01-02 00:00:00+00:00 0.000159
1 9319 SJER 2023-01-02 00:00:01+00:00 0.000159
2 9319 SJER 2023-01-02 00:00:02+00:00 0.000160
3 9319 SJER 2023-01-02 00:00:03+00:00 0.000159
4 9319 SJER 2023-01-02 00:00:04+00:00 0.000157
... ... ... ... ...
86395 9319 SJER 2023-01-02 23:59:55+00:00 0.000049
86396 9319 SJER 2023-01-02 23:59:56+00:00 0.000048
86397 9319 SJER 2023-01-02 23:59:57+00:00 0.000049
86398 9319 SJER 2023-01-02 23:59:58+00:00 0.000048
86399 9319 SJER 2023-01-02 23:59:59+00:00 0.000048
[86400 rows x 4 columns]
```
I've tried using the fs implementation to list the bucket in 17.0.0, and that works fine, so I have no idea what is wrong here:
```
>>> from pyarrow import fs
>>> gcs = fs.GcsFileSystem()
>>> file_list = gcs.get_file_info(fs.FileSelector('***t/v1/li191r/ms=2023-01/source_id=9319/', recursive=False))
>>> file_list
[<FileInfo for '***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-01.parquet': type=FileType.File, size=418556>, <FileInfo for '***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet': type=FileType.File, size=401198>, (and so on) ]
```
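In case it helps triage, here is a minimal local sketch of what I believe my layout looks like, with no GCS involved (the single-row data is made up; only the string column type and the hive-style directory name match my files):

```python
import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq

# Recreate the shape of my layout: a hive-style source_id=9319 directory
# containing a file that *also* stores source_id as a string column.
root = tempfile.mkdtemp()
part_dir = os.path.join(root, 'source_id=9319')
os.makedirs(part_dir)
pq.write_table(
    pa.table({'source_id': ['9319'], 'voltage': [0.000159]}),
    os.path.join(part_dir, 'data.parquet'),
)

# On 17.0.0 I would expect this to raise the same ArrowTypeError;
# on 16.1.0 it should read fine.
pq.read_table(os.path.join(part_dir, 'data.parquet'))
```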
### Component(s)
Python