matteosdocsity opened a new issue, #43908:
URL: https://github.com/apache/arrow/issues/43908
### Describe the bug, including details regarding any error messages, version, and platform.
When Parquet files written with `pyarrow` versions newer than `12.0.1` are
loaded into Google BigQuery, fields containing `Decimal` values (used to
represent BigQuery's `BIGNUMERIC` type) are read back as `NULL`. This issue
does not occur with `pyarrow==12.0.1`.
#### Environment
- **Python Version:** 3.12
- **pyarrow Version:** 13.0.0 and above (issue observed)
- **Google BigQuery Version:** N/A (BigQuery as the consumer of the Parquet
files)
- **Operating System:** [Your OS, e.g., Ubuntu 20.04, macOS 13, etc.]
#### Steps to Reproduce
1. Create a Pandas DataFrame with a column containing lists of `Decimal`
values.
2. Write the DataFrame to a Parquet file using `pyarrow==13.0.0` or later.
3. Load the Parquet file into a Google BigQuery external table (a minimal load sketch follows this list).
4. Query the table in BigQuery.
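For reference, a minimal sketch of steps 3–4 using the `google-cloud-bigquery` client. The project, dataset, and table names are placeholders, and it loads the file into a managed table via a load job rather than defining an external table over it, so it only approximates the original setup:

```python
from google.cloud import bigquery

# Placeholder identifiers, used only for illustration.
client = bigquery.Client(project="my-project")
table_id = "my-project.my_dataset.decimal_repro"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
)

# Load the Parquet file written by the code example below.
with open("/tmp/test.parquet", "rb") as f:
    load_job = client.load_table_from_file(f, table_id, job_config=job_config)
load_job.result()  # Wait for the load job to finish.

# With pyarrow >= 13.0.0 the Decimal fields come back as NULL here.
for row in client.query(f"SELECT sessions_array FROM `{table_id}`").result():
    print(row.sessions_array)
```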
#### Expected Behavior
BigQuery should correctly read the `Decimal` values from the Parquet file
and populate the corresponding fields in the table.
#### Actual Behavior
BigQuery reads the fields corresponding to `Decimal` values as `NULL`.
#### Code Example
```python
import io
from decimal import Decimal

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Sample DataFrame: a column of lists of structs holding Decimal values.
# Fractional parts are limited to 18 digits so the values fit the
# decimal128(38, 18) schema below.
data = {
    'sessions_array': [
        [{'item': Decimal('12345678901234567890.123456789012345678')},
         {'item': Decimal('23456789012345678901.123456789012345678')}],
        [{'item': Decimal('34567890123456789012.123456789012345678')}],
    ]
}
df = pd.DataFrame(data)

# Define the PyArrow schema explicitly.
schema = pa.schema([
    pa.field('sessions_array', pa.list_(pa.struct([
        pa.field('item', pa.decimal128(38, 18))
    ])))
])

# Convert the DataFrame to a PyArrow Table with the defined schema.
table = pa.Table.from_pandas(df, schema=schema)

# Write the Parquet data to an in-memory buffer.
buffer = io.BytesIO()
pq.write_table(table, buffer)
buffer.seek(0)

# Save locally (uploading the buffer to Google Cloud Storage works the same way).
with open('/tmp/test.parquet', 'wb') as f:
    f.write(buffer.getbuffer())
```
#### Notes
This issue does not occur with `pyarrow==12.0.1`.
The problem appears to be related to how pyarrow serializes `Decimal` values
inside nested types in Parquet files, and to how BigQuery interprets the
resulting file schema.
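To narrow down what changed in the serialization, the schema actually written to the file can be dumped and diffed between pyarrow versions (a diagnostic sketch; the path matches the code example above):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile('/tmp/test.parquet')

# Physical and logical types as written to the file. Comparing this output
# between pyarrow 12.0.1 and 13.0.0+ should surface any difference in the
# nested list structure or the decimal annotations.
print(pf.schema)

# The Arrow schema reconstructed from the Parquet metadata.
print(pf.schema_arrow)
```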
#### Workaround
Pinning `pyarrow==12.0.1` (which in turn requires Python < 3.12) avoids the
issue, but this is not ideal: it means staying on an older release that lacks
later features and bug fixes.
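One candidate explanation, offered as a guess rather than a confirmed diagnosis: pyarrow 13.0.0 changed the default of the Parquet writer's `use_compliant_nested_type` option to `True`, which alters how nested list fields are named in the file schema. If that is what BigQuery trips over, writing with the pre-13.0.0 naming might avoid the version pin (`table` is the PyArrow Table from the code example above):

```python
import pyarrow.parquet as pq

# Assumption: restoring the pre-13.0.0 nested-type naming may let BigQuery
# read the Decimal fields again. Not verified as the actual fix.
pq.write_table(table, '/tmp/test_legacy_nested.parquet',
               use_compliant_nested_type=False)
```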
### Component(s)
Python