matteosdocsity opened a new issue, #43908:
URL: https://github.com/apache/arrow/issues/43908
### Describe the bug, including details regarding any error messages, version, and platform.
When Parquet files written with `pyarrow` versions newer than `12.0.1` are
loaded into Google BigQuery, fields containing `Decimal` values (used to
represent BigQuery's `BIGNUMERIC` type) are read back as `NULL`. This issue
does not occur with `pyarrow==12.0.1`.
#### Environment
- **Python Version:** 3.12
- **pyarrow Version:** 13.0.0 and above (issue observed)
- **Google BigQuery Version:** N/A (BigQuery as the consumer of the Parquet
files)
- **Operating System:** [Your OS, e.g., Ubuntu 20.04, macOS 13, etc.]
#### Steps to Reproduce
1. Create a Pandas DataFrame with a column containing lists of `Decimal`
values.
2. Write the DataFrame to a Parquet file using `pyarrow==13.0.0` or later.
3. Load the Parquet file into a Google BigQuery external table (a minimal load sketch follows this list).
4. Query the table in BigQuery.
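For reference, a minimal sketch of steps 3–4 using the `google-cloud-bigquery` client. The project, dataset, and table names are placeholders, and it loads the file into a managed table via a load job rather than defining an external table over it, so it only approximates the original setup:

```python
from google.cloud import bigquery

# Placeholder identifiers, used only for illustration.
client = bigquery.Client(project="my-project")
table_id = "my-project.my_dataset.decimal_repro"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
)

# Load the Parquet file written by the code example below.
with open("/tmp/test.parquet", "rb") as f:
    load_job = client.load_table_from_file(f, table_id, job_config=job_config)
load_job.result()  # Wait for the load job to finish.

# With pyarrow >= 13.0.0 the Decimal fields come back as NULL here.
for row in client.query(f"SELECT sessions_array FROM `{table_id}`").result():
    print(row.sessions_array)
```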
#### Expected Behavior
BigQuery should correctly read the `Decimal` values from the Parquet file
and populate the corresponding fields in the table.
#### Actual Behavior
BigQuery reads the fields corresponding to `Decimal` values as `NULL`.
#### Code Example
```python
import io
from decimal import Decimal

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Sample DataFrame: a column of lists of structs holding Decimal values.
# Fractional parts are limited to 18 digits so the values fit the
# decimal128(38, 18) schema below.
data = {
    'sessions_array': [
        [{'item': Decimal('12345678901234567890.123456789012345678')},
         {'item': Decimal('23456789012345678901.123456789012345678')}],
        [{'item': Decimal('34567890123456789012.123456789012345678')}],
    ]
}
df = pd.DataFrame(data)

# Define the PyArrow schema explicitly.
schema = pa.schema([
    pa.field('sessions_array', pa.list_(pa.struct([
        pa.field('item', pa.decimal128(38, 18))
    ])))
])

# Convert the DataFrame to a PyArrow Table with the defined schema.
table = pa.Table.from_pandas(df, schema=schema)

# Write the Parquet data to an in-memory buffer.
buffer = io.BytesIO()
pq.write_table(table, buffer)
buffer.seek(0)

# Save locally (uploading the buffer to Google Cloud Storage works the same way).
with open('/tmp/test.parquet', 'wb') as f:
    f.write(buffer.getbuffer())
```
#### Notes
This issue does not occur with `pyarrow==12.0.1`.
The problem appears to be related to how pyarrow serializes `Decimal` values
inside nested types in Parquet files, and to how BigQuery interprets the
resulting file schema.
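To narrow down what changed in the serialization, the schema actually written to the file can be dumped and diffed between pyarrow versions (a diagnostic sketch; the path matches the code example above):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile('/tmp/test.parquet')

# Physical and logical types as written to the file. Comparing this output
# between pyarrow 12.0.1 and 13.0.0+ should surface any difference in the
# nested list structure or the decimal annotations.
print(pf.schema)

# The Arrow schema reconstructed from the Parquet metadata.
print(pf.schema_arrow)
```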
#### Workaround
Pinning `pyarrow==12.0.1` (which in turn requires Python < 3.12) avoids the
issue, but this is not ideal: it means staying on an older release that lacks
later features and bug fixes.
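One candidate explanation, offered as a guess rather than a confirmed diagnosis: pyarrow 13.0.0 changed the default of the Parquet writer's `use_compliant_nested_type` option to `True`, which alters how nested list fields are named in the file schema. If that is what BigQuery trips over, writing with the pre-13.0.0 naming might avoid the version pin (`table` is the PyArrow Table from the code example above):

```python
import pyarrow.parquet as pq

# Assumption: restoring the pre-13.0.0 nested-type naming may let BigQuery
# read the Decimal fields again. Not verified as the actual fix.
pq.write_table(table, '/tmp/test_legacy_nested.parquet',
               use_compliant_nested_type=False)
```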
### Component(s)
Python