CaptainEureka opened a new issue, #2057:
URL: https://github.com/apache/iceberg-python/issues/2057

   ### Apache Iceberg version
   
   0.9.1 (latest release)
   
   ### Please describe the bug 🐞
   
   When attempting to add Parquet files to an Iceberg table using 
`Table.add_files`, the operation fails if a column defined as `DecimalType` in 
the Iceberg schema is physically stored as `FIXED_LEN_BYTE_ARRAY` in the 
Parquet file, *even if* the decimal's precision would typically map to `INT32` 
or `INT64` according to Iceberg's preferred Parquet mapping.
   
   The Iceberg spec prescribes this mapping on write, and that part is 
correct. However, `Table.add_files` appears to enforce the same mapping on 
files that already exist, which overly restricts the physical Parquet type for 
decimals. I believe this greatly limits the *kinds* of Parquet files that can 
be "added" to an Iceberg table this way.
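
For reference, the write-side mapping and a more lenient read-side check could be sketched as follows. This is only an illustration of the rules in the Iceberg and Parquet specs, not PyIceberg's actual code; the function names `preferred_parquet_type` and `is_valid_decimal_storage` are hypothetical.

```python
def preferred_parquet_type(precision: int) -> str:
    """Physical type the Iceberg spec prescribes when *writing* decimal(P, S)."""
    if precision < 1:
        raise ValueError(f"precision must be positive, got {precision}")
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"


def is_valid_decimal_storage(precision: int, physical_type: str) -> bool:
    """Lenient check: accept any physical type the Parquet format allows
    for a DECIMAL column of this precision (hypothetical helper).

    The byte length of a FIXED_LEN_BYTE_ARRAY is not validated here.
    """
    if physical_type == "INT32":
        return precision <= 9
    if physical_type == "INT64":
        return precision <= 18
    return physical_type in {"FIXED_LEN_BYTE_ARRAY", "BYTE_ARRAY"}


if __name__ == "__main__":
    print(preferred_parquet_type(10))
    print(is_valid_decimal_storage(10, "FIXED_LEN_BYTE_ARRAY"))
```

With a check along these lines, a `Decimal(10, 2)` column stored as `FIXED_LEN_BYTE_ARRAY` would be accepted even though it is not the preferred write-side type.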
   
   **Steps to Reproduce:**
   
   1.  Define an Iceberg table schema with a `DecimalType` column, for example, 
`Decimal(10, 2)`.
       *   Iceberg's preferred Parquet physical type for `Decimal(10, 2)` would 
be `INT64`.
   2.  Create a Parquet file where the corresponding column for this 
`Decimal(10, 2)` is physically stored as `FIXED_LEN_BYTE_ARRAY`. The data 
itself is valid for `Decimal(10, 2)`.
   3.  Attempt to add this Parquet file to the Iceberg table using 
`Table.add_files`.
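
Step 2 can be sketched with pyarrow alone, assuming its default writer settings (decimals are written as `FIXED_LEN_BYTE_ARRAY` unless `store_decimal_as_integer` is enabled). The helper name `write_decimal_parquet` and the column name `amount` are made up for illustration; the catalog and table setup for step 3 is omitted and only hinted at in a comment.

```python
import os
import tempfile
from decimal import Decimal


def write_decimal_parquet(path: str) -> str:
    """Write a decimal(10, 2) column and return its Parquet physical type."""
    # Imported here so the sketch can be loaded without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    amounts = pa.array(
        [Decimal("123.45"), Decimal("-0.01")],
        type=pa.decimal128(10, 2),
    )
    pq.write_table(pa.table({"amount": amounts}), path)

    # Inspect how the column is physically stored in the file.
    return pq.ParquetFile(path).schema.column(0).physical_type


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "decimals.parquet")
        print(write_decimal_parquet(path))
        # Step 3, assuming `tbl` is an existing Iceberg table whose schema
        # has a matching DecimalType(10, 2) column:
        # tbl.add_files(file_paths=[path])
```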
   
   **Behavior:**
   
   The `Table.add_files` operation fails, with the following error:
   
   ```sh
   ValueError: Unexpected physical type FIXED_LEN_BYTE_ARRAY for DecimalType(10, 2) expected INT32
   ```
   
   indicating a mismatch between the expected physical type (e.g., `INT64`) and 
the actual physical type (`FIXED_LEN_BYTE_ARRAY`) found in the Parquet file for 
the decimal column.
   
   **Expected Behavior:**
   
   The `Table.add_files` operation should succeed and correctly read the 
decimal values from the `FIXED_LEN_BYTE_ARRAY` physical storage. The Iceberg 
reader/writer should be lenient about the physical storage format of decimals; 
otherwise, `Table.add_files` should document this limitation.
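
To illustrate why reading such values is well defined: the Parquet format stores a `FIXED_LEN_BYTE_ARRAY` decimal as the big-endian two's-complement unscaled integer, so decoding needs only the column's scale. A minimal sketch using the standard library (the helper name `decode_fixed_decimal` is hypothetical and unrelated to PyIceberg's internals):

```python
from decimal import Decimal


def decode_fixed_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode a big-endian two's-complement unscaled value into a Decimal."""
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(ch) for ch in str(abs(unscaled)))
    # Building from a (sign, digits, exponent) tuple is exact and does not
    # depend on the current decimal context precision.
    return Decimal((sign, digits, -scale))


if __name__ == "__main__":
    # 5 bytes are enough to hold any unscaled value of precision 10.
    raw = (12345).to_bytes(5, byteorder="big", signed=True)
    print(decode_fixed_decimal(raw, scale=2))
```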
   
   **Environment:**
   
   *   Python version: 3.12.9
   *   Parquet library and version: pyarrow 20.0.0
   
   P.S. If this is just user error and I shouldn't be trying to do things this 
way, I'd be happy to hear alternatives.
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
