harperjiang opened a new issue, #16502:
URL: https://github.com/apache/iceberg/issues/16502
### Apache Iceberg version
main (development)
### Query engine
Spark
### Please describe the bug 🐞
## Issue Summary
When the vectorized Arrow reader is used to read a v3 Iceberg table that has
a `decimal` column carrying an `initialDefault` or `writeDefault`, vector
allocation fails with:
```
java.lang.IllegalArgumentException: Cannot cast default value to FIXED[9]: 12345.6789
    at org.apache.iceberg.types.Types$NestedField.castDefault(Types.java:892)
    at org.apache.iceberg.types.Types$NestedField.<init>(Types.java:881)
    at org.apache.iceberg.types.Types$NestedField$Builder.build(Types.java:850)
    at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.getPhysicalType(VectorizedArrowReader.java:255)
    at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateFieldVector(VectorizedArrowReader.java:228)
    at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:151)
```
The message varies with the underlying Parquet physical encoding:
- `FIXED_LEN_BYTE_ARRAY`-backed decimal → `Cannot cast default value to
fixed[N]: <default>`
The same read path completes without errors when vectorization is disabled:
```
spark.sql.iceberg.vectorization.enabled=false
```
## Repro
1. Create a v3 Iceberg table with a decimal column that has a default value:
```sql
CREATE TABLE local.db.t (
  id INT,
  amount DECIMAL(5, 2) DEFAULT 0.00
) USING iceberg TBLPROPERTIES ('format-version' = '3');
INSERT INTO local.db.t VALUES (1, 1.23), (2, 4.56), (3, 7.89);
```
2. Read with vectorization enabled (the default):
```sql
SET spark.sql.iceberg.vectorization.enabled=true;
SELECT * FROM local.db.t;
```
The query fails with the stack trace above. The failure is reliable only
when the column is not dictionary-encoded. With dictionary encoding,
allocation goes through `allocateDictEncodedVector` and bypasses the buggy
path, so small or highly repetitive data sets may appear to read successfully.
## Root cause
`VectorizedArrowReader#getPhysicalType` rewrites a decimal Iceberg field to
its underlying physical type (`int` / `long` / `fixed[N]`) so the right Arrow
vector class can be allocated:
```java
physicalType = Types.NestedField.from(logicalType).ofType(type).build();
```
`Types.NestedField.Builder.from(field)` copies the field's `initialDefault`
and `writeDefault` onto the builder. `NestedField`'s constructor then calls
`castDefault(literal, type)` against the new physical type — for a decimal
default this delegates to `DecimalLiteral.to(LongType | IntegerType |
FixedType)`, which is undefined and returns `null`, tripping the
`Preconditions.checkArgument` in `castDefault`.
Conceptually, the defaults belong to the logical (decimal) view of the
column and should not flow to the physical representation — the physical type
is an internal detail used only to size the Arrow vector. The non-vectorized
readers (`BaseParquetReaders`, `SparkParquetReaders`, `FlinkParquetReaders`)
all apply defaults at the logical-type layer and are unaffected.
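
To make the mechanism concrete, here is a minimal, self-contained sketch of the pattern described above. It is a toy model, not Iceberg code: `Kind`, `Field`, `copyWithType`, and `copyWithTypeNoDefault` are hypothetical names, and the rule that a decimal default converts only to a decimal type is an assumption that stands in for the `null` returned by the literal conversion. The second copy method shows one possible way to keep the default on the logical field only, which may or may not match what the proposed PR does.

```java
import java.math.BigDecimal;

// Toy model only: Kind and Field are hypothetical stand-ins, not Iceberg classes.
final class DefaultCastSketch {

  enum Kind { DECIMAL, INT, LONG, FIXED }

  static final class Field {
    final String name;
    final Kind type;
    final BigDecimal initialDefault; // null means the field has no default

    Field(String name, Kind type, BigDecimal initialDefault) {
      // Same shape as the reported check: a default must be castable to the field type.
      if (initialDefault != null && castDefault(initialDefault, type) == null) {
        throw new IllegalArgumentException(
            "Cannot cast default value to " + type + ": " + initialDefault);
      }
      this.name = name;
      this.type = type;
      this.initialDefault = initialDefault;
    }
  }

  // Assumption for this model: a decimal default converts only to a decimal type.
  static BigDecimal castDefault(BigDecimal value, Kind target) {
    return target == Kind.DECIMAL ? value : null;
  }

  // Problematic pattern: copy every attribute, then swap in the physical type.
  static Field copyWithType(Field logical, Kind physical) {
    return new Field(logical.name, physical, logical.initialDefault);
  }

  // One possible alternative: build the physical field without the logical default.
  static Field copyWithTypeNoDefault(Field logical, Kind physical) {
    return new Field(logical.name, physical, null);
  }

  public static void main(String[] args) {
    Field amount = new Field("amount", Kind.DECIMAL, new BigDecimal("0.00"));
    try {
      copyWithType(amount, Kind.FIXED);
    } catch (IllegalArgumentException e) {
      System.out.println("copy with default failed: " + e.getMessage());
    }
    Field physical = copyWithTypeNoDefault(amount, Kind.FIXED);
    System.out.println("copy without default: " + physical.name + " as " + physical.type);
  }
}
```

In this model the logical field keeps its default, so a reader that applies defaults at the logical layer still sees it. Only the throwaway physical field used for vector sizing is built without one.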
Proposed PR for the fix: https://github.com/apache/iceberg/pull/16501
### Willingness to contribute
- [x] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time