xndai opened a new issue, #16341:
URL: https://github.com/apache/iceberg/issues/16341
### Apache Iceberg version
1.10.1 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
After promoting an integer column to long via schema evolution, reading
Parquet files that have an INT(32, true) logical type annotation with the
vectorized reader throws:
```
java.lang.ClassCastException: class org.apache.iceberg.shaded.org.apache.arrow.vector.BigIntVector cannot be cast to class org.apache.iceberg.shaded.org.apache.arrow.vector.IntVector
	at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader$LogicalTypeVisitor.visit(VectorizedArrowReader.java:592)
	at org.apache.parquet.schema.LogicalTypeAnnotation$IntLogicalTypeAnnotation.accept(LogicalTypeAnnotation.java:812)
	at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateVectorBasedOnLogicalType(VectorizedArrowReader.java:287)
	at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateFieldVector(VectorizedArrowReader.java:239)
	at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:153)
```
Detailed repro steps:
```
tables = new HadoopTables();
Schema schema =
    new Schema(Types.NestedField.required(1, "col", Types.IntegerType.get()));
Table table = tables.create(schema, tempDir.toURI() + "/int-promotion-logical");

// Write a Parquet file with INT(32, signed) logical type annotation.
// This is what non-Iceberg writers (PyArrow, Spark native, etc.) typically produce.
MessageType parquetSchema =
    new MessageType(
        "test",
        primitive(PrimitiveType.PrimitiveTypeName.INT32, Type.Repetition.REQUIRED)
            .as(LogicalTypeAnnotation.intType(32, true))
            .id(1)
            .named("col"));
...
// Promote the column type from int to long (simulates ALTER TABLE)
table.updateSchema().updateColumn("col", Types.LongType.get()).commit();
...
// Read with the vectorized reader
int totalRows = 0;
int rowIndex = 0;
try (VectorizedTableScanIterable vectorizedReader =
    new VectorizedTableScanIterable(table.newScan(), 1024, false)) {
  for (ColumnarBatch batch : vectorizedReader) { // exception thrown here
    ...
  }
}
...
```
Root cause:
In `VectorizedArrowReader.allocateFieldVector()`, the vector is created from
the Iceberg schema type, which is long after schema evolution, so a
`BigIntVector` is allocated. The `LogicalTypeVisitor` then casts that vector
based on the Parquet file's logical type, INT(32, true), and expects an
`IntVector`. This mismatch causes the `ClassCastException`.
To fix this, we would need to create the `FieldVector` based on the actual
width of the data in the Parquet file. The accessor then handles widening to
long when the engine calls `getLong()`.
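The idea behind the proposed fix can be illustrated with a small, dependency-free sketch. It is not Iceberg or Arrow code: `Int32Column`, `WideningLongAccessor`, and `readAsLongs` are hypothetical stand-ins invented for this example. The point it shows is only that storage can stay 32-bit (matching the file) while the accessor widens each value on read.

```java
public class IntPromotionSketch {

  /** Stand-in for a 32-bit vector such as Arrow's IntVector (not the real class). */
  static final class Int32Column {
    private final int[] values;

    Int32Column(int[] values) {
      this.values = values.clone();
    }

    int getInt(int row) {
      return values[row];
    }

    int size() {
      return values.length;
    }
  }

  /** Hypothetical accessor: storage stays 32-bit, reads are widened to long. */
  static final class WideningLongAccessor {
    private final Int32Column column;

    WideningLongAccessor(Int32Column column) {
      this.column = column;
    }

    long getLong(int row) {
      // Implicit int -> long conversion is sign-preserving in Java.
      return column.getInt(row);
    }
  }

  /** Reads every stored 32-bit value through the widening accessor. */
  static long[] readAsLongs(int[] stored) {
    Int32Column column = new Int32Column(stored);
    WideningLongAccessor accessor = new WideningLongAccessor(column);
    long[] out = new long[column.size()];
    for (int row = 0; row < out.length; row++) {
      out[row] = accessor.getLong(row);
    }
    return out;
  }

  public static void main(String[] args) {
    long[] widened = readAsLongs(new int[] {1, -2, Integer.MAX_VALUE});
    for (long value : widened) {
      System.out.println(value);
    }
  }
}
```

With this shape, the allocation step never has to agree with the promoted table type; only the accessor needs to know that the engine asked for a long.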
### Willingness to contribute
- [x] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time