phillipleblanc opened a new issue, #1307: URL: https://github.com/apache/iceberg-rust/issues/1307
### Apache Iceberg Rust version

0.4.0 (latest version)

### Describe the bug

The error shows up as:

```bash
Unexpected => Failed to read a Parquet file, source: External: Invalid argument error: Invalid comparison operation: LargeUtf8 == Utf8
```

Specifically `Invalid comparison operation: LargeUtf8 == Utf8`

The error occurs in a low-level Arrow compute kernel that is designed to efficiently compare two arrays. It expects both arrays to have exactly the same physical type, and LargeUtf8 and Utf8, despite being the same logical type, are different physical types.

The comparison happens because we're pushing down a predicate like `WHERE audit_log_type = 'sql_query'` into the Iceberg table, which uses the predicate to construct a row filter that is passed to the Parquet reader so it can skip rows that are filtered out.

The `lhs` operand (i.e. the `LargeUtf8`) is the Arrow type of the column (`audit_log_type`) that is read from Parquet. The Parquet spec and the Arrow spec do not share the same types, so a mapping step is needed to convert a Parquet-native type into an Arrow-native type. (This [issue](https://github.com/apache/arrow-rs/issues/1666) explains in more detail how the mapping works, and some issues that arise from mapping incorrectly.)

There is no "LargeUtf8" type in Parquet, and indeed if we look at the code that maps Parquet types into Arrow types, it maps the Parquet string type to Utf8, not LargeUtf8: [arrow-rs/parquet/src/arrow/schema/primitive.rs at main · apache/arrow-rs](https://github.com/apache/arrow-rs/blob/main/parquet/src/arrow/schema/primitive.rs#L272)

`(Some(LogicalType::String), _) => Ok(DataType::Utf8)`

So why are we getting LargeUtf8?
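Before answering that, the strictness of the comparison can be made concrete with a toy model. This is only a sketch under stated assumptions: `DataType`, `StringColumn`, and `eq_scalar` are hypothetical stand-ins written for this illustration and are not the arrow-rs API. The model reproduces just the behaviour described above, namely that the kernel rejects operands whose physical types differ even when the logical values are the same.

```rust
/// Toy stand-ins for the two Arrow string types (hypothetical, not the arrow-rs API).
/// In real Arrow, Utf8 uses 32-bit offsets and LargeUtf8 uses 64-bit offsets.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Utf8,
    LargeUtf8,
}

/// A toy string column that only records its physical type tag and its values.
#[derive(Debug, Clone, PartialEq)]
struct StringColumn {
    data_type: DataType,
    values: Vec<String>,
}

impl StringColumn {
    fn new(data_type: DataType, values: &[&str]) -> Self {
        let values = values.iter().map(|v| v.to_string()).collect();
        StringColumn { data_type, values }
    }
}

/// Strict equality "kernel": compares every element of `column` against the
/// first element of `literal`, but only if both physical types match exactly.
fn eq_scalar(column: &StringColumn, literal: &StringColumn) -> Result<Vec<bool>, String> {
    if column.data_type != literal.data_type {
        return Err(format!(
            "Invalid comparison operation: {:?} == {:?}",
            column.data_type, literal.data_type
        ));
    }
    let needle = literal
        .values
        .first()
        .ok_or_else(|| "empty literal".to_string())?;
    Ok(column.values.iter().map(|v| v == needle).collect())
}

fn main() {
    // lhs: the column as read from Parquet; rhs: the literal from the predicate.
    let column = StringColumn::new(DataType::LargeUtf8, &["sql_query", "login"]);
    let literal = StringColumn::new(DataType::Utf8, &["sql_query"]);
    match eq_scalar(&column, &literal) {
        Ok(mask) => println!("mask: {:?}", mask),
        Err(e) => println!("error: {}", e),
    }
}
```

In this model the two operands hold the same logical string, yet the comparison is refused purely because of the type tags, which mirrors the failure reported here.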
To reproduce Arrow types more faithfully when reading Parquet, there is a [standard metadata field](https://github.com/apache/arrow-rs/blob/main/parquet/src/arrow/mod.rs#L182) called `ARROW:schema` that Parquet writers can populate, which records the Arrow type that was used when the file was written. Parquet readers can use this hint to override the base Arrow type that they infer. That is what is happening in this case, and we can see the override logic here: [arrow-rs/parquet/src/arrow/schema/primitive.rs at main · apache/arrow-rs](https://github.com/apache/arrow-rs/blob/main/parquet/src/arrow/schema/primitive.rs#L40)

For the `rhs` operand of the predicate, why is it Utf8? That is set by this crate, and it's a plain mapping from the Iceberg String type to the Arrow Utf8 type (here: [iceberg-rust/crates/iceberg/src/arrow/schema.rs at main · apache/iceberg-rust](https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/arrow/schema.rs#L671))

The issue, then, is that we're trying to compare a LargeUtf8 column that we read from Parquet against a Utf8 literal that we computed from the predicate. The Arrow compute kernel needs these types to match exactly.

There are two potential fixes. One is to set an option on the Parquet reader to disable the type hint from the `ARROW:schema` metadata field and just use the base types. I don't think that is the correct fix.

The second fix is to check the type during predicate evaluation and, if the type from the Parquet reader differs from the literal's type, convert the literal to the Parquet reader's type using the Arrow cast kernel first. I have implemented this fix.

### To Reproduce

See test in linked PR

### Expected behavior

The predicate is evaluated correctly.

### Willingness to contribute

I can contribute a fix for this bug independently
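For illustration, the second fix described above can be sketched with a toy model. Everything here is an assumption made for the example: `DataType`, `StringColumn`, `resolve_type`, `cast`, `eq_scalar`, and `eq_with_literal_cast` are hypothetical names, not the arrow-rs or iceberg-rust API, and the real change in the linked PR may look different. The sketch shows the metadata hint overriding the base type, and the literal being cast to the column's type before the strict comparison runs.

```rust
/// Toy stand-ins for the two Arrow string types (hypothetical, not the arrow-rs API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Utf8,
    LargeUtf8,
}

#[derive(Debug, Clone, PartialEq)]
struct StringColumn {
    data_type: DataType,
    values: Vec<String>,
}

impl StringColumn {
    fn new(data_type: DataType, values: &[&str]) -> Self {
        let values = values.iter().map(|v| v.to_string()).collect();
        StringColumn { data_type, values }
    }
}

/// Simplified model of the `ARROW:schema` hint: if the writer recorded an Arrow
/// type, it replaces the base type inferred from the Parquet type.
fn resolve_type(base: DataType, hint: Option<DataType>) -> DataType {
    hint.unwrap_or(base)
}

/// Toy cast: only the type tag changes. A real cast also rewrites the offsets
/// buffer and can fail when narrowing from 64-bit to 32-bit offsets.
fn cast(input: &StringColumn, to: DataType) -> StringColumn {
    StringColumn { data_type: to, values: input.values.clone() }
}

/// Strict equality "kernel": both operands must have the same physical type.
fn eq_scalar(column: &StringColumn, literal: &StringColumn) -> Result<Vec<bool>, String> {
    if column.data_type != literal.data_type {
        return Err(format!(
            "Invalid comparison operation: {:?} == {:?}",
            column.data_type, literal.data_type
        ));
    }
    let needle = literal
        .values
        .first()
        .ok_or_else(|| "empty literal".to_string())?;
    Ok(column.values.iter().map(|v| v == needle).collect())
}

/// The fix in miniature: align the literal with the column's type, then compare.
fn eq_with_literal_cast(
    column: &StringColumn,
    literal: &StringColumn,
) -> Result<Vec<bool>, String> {
    if column.data_type == literal.data_type {
        return eq_scalar(column, literal);
    }
    let aligned = cast(literal, column.data_type);
    eq_scalar(column, &aligned)
}

fn main() {
    // Base mapping says Utf8, but the writer's hint says LargeUtf8.
    let column_type = resolve_type(DataType::Utf8, Some(DataType::LargeUtf8));
    let column = StringColumn::new(column_type, &["sql_query", "login"]);
    // The literal built from the predicate is Utf8.
    let literal = StringColumn::new(DataType::Utf8, &["sql_query"]);
    println!("{:?}", eq_with_literal_cast(&column, &literal));
}
```

Casting the literal rather than the column keeps the conversion cost to a single value per predicate, and it leaves the column in whatever type the Parquet reader produced.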