phillipleblanc opened a new issue, #1307:
URL: https://github.com/apache/iceberg-rust/issues/1307

   ### Apache Iceberg Rust version
   
   0.4.0 (latest version)
   
   ### Describe the bug
   
   The error shows up as:
    
   ```bash
   Unexpected => Failed to read a Parquet file, source: External: Invalid 
argument error: Invalid comparison operation: LargeUtf8 == Utf8
   ```
    
   Specifically `Invalid comparison operation: LargeUtf8 == Utf8`
    
   The error occurs in a low-level Arrow compute kernel that is designed to 
efficiently compare two arrays. It expects both arrays to have exactly the same 
physical type. Although LargeUtf8 and Utf8 are logically the same type (both 
hold UTF-8 strings), they are different physical types.
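
To make the "different physical types" point concrete, here is a minimal std-only sketch. It is a toy model, not the arrow-rs API: `StrColumn` and `build` are hypothetical names. The only fact it relies on is that Arrow's Utf8 layout uses 32-bit offsets while LargeUtf8 uses 64-bit offsets.

```rust
/// Toy string column: one contiguous byte buffer plus offsets that
/// delimit each value. Hypothetical type, not part of arrow-rs.
#[derive(Debug)]
struct StrColumn<O> {
    offsets: Vec<O>,
    data: Vec<u8>,
}

/// Build a column with offset type `O` (i32 models Utf8, i64 models
/// LargeUtf8). Returns None if an offset does not fit in `O`.
fn build<O: TryFrom<usize>>(values: &[&str]) -> Option<StrColumn<O>> {
    let mut offsets = Vec::with_capacity(values.len() + 1);
    let mut data = Vec::new();
    offsets.push(O::try_from(0usize).ok()?);
    for v in values {
        data.extend_from_slice(v.as_bytes());
        // Each offset marks the end of the value just appended.
        offsets.push(O::try_from(data.len()).ok()?);
    }
    Some(StrColumn { offsets, data })
}
```

Calling `build::<i32>` and `build::<i64>` on the same strings yields identical bytes and numerically identical offsets, but the offset buffers have different element widths, which is why a kernel written against one layout cannot read the other without a cast.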
    
   The comparison arises when a predicate like 
`WHERE audit_log_type = 'sql_query'` is pushed down into the Iceberg table scan. 
The predicate is used to construct a row filter that is passed to the Parquet 
reader so it can skip rows that are filtered out.
    
   The `lhs` operand (i.e. the `LargeUtf8`) has the Arrow type of the column 
(`audit_log_type`) that is read from Parquet. The Parquet spec and the Arrow 
spec do not share the same types, so there is a mapping process that needs to 
happen to convert from a Parquet-native type into an Arrow-native type. (This 
[issue](https://github.com/apache/arrow-rs/issues/1666) explains in more detail 
how the mapping works, and some issues that arise from mapping incorrectly.)
    
   There is no "LargeUtf8" type in Parquet, and indeed if we look at the code 
that maps Parquet types into Arrow types, it maps strings to Utf8, not LargeUtf8: 
[arrow-rs/parquet/src/arrow/schema/primitive.rs at main · 
apache/arrow-rs](https://github.com/apache/arrow-rs/blob/main/parquet/src/arrow/schema/primitive.rs#L272)
    
   `(Some(LogicalType::String), _) => Ok(DataType::Utf8)`
    
   So why are we getting LargeUtf8? To reproduce Arrow types more faithfully 
when reading Parquet, there is a [standard metadata 
field](https://github.com/apache/arrow-rs/blob/main/parquet/src/arrow/mod.rs#L182) 
called `ARROW:schema` that Parquet writers can populate to record which Arrow 
type was used. Parquet readers can use it to override the base Arrow type they 
would otherwise infer. That is what is happening in this case, and we can see 
the override logic here: 
[arrow-rs/parquet/src/arrow/schema/primitive.rs at main · 
apache/arrow-rs](https://github.com/apache/arrow-rs/blob/main/parquet/src/arrow/schema/primitive.rs#L40)
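
The two-step resolution described above (infer a base type, then let a compatible hint override it) can be sketched roughly as follows. This is a simplified std-only model, not the arrow-rs code: `Logical`, `Arrow`, `base_type`, `apply_hint`, and `resolve` are hypothetical names, and the real implementation handles many more types.

```rust
/// Tiny subset of Parquet logical types (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Logical { String, Int32 }

/// Tiny subset of Arrow data types (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Arrow { Utf8, LargeUtf8, Int32 }

/// Step 1: the base Arrow type inferred from the Parquet type alone.
fn base_type(logical: Logical) -> Arrow {
    match logical {
        Logical::String => Arrow::Utf8,
        Logical::Int32 => Arrow::Int32,
    }
}

/// Step 2: a hint (modelling the type stored in `ARROW:schema`) overrides
/// the base type only when the two are compatible.
fn apply_hint(base: Arrow, hint: Option<Arrow>) -> Arrow {
    match (base, hint) {
        // Same string data, wider offsets: accept the hint.
        (Arrow::Utf8, Some(Arrow::LargeUtf8)) => Arrow::LargeUtf8,
        // Anything else: keep what was inferred from Parquet.
        _ => base,
    }
}

fn resolve(logical: Logical, hint: Option<Arrow>) -> Arrow {
    apply_hint(base_type(logical), hint)
}
```

In this model, `resolve(Logical::String, None)` gives `Utf8`, while `resolve(Logical::String, Some(Arrow::LargeUtf8))` gives `LargeUtf8`, which mirrors how the column in this issue ends up as LargeUtf8 even though Parquet has no such type.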
    
   For the `rhs` operand of the predicate, why is it Utf8? That is set by this 
crate, and it's a plain mapping from the Iceberg String type to the Arrow Utf8 
type (here: [iceberg-rust/crates/iceberg/src/arrow/schema.rs at main · 
apache/iceberg-rust](https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/arrow/schema.rs#L671))
    
   The issue then is that we're trying to compare a LargeUtf8 that we read from 
Parquet against a Utf8 literal that we computed from the predicate. The Arrow 
compute kernel needs these types to match exactly.
    
   There are two potential fixes. One is to set an option on the Parquet reader 
to disable the type hint from the `ARROW:schema` metadata field and just use 
the base types. I don't think that is the correct fix.
    
   The second fix is to check the type during predicate evaluation: if the 
type from the Parquet reader differs from the literal's type, first cast the 
literal to the Parquet reader's type using the Arrow cast kernel, then compare. 
I have implemented this fix.
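
The shape of the second fix can be illustrated with a small std-only sketch. It is a toy model under stated assumptions, not the code from the linked PR and not the arrow-rs API: `StrType`, `Column`, `Literal`, `strict_eq`, `cast_literal`, and `eq_with_cast` are all hypothetical names. `strict_eq` stands in for the strict compute kernel and `cast_literal` for the Arrow cast kernel.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StrType { Utf8, LargeUtf8 }

/// Column as produced by the (modelled) Parquet reader.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Column { ty: StrType, values: Vec<String> }

/// Literal as produced from the (modelled) Iceberg predicate.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Literal { ty: StrType, value: String }

/// Stand-in for the strict kernel: refuses mismatched types.
fn strict_eq(col: &Column, lit: &Literal) -> Result<Vec<bool>, String> {
    if col.ty != lit.ty {
        return Err(format!(
            "Invalid comparison operation: {:?} == {:?}",
            col.ty, lit.ty
        ));
    }
    Ok(col.values.iter().map(|v| *v == lit.value).collect())
}

/// Stand-in for a cast: same string value, different declared type.
fn cast_literal(lit: &Literal, target: StrType) -> Literal {
    Literal { ty: target, value: lit.value.clone() }
}

/// The fix: align the literal with the column's type, then compare.
fn eq_with_cast(col: &Column, lit: &Literal) -> Result<Vec<bool>, String> {
    if col.ty == lit.ty {
        strict_eq(col, lit)
    } else {
        strict_eq(col, &cast_literal(lit, col.ty))
    }
}
```

Casting the literal rather than the column keeps the work proportional to a single value instead of every row read from Parquet, which is presumably why the literal side is the one converted.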
   
   ### To Reproduce
   
   See test in linked PR
   
   ### Expected behavior
   
   The predicate is evaluated correctly.
   
   ### Willingness to contribute
   
   I can contribute a fix for this bug independently

