dntjr8096 opened a new issue, #11367: URL: https://github.com/apache/iceberg/issues/11367
### Apache Iceberg version

1.4.3

### Query engine

Impala

### Please describe the bug 🐞

When a Hive-Parquet table written by Impala or Hive is migrated to Iceberg with the Spark command `CALL catalog.system.migrate('hive_table')`, reading the data in Spark SQL fails because of a schema compatibility issue.

1. Some Parquet-producing systems (e.g., Impala, Hive, older versions of Spark SQL) do not differentiate between binary data and strings when writing the Parquet schema. Spark SQL offers a flag (`spark.sql.parquet.binaryAsString`) to interpret binary data as strings to provide compatibility with these systems.
2. If data was written to a Hive-Parquet table (e.g., `hive_table`) using Impala, and the table has two columns (`col1` as string and `col2` as string), the Parquet row groups show null for the logical type:

   ```
   * Table Name: hive_table / parquet row groups info*
   column_name: col1   physical type: binary   logical type: null
   column_name: col2   physical type: binary   logical type: null
   ```
3. In Iceberg, when Spark reads this Parquet data, `GenericArrowVectorAccessorFactory` creates a `DictionaryBinaryAccessor` for these columns. Because `DictionaryBinaryAccessor` does not implement `getUTF8String`, reading the data in Spark fails with an error. For example:

   ```java
   spark.sql("CALL catalog.system.migrate('hive_table')")
   spark.sql("select * from hive_table").limit(10).show()

   java.lang.UnsupportedOperationException: Unsupported type: UTF8String
       at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessor.getUTF8String(ArrowVectorAccessor.java:81)
       at org.apache.iceberg.spark.data.vectorized.IcebergArrowColumnVector.getUTF8String(IcebergArrowColumnVector.java:138)
       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown source)
       ...
   ```

As a workaround, if the `PARQUET_ANNOTATE_STRINGS_UTF8` query option is enabled in Impala (version 2.6 or higher) when the data is written, the logical type is annotated as string, which avoids the issue.
For example:

```
column_name: col1   physical type: binary   logical type: string
column_name: col2   physical type: binary   logical type: string
```

Something like this is needed as a fix:

```
...
} else {
  switch (primitive.getPrimitiveTypeName()) {
    case FIXED_LEN_BYTE_ARRAY:
    case BINARY:
      // Pass a string factory so the accessor can also serve getUTF8String
      return new DictionaryBinaryAccessor<>(
          (IntVector) vector, dictionary, stringFactorySupplier.get());
...
...
private static class DictionaryBinaryAccessor<
        DecimalT, Utf8StringT, ArrayT, ChildVectorT extends AutoCloseable>
    extends ArrowVectorAccessor<DecimalT, Utf8StringT, ArrayT, ChildVectorT> {
  private final IntVector offsetVector;
  private final Dictionary dictionary;
  private final StringFactory<Utf8StringT> stringFactory;
  // Decoded strings, indexed by dictionary id and filled lazily
  private final Utf8StringT[] cache;

  DictionaryBinaryAccessor(
      IntVector vector, Dictionary dictionary, StringFactory<Utf8StringT> stringFactory) {
    super(vector);
    this.offsetVector = vector;
    this.dictionary = dictionary;
    this.stringFactory = stringFactory;
    this.cache = genericArray(stringFactory.getGenericClass(), dictionary.getMaxId() + 1);
  }

  @Override
  public final byte[] getBinary(int rowId) {
    return dictionary.decodeToBinary(offsetVector.get(rowId)).getBytes();
  }

  @Override
  public final Utf8StringT getUTF8String(int rowId) {
    int offset = offsetVector.get(rowId);
    if (cache[offset] == null) {
      // Decode each dictionary entry at most once
      cache[offset] =
          stringFactory.ofByteBuffer(dictionary.decodeToBinary(offset).toByteBuffer());
    }
    return cache[offset];
  }
}
```

### Willingness to contribute

- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [X] I cannot contribute a fix for this bug at this time

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
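The lazy, per-dictionary-entry caching used in the proposed `getUTF8String` above can be illustrated outside of Iceberg. The following is a minimal standalone sketch, not Iceberg code: `CachedDictionaryDecoder`, its fields, and `decodeCount` are hypothetical names, a plain `byte[][]` stands in for the Parquet `Dictionary`, an `int[]` stands in for the Arrow `IntVector`, and `java.lang.String` stands in for `Utf8StringT`.

```java
import java.nio.charset.StandardCharsets;

/** Toy stand-in for a dictionary-encoded binary column (hypothetical, not an Iceberg or Arrow class). */
public class CachedDictionaryDecoder {
  private final byte[][] dictionary; // dictionary id -> raw bytes
  private final int[] ids;           // row id -> dictionary id
  private final String[] cache;      // dictionary id -> decoded string, filled lazily
  private int decodeCount = 0;       // number of byte-to-string conversions performed

  public CachedDictionaryDecoder(byte[][] dictionary, int[] ids) {
    this.dictionary = dictionary;
    this.ids = ids;
    this.cache = new String[dictionary.length];
  }

  /** Returns the raw bytes for a row, as a binary accessor would. */
  public byte[] getBinary(int rowId) {
    return dictionary[ids[rowId]];
  }

  /** Returns the row's value as a string, decoding each dictionary entry at most once. */
  public String getString(int rowId) {
    int id = ids[rowId];
    if (cache[id] == null) {
      cache[id] = new String(dictionary[id], StandardCharsets.UTF_8);
      decodeCount++;
    }
    return cache[id];
  }

  public int decodeCount() {
    return decodeCount;
  }

  public static void main(String[] args) {
    byte[][] dict = {
      "foo".getBytes(StandardCharsets.UTF_8),
      "bar".getBytes(StandardCharsets.UTF_8)
    };
    CachedDictionaryDecoder col = new CachedDictionaryDecoder(dict, new int[] {0, 1, 0, 0});
    for (int row = 0; row < 4; row++) {
      System.out.println(col.getString(row));
    }
    System.out.println("decoded entries: " + col.decodeCount());
  }
}
```

The cache is sized by the number of dictionary entries rather than the number of rows, so repeated values in a column reuse one decoded string. Whether the same trade-off is appropriate inside `GenericArrowVectorAccessorFactory` is for the Iceberg maintainers to judge.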
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org