[I] Error Reading Parquet Data Files Written by Hive/Impala after Migration to Iceberg using Spark SQL [iceberg]

via GitHub Mon, 21 Oct 2024 05:46:34 -0700


dntjr8096 opened a new issue, #11367:
URL: https://github.com/apache/iceberg/issues/11367


   ### Apache Iceberg version
   
   1.4.3
   
   ### Query engine
   
   Impala
   
   ### Please describe the bug 🐞
   
   When a Hive-Parquet table written by Impala or Hive is migrated to Iceberg 
with the Spark procedure `CALL catalog.system.migrate('hive_table')`, reading 
the migrated data in Spark SQL fails because of a schema compatibility issue.
   
   1. Some Parquet-producing systems (e.g., Impala, Hive, older versions of 
Spark SQL) do not differentiate between binary data and strings when writing 
the Parquet schema. Spark SQL offers a flag to interpret binary data as strings 
to provide compatibility with these systems.
   
   2. If data was written to a Hive-Parquet table (e.g., hive_table) using 
Impala, and the table had two columns (col1 as string, and col2 as string), the 
Parquet row groups show null for the logical type:
   ```
   * Table Name: hive_table / parquet row groups info*
   
   column_name: col1
   physical type: binary
   logical type: null
   
   column_name: col2
   physical type: binary
   logical type: null
   ```
   
   3. In Iceberg, when Spark reads this Parquet data, 
`GenericArrowVectorAccessorFactory` creates a `DictionaryBinaryAccessor` for 
these columns. Since `DictionaryBinaryAccessor` does not implement 
`getUTF8String`, reading the data in Spark fails with an error.
   For example:
   ```java
   spark.sql("CALL catalog.system.migrate('hive_table')")
   spark.sql("select * from hive_table").limit(10).show()
   
   java.lang.UnsupportedOperationException: Unsupported type: UTF8String
        at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessor.getUTF8String(ArrowVectorAccessor.java:81)
        at org.apache.iceberg.spark.data.vectorized.IcebergArrowColumnVector.getUTF8String(IcebergArrowColumnVector.java:138)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown source)
        ...
   ```
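
The failure mode in step 3 can be modeled with a minimal, self-contained sketch. The classes below (`SketchAccessor`, `SketchBinaryAccessor`, `SketchAccessorDemo`) are hypothetical stand-ins, not the real Iceberg classes. They assume, as the stack trace suggests, that the base accessor throws `UnsupportedOperationException` for any getter a subclass does not override.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified stand-ins for the accessor hierarchy; not the real Iceberg classes.
abstract class SketchAccessor {
  // Assumption: the base class rejects getters by default and subclasses override what they support.
  byte[] getBinary(int rowId) {
    throw new UnsupportedOperationException("Unsupported type: binary");
  }

  String getUTF8String(int rowId) {
    throw new UnsupportedOperationException("Unsupported type: UTF8String");
  }
}

// Models an accessor chosen from the Parquet physical type alone (BINARY, no logical type).
class SketchBinaryAccessor extends SketchAccessor {
  private final byte[][] values;

  SketchBinaryAccessor(byte[][] values) {
    this.values = values;
  }

  @Override
  byte[] getBinary(int rowId) {
    return values[rowId];
  }
  // getUTF8String is not overridden, so string reads fall through to the base class.
}

class SketchAccessorDemo {
  // Reads a row as a string and reports the failure message instead of propagating it.
  static String readAsString(SketchAccessor accessor, int rowId) {
    try {
      return accessor.getUTF8String(rowId);
    } catch (UnsupportedOperationException e) {
      return "failed: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    byte[][] data = {"a".getBytes(StandardCharsets.UTF_8)};
    System.out.println(readAsString(new SketchBinaryAccessor(data), 0));
  }
}
```

In this model the binary read succeeds while the string read fails, which matches the situation where the table schema says `string` but the accessor was picked from the un-annotated `binary` physical type.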
   
   As a workaround, if the `PARQUET_ANNOTATE_STRINGS_UTF8` query option is 
enabled in Impala (version 2.6 or higher) when the files are written, the 
logical type is annotated as string, which avoids the issue. For example:
   ```
   column_name: col1
   physical type: binary
   logical type: string
   
   column_name: col2
   physical type: binary
   logical type: string
   ```
   
   Something like the following is needed as a fix:
   ```java
    ...
       } else {
         switch (primitive.getPrimitiveTypeName()) {
           case FIXED_LEN_BYTE_ARRAY:
           case BINARY:
             // Pass a string factory so the accessor can also serve string reads.
             return new DictionaryBinaryAccessor<>(
                 (IntVector) vector, dictionary, stringFactorySupplier.get());
    ...
    ...
     private static class DictionaryBinaryAccessor<
             DecimalT, Utf8StringT, ArrayT, ChildVectorT extends AutoCloseable>
         extends ArrowVectorAccessor<DecimalT, Utf8StringT, ArrayT, ChildVectorT> {
       private final IntVector offsetVector;
       private final Dictionary dictionary;
       private final StringFactory<Utf8StringT> stringFactory;
       // Decoded strings, indexed by dictionary id and filled lazily.
       private final Utf8StringT[] cache;

       DictionaryBinaryAccessor(
           IntVector vector, Dictionary dictionary, StringFactory<Utf8StringT> stringFactory) {
         super(vector);
         this.offsetVector = vector;
         this.dictionary = dictionary;
         this.stringFactory = stringFactory;
         this.cache = genericArray(stringFactory.getGenericClass(), dictionary.getMaxId() + 1);
       }

       @Override
       public final byte[] getBinary(int rowId) {
         return dictionary.decodeToBinary(offsetVector.get(rowId)).getBytes();
       }

       @Override
       public final Utf8StringT getUTF8String(int rowId) {
         int offset = offsetVector.get(rowId);
         if (cache[offset] == null) {
           cache[offset] =
               stringFactory.ofByteBuffer(dictionary.decodeToBinary(offset).toByteBuffer());
         }
         return cache[offset];
       }
     }
   ```
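
The caching idea in the proposed `getUTF8String` can be illustrated without Arrow or Parquet on the classpath. The sketch below is only a plain-Java analogue: `CachedDictionaryStrings`, its fields, and `decodeCount` are hypothetical names introduced for illustration, with a `byte[][]` standing in for the Parquet dictionary and an `int[]` standing in for the vector of dictionary ids.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a dictionary-encoded binary column; not Iceberg or Parquet API.
class CachedDictionaryStrings {
  private final byte[][] dictionary; // dictionary id -> raw bytes
  private final int[] ids;           // row -> dictionary id
  private final String[] cache;      // dictionary id -> decoded string, filled lazily
  private int decodeCount = 0;       // how many dictionary entries were actually decoded

  CachedDictionaryStrings(byte[][] dictionary, int[] ids) {
    this.dictionary = dictionary;
    this.ids = ids;
    this.cache = new String[dictionary.length];
  }

  // Binary reads return the dictionary bytes unchanged.
  byte[] getBinary(int rowId) {
    return dictionary[ids[rowId]];
  }

  // String reads decode each dictionary entry at most once and reuse the result.
  String getUTF8String(int rowId) {
    int id = ids[rowId];
    if (cache[id] == null) {
      cache[id] = new String(dictionary[id], StandardCharsets.UTF_8);
      decodeCount++;
    }
    return cache[id];
  }

  int decodeCount() {
    return decodeCount;
  }
}
```

With a two-entry dictionary and many rows, every row can be read as a string while only two decodes take place, which is the reason for indexing the cache by dictionary id rather than by row.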
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [X] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


