[I] NullPointerException when using VectorizedArrowReader to read a null column [iceberg]

via GitHub Mon, 06 May 2024 09:12:23 -0700


slessard opened a new issue, #10275:
URL: https://github.com/apache/iceberg/issues/10275


   ### Apache Iceberg version
   
   1.5.1 (latest release)
   
   ### Query engine
   
   Other
   
   ### Please describe the bug 🐞
   
   I am writing a compatibility layer for Teradata so that it can access 
Iceberg tables stored in AWS S3. I am experiencing what at first glance appears 
to be a bug in Iceberg, but I'd like to get the opinion of the experts here. To 
be clear I am using Apache Iceberg 1.5.1 and Apache Arrow 15.0.0.
   
   The problem is that a NullPointerException is thrown from 
[GenericArrowVectorAccessorFactory.java line 
224](https://github.com/apache/iceberg/blob/cbb853073e681b4075d7c8707610dceecbee3a82/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java#L224).
 The NPE is thrown on line 224 because `vector` is null.
   ```
       throw new UnsupportedOperationException("Unsupported vector: " + vector.getClass());
   ```
   How do I get to this point? Here's the minimal test case:
   
   **Prerequisite:**
   ```
   create table otf920ath (
        a INT NOT NULL,
        b string(10),
        c decimal(12, 3)
   )
   LOCATION 's3://*******************'
   TBLPROPERTIES ('table_type' = 'ICEBERG');
   
   INSERT INTO otf920ath values (1, 'san diego', 1024.025);
   
   ALTER TABLE otf920ath
     ADD COLUMNS (a1 int); 
   ```
   **repro:**
   ```
   select * from otf920ath;
   ```
   The above SQL select statement works in AWS Athena, but fails in my code. My 
code uses an instance of 
`org.apache.iceberg.arrow.vectorized.ArrowReader$VectorizedCombinedScanIterator`.
   
   The cause, as I see it, is that the one row in the table contains only three 
columns' worth of data, but the current table schema defines four columns. 
Because of this difference in schemas, Iceberg creates the following four 
readers, one for each column respectively:
   `VectorizedArrowReader` corresponding to column `a`
   `VectorizedArrowReader` corresponding to column `b`
   `VectorizedArrowReader` corresponding to column `c`
   `VectorizedArrowReader$NullVectorReader` corresponding to column `a1`
   
   Naturally, the `VectorizedArrowReader$NullVectorReader` instance contains a 
`null` value for the vector. This instance is assigned at 
[VectorizedReaderBuilder.java line 
100](https://github.com/apache/iceberg/blob/cbb853073e681b4075d7c8707610dceecbee3a82/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java#L100).
   
   Continuing down the code path Iceberg calls 
`GenericArrowVectorAccessorFactory.getPlainVectorAccessor`. This method checks 
to see whether `vector` is an instance of various *Vector types. Because 
`vector` has a value of `null` it is not an instance of any type. Thus this 
method ends up in its ultimate fallback case and tries to throw an exception:
   ```
   throw new UnsupportedOperationException("Unsupported vector: " + vector.getClass());
   ```
   The problem is that `vector` is `null`, and thus calling `vector.getClass()` 
throws a `NullPointerException`.
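
To make the failure mode concrete, here is a minimal, self-contained sketch of the same pattern. It does not use Iceberg or Arrow classes: `NullVectorMessageSketch`, `unsafeAccessorFor`, and `saferAccessorFor` are hypothetical names, and `Integer`/`String` stand in for the real vector types. It only shows why `null` falls through every `instanceof` check, and one possible way to build the fallback message without dereferencing `null`.

```java
public class NullVectorMessageSketch {

    // Hypothetical stand-in for the instanceof chain described above.
    // Integer and String are placeholders for the real *Vector types.
    static String unsafeAccessorFor(Object vector) {
        if (vector instanceof Integer) {
            return "int accessor";
        } else if (vector instanceof String) {
            return "string accessor";
        }
        // `null instanceof X` is always false, so null reaches this line,
        // and vector.getClass() throws NullPointerException while the
        // message is still being built.
        throw new UnsupportedOperationException("Unsupported vector: " + vector.getClass());
    }

    // Same chain, but the fallback message is built without dereferencing null.
    static String saferAccessorFor(Object vector) {
        if (vector instanceof Integer) {
            return "int accessor";
        } else if (vector instanceof String) {
            return "string accessor";
        }
        String type = (vector == null) ? "null" : vector.getClass().toString();
        throw new UnsupportedOperationException("Unsupported vector: " + type);
    }

    public static void main(String[] args) {
        try {
            unsafeAccessorFor(null);
        } catch (NullPointerException e) {
            System.out.println("unsafe: NullPointerException");
        }
        try {
            saferAccessorFor(null);
        } catch (UnsupportedOperationException e) {
            System.out.println("safer: " + e.getMessage());
        }
    }
}
```

Note that the second variant only replaces the NPE with the intended `UnsupportedOperationException`; it would not by itself make the read of the all-null column succeed.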
   
   The stack trace is:
   ```
   java.lang.NullPointerException
        at org.apache.iceberg.arrow.vectorized.GenericArrowVectorAccessorFactory.getPlainVectorAccessor(GenericArrowVectorAccessorFactory.java:224)
        at org.apache.iceberg.arrow.vectorized.GenericArrowVectorAccessorFactory.getVectorAccessor(GenericArrowVectorAccessorFactory.java:110)
        at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessors.getVectorAccessor(ArrowVectorAccessors.java:54)
        at org.apache.iceberg.arrow.vectorized.ColumnVector.getVectorAccessor(ColumnVector.java:136)
        at org.apache.iceberg.arrow.vectorized.ColumnVector.<init>(ColumnVector.java:56)
        at org.apache.iceberg.arrow.vectorized.ArrowBatchReader.read(ArrowBatchReader.java:54)
        at org.apache.iceberg.arrow.vectorized.ArrowBatchReader.read(ArrowBatchReader.java:29)
        at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:149)
        at org.apache.iceberg.arrow.vectorized.ArrowReader$VectorizedCombinedScanIterator.next(ArrowReader.java:314)
        at org.apache.iceberg.arrow.vectorized.ArrowReader$VectorizedCombinedScanIterator.next(ArrowReader.java:190)
   ```
   
   **So my questions:**
   1. Is it possible that this is a bug in Iceberg?
   2. If so, is the fix simply to handle the `null` value for `vector` when 
building the message for the UnsupportedOperationException?
   3. If not, is there some other code path or method arguments I should be 
using?
   
   
   p.s. I asked this question in the Slack channel but didn't get any traction. 
https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1714676216273989
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

