sl255051 commented on PR #10953: URL: https://github.com/apache/iceberg/pull/10953#issuecomment-2307968000
Thank you, Eduard, for your help. I will be on vacation next week. I will pick this up again when I return on September 3. -Steve Lessard From: Eduard Tudenhoefner ***@***.***> Date: Thursday, August 22, 2024 at 3:46 AM To: apache/iceberg ***@***.***> Cc: Lessard, Steve ***@***.***>, Mention ***@***.***> Subject: [EXTERNAL] Re: [apache/iceberg] DRAFT - Issue 10275 - Reward support for nulls (PR #10953) [CAUTION: External Email] @nastra commented on this pull request. @sl255051<https://github.com/sl255051> I spent some time understanding the problem and going through the code and fixing it myself (without first looking at this PR). I think you're on the right track here to handle the root cause and we should have a NullAccessor that internally tracks a NullVector. I would suggest to squash all the commits to a single one and then go through my comments ________________________________ In arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java<https://github.com/apache/iceberg/pull/10953#discussion_r1726810395>: > @Override public VectorHolder read(VectorHolder reuse, int numValsToRead) { - return VectorHolder.dummyHolder(numValsToRead); + NullVector vector = new NullVector(icebergField().name(), numValsToRead); + return new VectorHolder.NullVectorHolder(vector, icebergField(), numValsToRead); there are places in the codebase that relied on ConstantVectorHolder so I'm not sure whether we should introduce a separate NullVectorHolder. I tried it with the following diff, which should work too: --- a/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java +++ b/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java @@ -19,6 +19,7 @@ package org.apache.iceberg.arrow.vectorized; import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.NullVector; import org.apache.iceberg.MetadataColumns; import org.apache.iceberg.relocated.com.google.common.base.Preconditions; import org.apache.iceberg.types.Type; @@ -138,16 +139,19 @@ public class VectorHolder { public static class ConstantVectorHolder<T> extends VectorHolder { private final T constantValue; private final int numRows; + private final FieldVector vector; public ConstantVectorHolder(int numRows) { this.numRows = numRows; this.constantValue = null; + this.vector = new NullVector("_dummy_", numRows); } public ConstantVectorHolder(Types.NestedField icebergField, int numRows, T constantValue) { super(icebergField); this.numRows = numRows; this.constantValue = constantValue; + this.vector = new NullVector(icebergField.name(), numRows); } @Override @@ -158,6 +162,11 @@ public class VectorHolder { public Object getConstant() { return constantValue; } + + @Override + public FieldVector vector() { + return vector; + } ________________________________ In arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java<https://github.com/apache/iceberg/pull/10953#discussion_r1726812044>: > @@ -131,6 +131,20 @@ public boolean isDummy() { return vector == null; } + public static class NullVectorHolder extends VectorHolder { see my other comment on whether we should actually introduce this class vs modify ConstantVectorHolderto hold a NullVector ________________________________ In arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java<https://github.com/apache/iceberg/pull/10953#discussion_r1726814320>: > @@ -451,10 +452,6 @@ public String toString() { return columnDescriptor.toString(); } - public static VectorizedArrowReader nulls() { I don't think we should remove this method. How about rather doing this: --- a/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java +++ b/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java @@ -455,6 +455,10 @@ public class VectorizedArrowReader implements VectorizedReader<VectorHolder> { return NullVectorReader.INSTANCE; } + public static VectorizedArrowReader nulls(Types.NestedField icebergField) { + return new NullVectorReader(icebergField); + } + public static VectorizedArrowReader positions() { return new PositionVectorReader(false); } @@ -464,11 +468,15 @@ public class VectorizedArrowReader implements VectorizedReader<VectorHolder> { } private static final class NullVectorReader extends VectorizedArrowReader { - private static final NullVectorReader INSTANCE = new NullVectorReader(); + private static final NullVectorReader INSTANCE = new NullVectorReader(null); + + private NullVectorReader(Types.NestedField icebergField) { + super(icebergField); + } @Override public VectorHolder read(VectorHolder reuse, int numValsToRead) { - return VectorHolder.dummyHolder(numValsToRead); + return new VectorHolder.ConstantVectorHolder<>(icebergField(), numValsToRead, null); } ________________________________ In arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java<https://github.com/apache/iceberg/pull/10953#discussion_r1726816658>: > @@ -97,7 +98,7 @@ public VectorizedReader<?> message( } else if (reader != null) { reorderedFields.add(reader); } else { - reorderedFields.add(VectorizedArrowReader.nulls()); based on my other comments, this should most likely call VectorizedArrowReader.nulls(field) ________________________________ In arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java<https://github.com/apache/iceberg/pull/10953#discussion_r1726817036>: > @@ -19,7 +19,7 @@ package org.apache.iceberg.arrow.vectorized; import static org.apache.iceberg.Files.localInput; -import static org.assertj.core.api.Assertions.assertThat; +import static org.assertj.core.api.Assertions.*; we don't support * imports ________________________________ In arrow/src/test/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactoryTest.java<https://github.com/apache/iceberg/pull/10953#discussion_r1726819029>: > +import static org.mockito.Mockito.mock; +import static org.mockito.Mockito.when; + +import java.math.BigDecimal; +import java.util.function.Supplier; +import org.apache.arrow.vector.IntVector; +import org.apache.iceberg.types.Types; +import org.apache.parquet.column.ColumnDescriptor; +import org.apache.parquet.schema.PrimitiveType; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.mockito.InjectMocks; +import org.mockito.Mock; +import org.mockito.MockitoAnnotations; + +class GenericArrowVectorAccessorFactoryTest { I'm not sure it's worth adding this test, unless you want to handle all vectors (which can also be a follow-up PR) — Reply to this email directly, view it on GitHub<https://github.com/apache/iceberg/pull/10953#pullrequestreview-2254160521>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/BJP6DB7V72VBKNBACG2Z3F3ZSW6P5AVCNFSM6AAAAABMUT7GKSVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDENJUGE3DANJSGE>. You are receiving this because you were mentioned.Message ID: ***@***.***> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org