RussellSpitzer commented on PR #13880:
URL: https://github.com/apache/iceberg/pull/13880#issuecomment-3225900958
I figured it out:
In VectorizedArrowReader
```
private void allocateFieldVector(boolean dictionaryEncodedVector) {
  if (dictionaryEncodedVector) {
    allocateDictEncodedVector();
  } else {
    Field arrowField =
        ArrowSchemaUtil.convert(getPhysicalType(columnDescriptor, icebergField));
    if (columnDescriptor.getPrimitiveType().getOriginalType() != null) {
      allocateVectorBasedOnOriginalType(columnDescriptor.getPrimitiveType(), arrowField);
    } else {
      allocateVectorBasedOnTypeName(columnDescriptor.getPrimitiveType(), arrowField);
    }
  }
}
```
This method assumes that all pages will have the same encoding. That is a
big problem if the first page is dictionary encoded and the following ones are
not. The first pass through this function will call
`allocateDictEncodedVector()`, which does this:
```
this.vec = field.createVector(rootAlloc);
((IntVector) vec).allocateNew(batchSize);
```
But what happens if we then read a non-dictionary-encoded page? We will
then go down the other path, `allocateVectorBasedOnOriginalType`,
and hit this:
```
switch (primitive.getOriginalType()) {
  case ENUM:
  case JSON:
  case UTF8:
  case BSON:
    this.vec = arrowField.createVector(rootAlloc);
    // TODO: Possibly use the uncompressed page size info to set the initial capacity
    vec.setInitialCapacity(batchSize * AVERAGE_VARIABLE_WIDTH_RECORD_SIZE);
    vec.allocateNewSafe();
    this.readType = ReadType.VARCHAR;
    this.typeWidth = UNKNOWN_WIDTH;
    break;
```
This creates a new vector for `this.vec`, so we lose our first vector
before it is ever cleared.
This is easy enough to fix: in both of these functions we just need to
clear out `this.vec` if it is already set.
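
To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern. It does not use Arrow or the Iceberg reader. `TrackingAllocator`, `TrackedVector`, and `PageReaderSketch` are hypothetical stand-ins I made up for the allocator, the vector, and the reader, and the `releasePrevious` flag only exists so the leaky and fixed behavior can be compared side by side. The actual change in `VectorizedArrowReader` may look different.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: all nested types are hypothetical stand-ins, not Arrow or Iceberg classes.
final class VectorLeakDemo {

  // Stand-in for an allocator that tracks how many vectors are still open.
  static final class TrackingAllocator {
    private final AtomicInteger open = new AtomicInteger();

    TrackedVector newVector(String kind) {
      open.incrementAndGet();
      return new TrackedVector(kind, this);
    }

    int openVectors() {
      return open.get();
    }
  }

  // Stand-in for a vector that holds memory until it is closed.
  static final class TrackedVector implements AutoCloseable {
    final String kind;
    private final TrackingAllocator owner;
    private boolean closed;

    TrackedVector(String kind, TrackingAllocator owner) {
      this.kind = kind;
      this.owner = owner;
    }

    @Override
    public void close() {
      if (!closed) {
        closed = true;
        owner.open.decrementAndGet();
      }
    }
  }

  // Stand-in for the reader: one vector field that is reassigned per page encoding.
  static final class PageReaderSketch {
    private final TrackingAllocator alloc;
    private final boolean releasePrevious;
    private TrackedVector vec;

    PageReaderSketch(TrackingAllocator alloc, boolean releasePrevious) {
      this.alloc = alloc;
      this.releasePrevious = releasePrevious;
    }

    void allocateFieldVector(boolean dictionaryEncoded) {
      // The proposed fix: release the existing vector before overwriting the field.
      if (releasePrevious && vec != null) {
        vec.close();
        vec = null;
      }
      vec = alloc.newVector(dictionaryEncoded ? "dict-int" : "varchar");
    }

    void close() {
      // Only the vector currently referenced by the field can be released here.
      if (vec != null) {
        vec.close();
        vec = null;
      }
    }
  }

  // Reads a dictionary-encoded page followed by a plain page, then closes the reader.
  static int leakedAfterMixedPages(boolean releasePrevious) {
    TrackingAllocator alloc = new TrackingAllocator();
    PageReaderSketch reader = new PageReaderSketch(alloc, releasePrevious);
    reader.allocateFieldVector(true);
    reader.allocateFieldVector(false);
    reader.close();
    return alloc.openVectors();
  }

  public static void main(String[] args) {
    System.out.println("leaked without release: " + leakedAfterMixedPages(false));
    System.out.println("leaked with release:    " + leakedAfterMixedPages(true));
  }
}
```

In this sketch, the run without the release step ends with one vector still open, because closing the reader only reaches the vector the field currently points at. With the release step, nothing is left open.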