RussellSpitzer commented on PR #13880:
URL: https://github.com/apache/iceberg/pull/13880#issuecomment-3225900958

   I figured it out:
   
   In VectorizedArrowReader
   
   ```
     private void allocateFieldVector(boolean dictionaryEncodedVector) {
       if (dictionaryEncodedVector) {
         allocateDictEncodedVector();
       } else {
         Field arrowField = ArrowSchemaUtil.convert(getPhysicalType(columnDescriptor, icebergField));
         if (columnDescriptor.getPrimitiveType().getOriginalType() != null) {
           allocateVectorBasedOnOriginalType(columnDescriptor.getPrimitiveType(), arrowField);
         } else {
           allocateVectorBasedOnTypeName(columnDescriptor.getPrimitiveType(), arrowField);
         }
       }
     }
     ```
   
   This makes the assumption that all pages will have the same encoding. That is a big problem if the first page is dictionary encoded and the following ones are not. The first pass through this function will call

   allocateDictEncodedVector()

   which does this:
   
   ```
       this.vec = field.createVector(rootAlloc);
       ((IntVector) vec).allocateNew(batchSize);
   ```

   But what happens if we then read a non-dictionary-encoded page? We will then go down the other path, allocateVectorBasedOnOriginalType, and hit this:

   ```
        switch (primitive.getOriginalType()) {
         case ENUM:
         case JSON:
         case UTF8:
         case BSON:
           this.vec = arrowField.createVector(rootAlloc);
           // TODO: Possibly use the uncompressed page size info to set the initial capacity
           vec.setInitialCapacity(batchSize * AVERAGE_VARIABLE_WIDTH_RECORD_SIZE);
           vec.allocateNewSafe();
           this.readType = ReadType.VARCHAR;
           this.typeWidth = UNKNOWN_WIDTH;
           break;
   ```

   This creates a new vector and assigns it to `this.vec`, so we lose our first vector.

   This is easy enough to fix: in both of these functions we just need to clear out `this.vec` if it is already set.
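
   A minimal, self-contained sketch of the pattern described above, not the actual patch. `FakeVector` and `ReaderSketch` are hypothetical stand-ins for the Arrow vector and for VectorizedArrowReader, and the `releasePrevious` flag exists only to compare the two behaviors side by side. It assumes that "clearing out" the old vector means releasing it before the field is reassigned.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an Arrow vector: it only records whether it was released.
final class FakeVector implements AutoCloseable {
  final String kind;
  boolean closed = false;

  FakeVector(String kind) {
    this.kind = kind;
  }

  @Override
  public void close() {
    closed = true;
  }
}

// Hypothetical stand-in for the reader: one `vec` field that is reallocated
// whenever a page's encoding differs from what the current vector was built for.
final class ReaderSketch {
  final List<FakeVector> allocated = new ArrayList<>();
  private final boolean releasePrevious;
  private FakeVector vec;

  ReaderSketch(boolean releasePrevious) {
    this.releasePrevious = releasePrevious;
  }

  void allocateFieldVector(boolean dictionaryEncodedVector) {
    // The proposed fix: release the previous vector, if any, before overwriting the field.
    if (releasePrevious && vec != null) {
      vec.close();
    }
    vec = new FakeVector(dictionaryEncodedVector ? "int" : "varchar");
    allocated.add(vec);
  }

  // Vectors that are no longer referenced by `vec` but were never released.
  long leakedCount() {
    return allocated.stream().filter(v -> v != vec && !v.closed).count();
  }
}
```

   Allocating for a dictionary-encoded page and then for a plain page leaves one unreleased vector behind when `releasePrevious` is false, and none when it is true.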


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

