nssalian opened a new pull request, #16292:
URL: https://github.com/apache/iceberg/pull/16292

   Follow-up to https://github.com/apache/iceberg/pull/16087: fixes vectorized
   read support for variant so that the temporary patches can be removed.
   
   ## Rationale for this Change
   
   Variant columns currently force the entire table into row-at-a-time reads 
because the vectorized reader doesn't handle them. This PR fixes that by 
reading variant's metadata and value children as Arrow VarBinary batches.
                                                                                
                                 
   ## What changes are included in this PR?

   - `VectorizedReaderBuilder` - adds `variantVisitor()` that creates a
     `VectorizedVariantVisitor` scoped to each variant column's Parquet path
   - `VectorizedVariantVisitor` - walks variant's internal structure, creates
     Arrow readers for metadata + value leaves
   - `VectorizedArrowReader.VectorizedVariantReader` - composes two child
     readers, delegates `read`/`setRowGroupInfo`/`setBatchSize`/`close`
   - `VectorHolder.VariantVectorHolder` - carries both child holders through
     the batch pipeline
   - `VariantColumnVector` (new) - Spark `ColumnVector` implementing
     `getChild(0)` = value, `getChild(1)` = metadata per Spark's `getVariant()`
     contract
   - `ColumnVectorBuilder` - dispatches `VariantVectorHolder` before
     `isDummy()` check
   - `SparkBatch` - allows variant through the batch reads check
   - Tests - removed `assumeThat(vectorized).isFalse()` guards; all variant
     read tests now run with vectorization enabled
   - Both Spark 4.0 and 4.1
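
To make the composition described above concrete, here is a minimal, self-contained sketch of a reader that wraps a metadata child and a value child and delegates to both. It is not the PR's code: `LeafReader`, `InMemoryLeafReader`, `VariantBatch`, and `VariantReader` are hypothetical stand-ins that use plain Java lists instead of Arrow VarBinary vectors, and `setRowGroupInfo` is left out for brevity. Only the child ordering (0 = value, 1 = metadata) is taken from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-ins only; none of these are Iceberg, Arrow, or Spark types.
final class VariantReaderSketch {

  /** Hypothetical contract for a reader over one binary leaf column. */
  interface LeafReader {
    void setBatchSize(int batchSize);
    List<byte[]> read(int numRows);
    void close();
  }

  /** Serves one leaf column from an in-memory list. */
  static final class InMemoryLeafReader implements LeafReader {
    private final List<byte[]> column;
    private int position = 0;
    private int batchSize = Integer.MAX_VALUE;
    boolean closed = false;

    InMemoryLeafReader(List<byte[]> column) {
      this.column = column;
    }

    @Override
    public void setBatchSize(int batchSize) {
      this.batchSize = batchSize;
    }

    @Override
    public List<byte[]> read(int numRows) {
      // Return at most min(numRows, batchSize) of the remaining rows.
      int remaining = column.size() - position;
      int take = Math.min(remaining, Math.min(numRows, batchSize));
      List<byte[]> batch = new ArrayList<>(column.subList(position, position + take));
      position += take;
      return batch;
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  /** Carries both child batches together; ordinal 0 = value, 1 = metadata. */
  static final class VariantBatch {
    private final List<byte[]> metadata;
    private final List<byte[]> value;

    VariantBatch(List<byte[]> metadata, List<byte[]> value) {
      this.metadata = metadata;
      this.value = value;
    }

    List<byte[]> getChild(int ordinal) {
      switch (ordinal) {
        case 0: return value;
        case 1: return metadata;
        default: throw new IllegalArgumentException("Invalid child ordinal: " + ordinal);
      }
    }

    int numRows() {
      return value.size();
    }
  }

  /** Composes the two child readers and forwards every call to both. */
  static final class VariantReader implements AutoCloseable {
    private final LeafReader metadataReader;
    private final LeafReader valueReader;

    VariantReader(LeafReader metadataReader, LeafReader valueReader) {
      this.metadataReader = metadataReader;
      this.valueReader = valueReader;
    }

    void setBatchSize(int batchSize) {
      metadataReader.setBatchSize(batchSize);
      valueReader.setBatchSize(batchSize);
    }

    VariantBatch read(int numRows) {
      return new VariantBatch(metadataReader.read(numRows), valueReader.read(numRows));
    }

    @Override
    public void close() {
      metadataReader.close();
      valueReader.close();
    }
  }

  public static void main(String[] args) {
    LeafReader metadata = new InMemoryLeafReader(Arrays.asList(new byte[] {1}, new byte[] {1}));
    LeafReader value = new InMemoryLeafReader(Arrays.asList(new byte[] {12}, new byte[] {13}));
    try (VariantReader reader = new VariantReader(metadata, value)) {
      reader.setBatchSize(2);
      VariantBatch batch = reader.read(2);
      System.out.println("rows in batch: " + batch.numRows());
    }
  }
}
```

Keeping the two children behind one reader and one holder means the rest of the batch pipeline can treat a variant column as a single unit, while the consumer picks out value and metadata by ordinal.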
                                                                        
                                                                                
                                                                        
   ## Not covered (follow-up)

   - Shredded variant fields are not read in vectorized mode.
   - Variant inside structs/lists/maps still falls back to row-at-a-time
     (pre-existing limitation for all complex types).
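
The fallback rule above can be pictured as a scan-wide eligibility check: a single unsupported column sends the whole read down the row-at-a-time path. The sketch below is an assumption-laden illustration, not the actual check in `SparkBatch`; the `Kind` enum, `Column` class, and `supportsBatchReads` method are hypothetical names. It also ignores shredding, since the description does not say how shredded variants are detected.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a projected schema; not Iceberg's type system.
final class BatchReadCheckSketch {

  enum Kind { PRIMITIVE, VARIANT, STRUCT, LIST, MAP }

  static final class Column {
    final String name;
    final Kind kind;

    Column(String name, Kind kind) {
      this.name = name;
      this.kind = kind;
    }
  }

  /**
   * Returns true only if every top-level column is a primitive or a variant.
   * Any struct, list, or map (including one that contains a variant) makes
   * the whole scan fall back to row-at-a-time reads.
   */
  static boolean supportsBatchReads(List<Column> topLevelColumns) {
    for (Column column : topLevelColumns) {
      if (column.kind != Kind.PRIMITIVE && column.kind != Kind.VARIANT) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<Column> flat = Arrays.asList(
        new Column("id", Kind.PRIMITIVE), new Column("payload", Kind.VARIANT));
    List<Column> nested = Arrays.asList(
        new Column("id", Kind.PRIMITIVE), new Column("events", Kind.LIST));
    System.out.println("flat schema vectorized: " + supportsBatchReads(flat));
    System.out.println("nested schema vectorized: " + supportsBatchReads(nested));
  }
}
```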
                                                                                
                                                                        
   ## Are these changes tested?

   - `TestSparkVariantRead` (v4.0 + v4.1) - all tests now run with both
     `vectorized=false` and `vectorized=true`

   ## Are there any user-facing changes?
   - After this change, vectorized reads apply to variant columns when
     vectorization is enabled.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

