liurenjie1024 commented on issue #32:
URL: https://github.com/apache/iceberg-rust/issues/32#issuecomment-1697308710

   > Regarding the representation of partition values which relates to the use 
of transform. I'm wondering whether the result of a transform should be a 
scalar value. There should only be one partition value for each partition which 
is stored in the datafile. And the partition values are not written to disk.
   
   Yeah, I can see why it's not obvious that the transforms should be 
implemented on Arrow arrays. I will use a simplified partition writer to 
explain this:
   ```rust
   // Simplified pseudocode: the types and helper methods here are illustrative.
   struct PartitionWriter {
     transform_func: Box<dyn TransformFunction>,
     parquet_writers: HashMap<PartitionValue, ParquetWriter>,
   }
   
   impl PartitionWriter {
     fn append(&mut self, record_batch: RecordBatch) -> Result<()> {
       // We assume the partition function applies to the first column.
       // A single call transforms the whole column at once.
       let partition_values = self
         .transform_func
         .apply(record_batch.get(0))
         .map(|arrow_row| to_iceberg_partition_value(arrow_row))?;
   
       // Route each row to the writer that owns its partition value.
       for i in 0..record_batch.row_count() {
         self.parquet_writers
           .get(partition_values.get(i))
           .append(record_batch.rows(i));
       }
       Ok(())
     }
   }
   ```
   
   With this approach, we can compute partition values in a vectorized way: the 
transform is applied once to a whole column of the record batch rather than 
once per row. For a more detailed implementation, please refer to 
https://github.com/icelake-io/icelake/blob/aeba46c62932b482340d34550289a1b3a9f30486/icelake/src/io/task_writer.rs#L175
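
   To make the idea concrete, here is a minimal, std-only sketch of the same 
pattern. It is not the icelake code: plain slices stand in for Arrow arrays, 
and `Truncate`, `BufferWriter` and the two-column `append` signature are 
hypothetical names chosen for illustration. The transform shown is the integer 
truncate rule `v - (v mod W)` with a non-negative remainder.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a transform that processes a whole column at once.
trait TransformFunction {
    fn apply(&self, column: &[i32]) -> Vec<i32>;
}

/// Integer truncate: v - (v mod W), where the remainder is never negative.
struct Truncate {
    width: i32,
}

impl TransformFunction for Truncate {
    fn apply(&self, column: &[i32]) -> Vec<i32> {
        // One pass over the column yields every partition value.
        column.iter().map(|&v| v - v.rem_euclid(self.width)).collect()
    }
}

/// Stand-in for a per-partition file writer; it only buffers rows in memory.
#[derive(Default)]
struct BufferWriter {
    rows: Vec<(i32, String)>,
}

struct PartitionWriter {
    transform_func: Box<dyn TransformFunction>,
    writers: HashMap<i32, BufferWriter>,
}

impl PartitionWriter {
    fn new(transform_func: Box<dyn TransformFunction>) -> Self {
        Self { transform_func, writers: HashMap::new() }
    }

    /// `keys` is the partition source column; `payload` is a second column
    /// of the same length.
    fn append(&mut self, keys: &[i32], payload: &[&str]) {
        assert_eq!(keys.len(), payload.len());
        // Compute all partition values for the batch up front.
        let partition_values = self.transform_func.apply(keys);
        // Route each row to the writer for its partition, creating it on demand.
        for (i, pv) in partition_values.iter().enumerate() {
            self.writers
                .entry(*pv)
                .or_default()
                .rows
                .push((keys[i], payload[i].to_string()));
        }
    }
}

fn main() {
    let mut w = PartitionWriter::new(Box::new(Truncate { width: 10 }));
    w.append(&[1, 12, 15, -3], &["a", "b", "c", "d"]);
    let mut parts: Vec<(i32, usize)> =
        w.writers.iter().map(|(k, v)| (*k, v.rows.len())).collect();
    parts.sort();
    println!("{:?}", parts); // prints [(-10, 1), (0, 1), (10, 2)]
}
```

   Each file still ends up with a single scalar partition value (the map key), 
so this does not conflict with storing one value per data file. Only the 
computation is done per column.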


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

