liurenjie1024 commented on issue #32:
URL: https://github.com/apache/iceberg-rust/issues/32#issuecomment-1697308710
> Regarding the representation of partition values which relates to the use
> of transform. I'm wondering whether the result of a transform should be a
> scalar value. There should only be one partition value for each partition which
> is stored in the datafile. And the partition values are not written to disk.

Yeah, I understand that it's not obvious why we should implement the
transform on Arrow arrays. I will use a simplified partition writer to explain
this:
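```rust
use std::collections::HashMap;

// Simplified pseudocode: `TransformFunction`, `PartitionValue`, `ParquetWriter`
// and the `RecordBatch` accessors used below are placeholders, not a real API.
struct PartitionWriter {
    transform_func: Box<dyn TransformFunction>,
    parquet_writers: HashMap<PartitionValue, ParquetWriter>,
}

impl PartitionWriter {
    fn append(&mut self, record_batch: RecordBatch) -> Result<()> {
        // We assume the partition function applies to the first column.
        // The transform runs once over the whole column, not once per row.
        let partition_values = self
            .transform_func
            .apply(record_batch.get(0))
            .map(|arrow_row| to_iceberg_partition_value(arrow_row))?;

        // Route each row to the writer for its partition value.
        for i in 0..record_batch.row_count() {
            self.parquet_writers
                .get(partition_values.get(i))
                .append(record_batch.rows(i));
        }
        Ok(())
    }
}
```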
With this approach, we can compute partition values in a vectorized way:
the transform is applied to a whole column at once instead of row by row.
For a more detailed implementation, please refer to
https://github.com/icelake-io/icelake/blob/aeba46c62932b482340d34550289a1b3a9f30486/icelake/src/io/task_writer.rs#L175
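
To make the idea concrete without pulling in Arrow, here is a minimal std-only sketch. It is not the icelake implementation: plain slices stand in for Arrow arrays, an in-memory `Vec` stands in for each Parquet writer, and the `Truncate` transform plus the `keys`/`payload` column names are assumptions chosen for illustration only.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a partition transform that takes a whole column.
trait TransformFunction {
    fn apply(&self, column: &[i64]) -> Vec<i64>;
}

/// Truncate transform: rounds each value down to a multiple of `width`.
struct Truncate {
    width: i64,
}

impl TransformFunction for Truncate {
    fn apply(&self, column: &[i64]) -> Vec<i64> {
        // One pass over the column; `rem_euclid` keeps negatives rounding down.
        column.iter().map(|v| v - v.rem_euclid(self.width)).collect()
    }
}

/// Hypothetical writer: one in-memory row buffer per partition value.
struct PartitionWriter {
    transform_func: Box<dyn TransformFunction>,
    writers: HashMap<i64, Vec<(i64, String)>>,
}

impl PartitionWriter {
    fn new(transform_func: Box<dyn TransformFunction>) -> Self {
        Self { transform_func, writers: HashMap::new() }
    }

    /// `keys` is the partition source column; `payload` is a second column
    /// of the same length.
    fn append(&mut self, keys: &[i64], payload: &[String]) {
        assert_eq!(keys.len(), payload.len());
        // Compute all partition values in a single call (the vectorized step).
        let partition_values = self.transform_func.apply(keys);
        // Route each row to the buffer for its partition value.
        for (i, pv) in partition_values.iter().enumerate() {
            self.writers
                .entry(*pv)
                .or_default()
                .push((keys[i], payload[i].clone()));
        }
    }
}

fn main() {
    let mut writer = PartitionWriter::new(Box::new(Truncate { width: 10 }));
    let keys = [1, 12, 7, 25, 19];
    let payload: Vec<String> = keys.iter().map(|k| format!("row-{k}")).collect();
    writer.append(&keys, &payload);

    // Sort partition values so the printed order is stable.
    let mut partitions: Vec<_> = writer.writers.keys().copied().collect();
    partitions.sort();
    for pv in partitions {
        println!("partition {pv}: {:?}", writer.writers[&pv]);
    }
}
```

Note that each partition still ends up with a single scalar partition value (the map key); the column-level transform is only used to decide which writer each row goes to.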
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]