liurenjie1024 commented on code in PR #1821:
URL: https://github.com/apache/iceberg-rust/pull/1821#discussion_r2517717820
##########
crates/iceberg/src/arrow/record_batch_transformer.rs:
##########
@@ -107,11 +142,101 @@ enum SchemaComparison {
     Different,
 }
+/// Builder for RecordBatchTransformer to improve ergonomics when constructing with optional parameters.
+///
+/// See [`RecordBatchTransformer`] for details on partition spec and partition data.
+#[derive(Debug)]
+pub(crate) struct RecordBatchTransformerBuilder {
+    snapshot_schema: Arc<IcebergSchema>,
+    projected_iceberg_field_ids: Vec<i32>,
+    partition_spec: Option<Arc<PartitionSpec>>,
+    partition_data: Option<Struct>,
+}
+
+impl RecordBatchTransformerBuilder {
+    pub(crate) fn new(
+        snapshot_schema: Arc<IcebergSchema>,
+        projected_iceberg_field_ids: &[i32],
+    ) -> Self {
+        Self {
+            snapshot_schema,
+            projected_iceberg_field_ids: projected_iceberg_field_ids.to_vec(),
+            partition_spec: None,
+            partition_data: None,
+        }
+    }
+
+    /// Set partition spec and data together for identifying identity-transformed partition columns.
+    ///
+    /// Both partition_spec and partition_data must be provided together since the spec defines
+    /// which fields are identity-partitioned, and the data provides their constant values.
+    /// One without the other cannot produce a valid constants map.
+    pub(crate) fn with_partition(
+        mut self,
+        partition_spec: Option<Arc<PartitionSpec>>,
+        partition_data: Option<Struct>,
Review Comment:
```suggestion
        partition_spec: Arc<PartitionSpec>,
        partition_data: Struct,
```
Our API requires that they appear together, so we don't need `Option` here.
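To make the suggestion concrete, here is a minimal sketch of the builder shape it implies, with stub types standing in for iceberg's real `PartitionSpec` and partition `Struct`: because `with_partition` takes both values directly rather than as `Option`s, callers cannot set one half of the pair without the other.

```rust
use std::sync::Arc;

// Stub types standing in for iceberg's PartitionSpec and partition Struct;
// placeholders for this sketch only.
#[derive(Debug)]
struct PartitionSpec;
#[derive(Debug)]
struct PartitionData;

#[derive(Debug)]
struct TransformerBuilder {
    partition_spec: Option<Arc<PartitionSpec>>,
    partition_data: Option<PartitionData>,
}

impl TransformerBuilder {
    fn new() -> Self {
        Self {
            partition_spec: None,
            partition_data: None,
        }
    }

    // Taking the values directly (not wrapped in Option) makes it
    // impossible to supply one without the other: the invariant is
    // enforced by the signature instead of by documentation.
    fn with_partition(mut self, spec: Arc<PartitionSpec>, data: PartitionData) -> Self {
        self.partition_spec = Some(spec);
        self.partition_data = Some(data);
        self
    }
}

fn main() {
    let builder = TransformerBuilder::new().with_partition(Arc::new(PartitionSpec), PartitionData);
    // Both halves are now guaranteed to be set together.
    assert!(builder.partition_spec.is_some());
    assert!(builder.partition_data.is_some());
    println!("ok");
}
```

The builder can still store the pair internally as `Option`s (they are unset until `with_partition` is called); only the public setter drops the wrapping.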
##########
crates/iceberg/src/arrow/record_batch_transformer.rs:
##########
Review Comment:
I'm thinking about removing this method after we have a builder.
##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -919,6 +954,77 @@ fn build_fallback_field_id_map(parquet_schema: &SchemaDescriptor) -> HashMap<i32
     column_map
 }
+/// Apply name mapping to Arrow schema for Parquet files lacking field IDs.
+///
+/// Assigns Iceberg field IDs based on column names using the name mapping,
+/// enabling correct projection on migrated files (e.g., from Hive/Spark via add_files).
+///
+/// Per Iceberg spec Column Projection rule #2:
+/// "Use schema.name-mapping.default metadata to map field id to columns without field id"
+/// https://iceberg.apache.org/spec/#column-projection
+///
+/// Corresponds to Java's ParquetSchemaUtil.applyNameMapping() and ApplyNameMapping visitor.
+/// The key difference is Java operates on Parquet MessageType, while we operate on Arrow Schema.
+///
+/// # Arguments
+/// * `arrow_schema` - Arrow schema from Parquet file (without field IDs)
+/// * `name_mapping` - Name mapping from table metadata (TableProperties.DEFAULT_NAME_MAPPING)
+///
+/// # Returns
+/// Arrow schema with field IDs assigned based on name mapping
+fn apply_name_mapping_to_arrow_schema(
Review Comment:
It would be better to use an Arrow schema visitor to do this so that we can handle nested types in the future. We don't have to do it now, but please file an issue to track it.
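To illustrate the ask, here is a hedged sketch of the recursive visitor shape, using simplified stand-in types rather than the real `arrow_schema` API: real code would visit `arrow_schema::DataType` variants (Struct, List, Map, ...) and write the `"PARQUET:field_id"` metadata key, and the flat dotted-path map below stands in for the actual nested NameMapping structure.

```rust
use std::collections::HashMap;

// Simplified stand-ins for Arrow's Field/DataType tree; placeholders for
// this sketch only.
#[derive(Debug, Clone)]
enum FieldType {
    Primitive,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone)]
struct Field {
    name: String,
    ty: FieldType,
    field_id: Option<i32>,
}

// Recursively walk the field tree, assigning IDs by full dotted path so
// nested columns (e.g. "address.zip") are covered, not just top-level ones.
fn apply_name_mapping(field: &mut Field, prefix: &str, mapping: &HashMap<String, i32>) {
    let path = if prefix.is_empty() {
        field.name.clone()
    } else {
        format!("{prefix}.{}", field.name)
    };
    if let Some(id) = mapping.get(&path) {
        field.field_id = Some(*id);
    }
    if let FieldType::Struct(children) = &mut field.ty {
        for child in children.iter_mut() {
            apply_name_mapping(child, &path, mapping);
        }
    }
}

fn main() {
    let mut address = Field {
        name: "address".to_string(),
        ty: FieldType::Struct(vec![Field {
            name: "zip".to_string(),
            ty: FieldType::Primitive,
            field_id: None,
        }]),
        field_id: None,
    };
    let mapping = HashMap::from([("address".to_string(), 1), ("address.zip".to_string(), 2)]);
    apply_name_mapping(&mut address, "", &mapping);
    // Both the struct column and its nested child received IDs.
    assert_eq!(address.field_id, Some(1));
    if let FieldType::Struct(children) = &address.ty {
        assert_eq!(children[0].field_id, Some(2));
    }
    println!("ok");
}
```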