liurenjie1024 commented on code in PR #1821:
URL: https://github.com/apache/iceberg-rust/pull/1821#discussion_r2517717820
##########
crates/iceberg/src/arrow/record_batch_transformer.rs:
##########
@@ -107,11 +142,101 @@ enum SchemaComparison {
     Different,
 }
+/// Builder for RecordBatchTransformer to improve ergonomics when constructing with optional parameters.
+///
+/// See [`RecordBatchTransformer`] for details on partition spec and partition data.
+#[derive(Debug)]
+pub(crate) struct RecordBatchTransformerBuilder {
+    snapshot_schema: Arc<IcebergSchema>,
+    projected_iceberg_field_ids: Vec<i32>,
+    partition_spec: Option<Arc<PartitionSpec>>,
+    partition_data: Option<Struct>,
+}
+
+impl RecordBatchTransformerBuilder {
+    pub(crate) fn new(
+        snapshot_schema: Arc<IcebergSchema>,
+        projected_iceberg_field_ids: &[i32],
+    ) -> Self {
+        Self {
+            snapshot_schema,
+            projected_iceberg_field_ids: projected_iceberg_field_ids.to_vec(),
+            partition_spec: None,
+            partition_data: None,
+        }
+    }
+
+    /// Set partition spec and data together for identifying identity-transformed partition columns.
+    ///
+    /// Both partition_spec and partition_data must be provided together since the spec defines
+    /// which fields are identity-partitioned, and the data provides their constant values.
+    /// One without the other cannot produce a valid constants map.
+    pub(crate) fn with_partition(
+        mut self,
+        partition_spec: Option<Arc<PartitionSpec>>,
+        partition_data: Option<Struct>,
Review Comment:
```suggestion
        partition_spec: Arc<PartitionSpec>,
        partition_data: Struct,
```
Our API requires that they appear together, so we don't need `Option` here.
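To make the suggestion concrete, here is a minimal sketch of the builder shape it implies, with stub types standing in for iceberg's real `PartitionSpec` and partition `Struct`: because `with_partition` takes both values directly rather than as `Option`s, callers cannot set one half of the pair without the other.

```rust
use std::sync::Arc;

// Stub types standing in for iceberg's PartitionSpec and partition Struct;
// placeholders for this sketch only.
#[derive(Debug)]
struct PartitionSpec;
#[derive(Debug)]
struct PartitionData;

#[derive(Debug)]
struct TransformerBuilder {
    partition_spec: Option<Arc<PartitionSpec>>,
    partition_data: Option<PartitionData>,
}

impl TransformerBuilder {
    fn new() -> Self {
        Self {
            partition_spec: None,
            partition_data: None,
        }
    }

    // Taking the values directly (not wrapped in Option) makes it
    // impossible to supply one without the other: the invariant is
    // enforced by the signature instead of by documentation.
    fn with_partition(mut self, spec: Arc<PartitionSpec>, data: PartitionData) -> Self {
        self.partition_spec = Some(spec);
        self.partition_data = Some(data);
        self
    }
}

fn main() {
    let builder = TransformerBuilder::new().with_partition(Arc::new(PartitionSpec), PartitionData);
    // Both halves are now guaranteed to be set together.
    assert!(builder.partition_spec.is_some());
    assert!(builder.partition_data.is_some());
    println!("ok");
}
```

The builder can still store the pair internally as `Option`s (they are unset until `with_partition` is called); only the public setter drops the wrapping.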
##########
crates/iceberg/src/arrow/record_batch_transformer.rs:
##########
Review Comment:
I'm thinking about removing this method after we have a builder.
##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -919,6 +954,77 @@ fn build_fallback_field_id_map(parquet_schema: &SchemaDescriptor) -> HashMap<i32
     column_map
 }
+/// Apply name mapping to Arrow schema for Parquet files lacking field IDs.
+///
+/// Assigns Iceberg field IDs based on column names using the name mapping,
+/// enabling correct projection on migrated files (e.g., from Hive/Spark via add_files).
+///
+/// Per Iceberg spec Column Projection rule #2:
+/// "Use schema.name-mapping.default metadata to map field id to columns without field id"
+/// https://iceberg.apache.org/spec/#column-projection
+///
+/// Corresponds to Java's ParquetSchemaUtil.applyNameMapping() and ApplyNameMapping visitor.
+/// The key difference is Java operates on Parquet MessageType, while we operate on Arrow Schema.
+///
+/// # Arguments
+/// * `arrow_schema` - Arrow schema from Parquet file (without field IDs)
+/// * `name_mapping` - Name mapping from table metadata (TableProperties.DEFAULT_NAME_MAPPING)
+///
+/// # Returns
+/// Arrow schema with field IDs assigned based on name mapping
+fn apply_name_mapping_to_arrow_schema(
Review Comment:
It would be better to use an Arrow schema visitor to do this so that we can handle nested types in the future. We don't have to do it now, but please file an issue to track it.
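To illustrate the ask, here is a hedged sketch of the recursive visitor shape, using simplified stand-in types rather than the real `arrow_schema` API: real code would visit `arrow_schema::DataType` variants (Struct, List, Map, ...) and write the `"PARQUET:field_id"` metadata key, and the flat dotted-path map below stands in for the actual nested NameMapping structure.

```rust
use std::collections::HashMap;

// Simplified stand-ins for Arrow's Field/DataType tree; placeholders for
// this sketch only.
#[derive(Debug, Clone)]
enum FieldType {
    Primitive,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone)]
struct Field {
    name: String,
    ty: FieldType,
    field_id: Option<i32>,
}

// Recursively walk the field tree, assigning IDs by full dotted path so
// nested columns (e.g. "address.zip") are covered, not just top-level ones.
fn apply_name_mapping(field: &mut Field, prefix: &str, mapping: &HashMap<String, i32>) {
    let path = if prefix.is_empty() {
        field.name.clone()
    } else {
        format!("{prefix}.{}", field.name)
    };
    if let Some(id) = mapping.get(&path) {
        field.field_id = Some(*id);
    }
    if let FieldType::Struct(children) = &mut field.ty {
        for child in children.iter_mut() {
            apply_name_mapping(child, &path, mapping);
        }
    }
}

fn main() {
    let mut address = Field {
        name: "address".to_string(),
        ty: FieldType::Struct(vec![Field {
            name: "zip".to_string(),
            ty: FieldType::Primitive,
            field_id: None,
        }]),
        field_id: None,
    };
    let mapping = HashMap::from([("address".to_string(), 1), ("address.zip".to_string(), 2)]);
    apply_name_mapping(&mut address, "", &mapping);
    // Both the struct column and its nested child received IDs.
    assert_eq!(address.field_id, Some(1));
    if let FieldType::Struct(children) = &address.ty {
        assert_eq!(children[0].field_id, Some(2));
    }
    println!("ok");
}
```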