Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]

via GitHub Mon, 29 Jan 2024 06:30:30 -0800


syun64 commented on code in PR #305:
URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469674294



##########
pyiceberg/schema.py:
##########
@@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: 
Union[Schema, IcebergType], next_id:
 class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]):
     """Traverses the schema and assigns monotonically increasing ids."""
 
-    reserved_ids: Dict[int, int]
+    old_id_to_new_id: Dict[int, int]
 
     def __init__(self, next_id_func: Optional[Callable[[], int]] = None) -> 
None:
-        self.reserved_ids = {}
+        self.old_id_to_new_id = {}
         counter = itertools.count(1)
         self.next_id_func = next_id_func if next_id_func is not None else 
lambda: next(counter)
 
-    def _get_and_increment(self) -> int:
-        return self.next_id_func()
+    def _get_and_increment(self, current_id: int) -> int:
+        new_id = self.next_id_func()
+        self.old_id_to_new_id[current_id] = new_id
+        return new_id
 
     def schema(self, schema: Schema, struct_result: Callable[[], StructType]) 
-> Schema:
-        # First we keep the original identifier_field_ids here, we remap 
afterwards
-        fields = struct_result().fields
-        return Schema(*fields, 
identifier_field_ids=[self.reserved_ids[field_id] for field_id in 
schema.identifier_field_ids])
+        return Schema(
+            *struct_result().fields,
+            identifier_field_ids=[self.old_id_to_new_id[field_id] for field_id 
in schema.identifier_field_ids],
+        )
 
     def struct(self, struct: StructType, field_results: List[Callable[[], 
IcebergType]]) -> StructType:
-        # assign IDs for this struct's fields first
-        self.reserved_ids.update({field.field_id: self._get_and_increment() 
for field in struct.fields})
-        return StructType(*[field() for field in field_results])
+        new_ids = [self._get_and_increment(field.field_id) for field in 
struct.fields]
+        new_fields = []
+        for field_id, field, field_type in zip(new_ids, struct.fields, 
field_results):
+            new_fields.append(
+                NestedField(
+                    field_id=field_id,
+                    name=field.name,
+                    field_type=field_type(),
+                    required=field.required,
+                    doc=field.doc,
+                )

Review Comment:
   Yes, I realized that too...
   
   I think the refactored behavior actually [mirrors the Java 
code](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/AssignFreshIds.java#L102)
 more closely now in how it takes the Type, and then creates the NestedField 
with the result of the field level Callable within the struct function call in 
order to assign the IDs first to the fields, and then to its children. 
   
   I think preserving this approach allows us to avoid introducing the new 
constraint of requiring the field IDs to be unique, and limits that restriction 
to just the [correctness in the task of refreshing the 
identifier_field_ids](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/TypeUtil.java#L262),
 which requires referring to the field Name by index as well as by name.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]

Reply via email to