Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]

via GitHub Mon, 29 Jan 2024 01:34:13 -0800


Fokko commented on code in PR #305:
URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469288671



##########
pyiceberg/io/pyarrow.py:
##########
@@ -761,6 +761,32 @@ def primitive(self, primitive: pa.DataType) -> T:
         """Visit a primitive type."""
 
 
+class PreOrderPyArrowSchemaVisitor(Generic[T], ABC):

Review Comment:
   ```suggestion
   class _PreOrderPyArrowSchemaVisitor(Generic[T], ABC):
   ```



##########
pyiceberg/schema.py:
##########
@@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: 
Union[Schema, IcebergType], next_id:
 class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]):
     """Traverses the schema and assigns monotonically increasing ids."""
 
-    reserved_ids: Dict[int, int]
+    old_id_to_new_id: Dict[int, int]
 
     def __init__(self, next_id_func: Optional[Callable[[], int]] = None) -> 
None:
-        self.reserved_ids = {}
+        self.old_id_to_new_id = {}
         counter = itertools.count(1)
         self.next_id_func = next_id_func if next_id_func is not None else 
lambda: next(counter)
 
-    def _get_and_increment(self) -> int:
-        return self.next_id_func()
+    def _get_and_increment(self, current_id: int) -> int:
+        new_id = self.next_id_func()
+        self.old_id_to_new_id[current_id] = new_id
+        return new_id
 
     def schema(self, schema: Schema, struct_result: Callable[[], StructType]) 
-> Schema:
-        # First we keep the original identifier_field_ids here, we remap 
afterwards
-        fields = struct_result().fields
-        return Schema(*fields, 
identifier_field_ids=[self.reserved_ids[field_id] for field_id in 
schema.identifier_field_ids])
+        return Schema(
+            *struct_result().fields,
+            identifier_field_ids=[self.old_id_to_new_id[field_id] for field_id 
in schema.identifier_field_ids],
+        )
 
     def struct(self, struct: StructType, field_results: List[Callable[[], 
IcebergType]]) -> StructType:
-        # assign IDs for this struct's fields first
-        self.reserved_ids.update({field.field_id: self._get_and_increment() 
for field in struct.fields})
-        return StructType(*[field() for field in field_results])
+        new_ids = [self._get_and_increment(field.field_id) for field in 
struct.fields]
+        new_fields = []
+        for field_id, field, field_type in zip(new_ids, struct.fields, 
field_results):
+            new_fields.append(
+                NestedField(
+                    field_id=field_id,
+                    name=field.name,
+                    field_type=field_type(),
+                    required=field.required,
+                    doc=field.doc,
+                )

Review Comment:
   I think it is cleaner to keep this part in the `field` method, since it 
creates a field, and the `field()` now doesn't return a field, but a type.



##########
pyiceberg/io/pyarrow.py:
##########
@@ -906,6 +932,21 @@ def after_map_value(self, element: pa.Field) -> None:
         self._field_names.pop()
 
 
+class _ConvertToIcebergWithNoIds(_ConvertToIceberg):

Review Comment:
   Style suggestion, feel free to ignore:
   ```suggestion
   class _ConvertToIcebergWithoutIDs(_ConvertToIceberg):
   ```



##########
pyiceberg/io/pyarrow.py:
##########
@@ -906,6 +986,76 @@ def after_map_value(self, element: pa.Field) -> None:
         self._field_names.pop()
 
 
+class 
_ConvertToIcebergWithFreshIds(PreOrderPyArrowSchemaVisitor[Union[IcebergType, 
Schema]]):

Review Comment:
   I also noticed that the Python implementation differs from the Java side. I 
don't know what the historical reason for this is. I couldn't find anything on 
the original PR about why this was done that way 
https://github.com/apache/iceberg/pull/5627. I'm okay with aligning this with 
the Java implementation.
   
   > Overall, this approach sounds reasonable to me if we can find an easy way 
to refactor the _SetFreshIds. @Fokko, I'd appreciate your perspective on 
refactoring _SetFreshIds. Do you see any issues with this approach?
   
   That's an okay approach, as long as the visitor to do this is hidden inside 
the package. We should not expose setting `-1` field IDs to the outside.
   
   What I like about the current implementation is that the visitor can be used 
on its own. Converting a field with all `-1` IDs doesn't provide much value on 
its own.
   
   I would love to get this in with the 0.6.0 release to simplify the creation 
of tables.



##########
pyiceberg/schema.py:
##########
@@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: 
Union[Schema, IcebergType], next_id:
 class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]):
     """Traverses the schema and assigns monotonically increasing ids."""
 
-    reserved_ids: Dict[int, int]
+    old_id_to_new_id: Dict[int, int]

Review Comment:
   I think at some point we just want to align this with Java. But let's do 
that in a separate PR: 
https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/AssignFreshIds.java



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]

Reply via email to