Re: [PR] Add Refs metadata table [iceberg-python]

via GitHub Sun, 14 Apr 2024 23:31:05 -0700


geruh commented on code in PR #602:
URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1565238973



##########
pyiceberg/table/__init__.py:
##########
@@ -3410,6 +3410,32 @@ def _readable_metrics_struct(bound_type: PrimitiveType) 
-> pa.StructType:
             schema=entries_schema,
         )
 
+    def refs(self) -> "pa.Table":
+        import pyarrow as pa
+
+        ref_schema = pa.schema([
+            pa.field('name', pa.string(), nullable=False),
+            pa.field('type', pa.string(), nullable=False),

Review Comment:
   Hey Fokko, I looked into this for both pandas and pyarrow and it offers a 
few benefits like reduced memory usage and improved performance on sorting and 
filtering. Since we're only dealing with `branch` or `tag`, this is ideal. 
However, this does stray from the Java implementation which uses a [string 
type](https://github.com/apache/iceberg/blob/fb657b413e2bb7f6c5e2c78465173df0426d3527/core/src/main/java/org/apache/iceberg/RefsTable.java#L36).
 But I think we should be fine with this difference, because in both pyarrow 
and pandas, this type seems to be well supported and can be easily converted to 
string if needed. Also, this would be beneficial for huge tables with many 
references. 
   
   In PyArrow, this would be implemented as a dictionary type mapping integers 
to strings. Also, this could be added to the other metadata tables such as the 
operation field in the snapshots table.



##########
pyiceberg/table/__init__.py:
##########
@@ -3410,6 +3410,32 @@ def _readable_metrics_struct(bound_type: PrimitiveType) 
-> pa.StructType:
             schema=entries_schema,
         )
 
+    def refs(self) -> "pa.Table":
+        import pyarrow as pa
+
+        ref_schema = pa.schema([
+            pa.field('name', pa.string(), nullable=False),
+            pa.field('type', pa.string(), nullable=False),

Review Comment:
   Hey Fokko, I looked into this for both pandas and pyarrow and it offers a 
few benefits like reduced memory usage and improved performance on sorting and 
filtering. Since we're only dealing with `branch` or `tag`, this is ideal. 
However, this does stray from the Java implementation which uses a [string 
type](https://github.com/apache/iceberg/blob/fb657b413e2bb7f6c5e2c78465173df0426d3527/core/src/main/java/org/apache/iceberg/RefsTable.java#L36).
 But I think we should be fine with this difference, because in both pyarrow 
and pandas, this type seems to be well supported and can be easily converted to 
string if needed. Also, this would be beneficial for huge tables with many 
references. 
   
   In pyarrow, this would be implemented as a dictionary type mapping integers 
to strings. Also, this could be added to the other metadata tables such as the 
operation field in the snapshots table.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Add Refs metadata table [iceberg-python]

Reply via email to