Re: [PR] [WIP] Add Data Files from Parquet Files [iceberg-python]

via GitHub Thu, 14 Mar 2024 06:33:17 -0700


syun64 commented on code in PR #506:
URL: https://github.com/apache/iceberg-python/pull/506#discussion_r1524886152



##########
pyiceberg/table/__init__.py:
##########
@@ -1147,6 +1150,26 @@ def overwrite(self, df: pa.Table, overwrite_filter: 
BooleanExpression = ALWAYS_T
                     for data_file in data_files:
                         update_snapshot.append_data_file(data_file)
 
+    def add_files(self, file_paths: List[str]) -> None:
+        """
+        Shorthand API for adding files as data files to the table.
+
+        Args:
+            file_paths: The list of full file paths to be added as data files 
to the table
+        """
+        if any(not isinstance(field.transform, IdentityTransform) for field in 
self.metadata.spec().fields):
+            raise NotImplementedError("Cannot add_files to a table with 
Transform Partitions")
+
+        if self.name_mapping() is None:

Review Comment:
   @Fokko Yeah I think you are right!
   
   When field IDs are in the files, and the name_mapping is also present, the 
field_ids take precedence over the name_mapping in schema resolution. So the 
name_mapping here would essentially be meaningless in that case.
   
   I'm on the fence between moving forward with your suggestion (create 
name_mapping if there are no field_ids) or whether we should always assert that 
the parquet files that we want to add have _no field IDs_. And that's because 
the field_ids that we actually use in our Iceberg generated parquet files, is 
the Iceberg Table's internal notion of field IDs. Whenever a new table gets 
created, new field IDs are assigned, and Iceberg keeps track of these field IDs 
internally to ensure that the same field can be treated the same through column 
renaming.
   
   When we add_files, we are introducing files that have been produced by an 
external process to Iceberg, which isn't aware of Iceberg's internal fields 
metadata. In that sense, I feel that allowing files that have field_ids to be 
added could result in unexpected errors for the user that are difficult to 
diagnose.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] [WIP] Add Data Files from Parquet Files [iceberg-python]

Reply via email to