kevinjqliu commented on code in PR #1443:
URL: https://github.com/apache/iceberg-python/pull/1443#discussion_r1892511992


##########
tests/io/test_pyarrow.py:
##########
@@ -1122,6 +1123,110 @@ def test_projection_concat_files(schema_int: Schema, file_int: str) -> None:
     assert repr(result_table.schema) == "id: int32"
 
 
+def test_projection_partition_inference(tmp_path: str) -> None:

Review Comment:
   nit: I think it would be a good idea to test the high-level table `scan` API instead of `ArrowScan`.
   
   We would need a data file that does not contain the partition field. To achieve that, we can do something like this:
   1. Create a table with no partitioning
   2. Write to the table to produce data files
   3. Add a new column to the table
   4. Make the new column a partition field
   5. Read the table again
   
   After step 4, we have the setup we want.



##########
pyiceberg/io/pyarrow.py:
##########
@@ -1237,16 +1238,29 @@ def _task_to_record_batches(
         # When V3 support is introduced, we will update `downcast_ns_timestamp_to_us` flag based on
         # the table format version.
         file_schema = pyarrow_to_schema(physical_schema, name_mapping, downcast_ns_timestamp_to_us=True)
+
+        if file_schema is None:

Review Comment:
   is there a case where `file_schema` is None? 



##########
pyiceberg/io/pyarrow.py:
##########
@@ -1237,16 +1238,29 @@ def _task_to_record_batches(
         # When V3 support is introduced, we will update `downcast_ns_timestamp_to_us` flag based on
         # the table format version.
         file_schema = pyarrow_to_schema(physical_schema, name_mapping, downcast_ns_timestamp_to_us=True)
+
+        if file_schema is None:
+            raise ValueError(f"Missing Iceberg schema in Metadata for file: {path}")
+
         pyarrow_filter = None
         if bound_row_filter is not AlwaysTrue():
             translated_row_filter = translate_column_names(bound_row_filter, file_schema, case_sensitive=case_sensitive)
             bound_file_filter = bind(file_schema, translated_row_filter, case_sensitive=case_sensitive)
             pyarrow_filter = expression_to_pyarrow(bound_file_filter)
 
+        # Apply column projection rules for missing partitions and default values

Review Comment:
   Since the column-projection rules cover other cases where values are not present, I think it would be a good idea to encapsulate all of this in a new helper function.
   The helper can resolve the missing values according to the spec and return (maybe) a dictionary to add to the resulting record_batches later.
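   A minimal sketch of what such a helper could look like (the function name, parameter names, and shapes are hypothetical, not part of PyIceberg's API):

```python
# Hypothetical helper: resolve constant values for projected fields that are
# missing from a data file, following the spec's column-projection rules:
#   - if the field is the source of an identity partition, use the partition value
#   - otherwise fall back to the field's initial-default (None before V3)
def project_missing_columns(
    projected_ids: set,               # field ids requested by the scan
    file_field_ids: set,              # field ids physically present in the file
    identity_partition_values: dict,  # source field id -> partition value
    initial_defaults: dict,           # field id -> initial-default value
) -> dict:
    """Return {field id: constant value} for fields absent from the file."""
    constants = {}
    for field_id in projected_ids - file_field_ids:
        if field_id in identity_partition_values:
            constants[field_id] = identity_partition_values[field_id]
        else:
            # No partition value: fall back to the default (None pre-V3)
            constants[field_id] = initial_defaults.get(field_id)
    return constants

# Example: field 1 is in the file; field 2 is an identity partition source;
# field 3 has no default, so it projects as null.
constants = project_missing_columns({1, 2, 3}, {1}, {2: "2024-01-01"}, {})
```

   The returned dictionary could then be turned into constant columns appended to each record batch.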
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

