syun64 opened a new pull request, #506:
URL: https://github.com/apache/iceberg-python/pull/506

   PyIceberg's version of Spark's add_files migration procedure.
   
   Some early ideas on its implementation:
   - instead of mirroring the input interface of Spark's procedure, we could 
simply let users pass a list of full file paths
   - the current implementation infers the partition values from the path and 
does not validate that the files themselves contain those values. We could 
instead use the statistics in the Parquet metadata to check the min and max of 
the partition columns (min and max should be equal for an identity partition 
column) and derive the partition record value from that. When the statistics 
are present, this would be more accurate than inferring the value through 
string matching on the partition path
   - only Identity transforms are currently supported. This is because, in 
order to construct the manifest entries for the data files from the partition 
path, we need to convert the human-readable string values into their respective 
internal partition representations that get encoded as partition values in the 
Avro files. This is challenging for transformed partitions, since we would need 
a reverse transformation from the human-readable string to the partition 
representation for every supported IcebergType+Transform pair
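A minimal sketch of the statistics-based idea from the second bullet, assuming the per-row-group (min, max) pairs have already been pulled out of the Parquet footer; the helper name and its input shape are hypothetical, not part of this PR:

```python
def partition_value_from_stats(column_stats):
    """Derive a single partition value for a candidate identity partition
    column from its per-row-group (min, max) statistics.

    Returns None when any row group lacks statistics (so the caller can
    fall back to path-based inference); raises if the column is not
    constant within the file."""
    values = set()
    for stats in column_stats:
        if stats is None:
            return None  # no statistics recorded for this row group
        lo, hi = stats
        if lo != hi:
            raise ValueError(
                "min != max within a row group; the column is not constant "
                "and cannot be an identity partition column"
            )
        values.add(lo)
    if len(values) != 1:
        raise ValueError("row groups disagree on the partition value")
    return values.pop()
```

The (min, max) pairs could come from pyarrow, e.g. from the `statistics` attribute of `pq.ParquetFile(path).metadata.row_group(i).column(j)`, which exposes the footer statistics when they were written.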
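For the identity-transform case in the last bullet, inferring the value from the path amounts to parsing Hive-style key=value segments and casting the human-readable string back to the column's representation. A rough illustration (the helper and the cast mapping are hypothetical; the cast step is exactly what has no generic inverse for non-identity transforms such as bucket or truncate):

```python
from urllib.parse import unquote


def identity_partition_from_path(file_path, partition_casts):
    """Infer identity-partition values from a Hive-style file path.

    partition_casts maps a partition column name to a callable that turns
    the human-readable string into the column's value."""
    record = {}
    for segment in file_path.strip("/").split("/"):
        if "=" not in segment:
            continue  # not a partition segment (e.g. the file name)
        name, _, raw = segment.partition("=")
        if name in partition_casts:
            record[name] = partition_casts[name](unquote(raw))
    return record
```

For example, `identity_partition_from_path("warehouse/tbl/data/category=books/year=2021/part-0.parquet", {"category": str, "year": int})` yields `{"category": "books", "year": 2021}`.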


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
