Fokko commented on PR #506: URL: https://github.com/apache/iceberg-python/pull/506#issuecomment-1994769768

Both approaches have pros and cons. One thing I would like to avoid is relying on Hive directly; that way we can also generalize this to importing generic Parquet files. One problematic aspect is that with Iceberg's hidden partitioning we have a source-id that points to the field where the data is actually kept. Hive partitioning, on the other hand, can be entirely arbitrary, e.g.:

```sql
INSERT INTO transactions PARTITION (year = '2023') AS
SELECT name, amount FROM some_other_table
```

In this case there is no relation between the partition and any column in the table. In Iceberg you would expect something like:

```sql
INSERT INTO transactions PARTITION (year = '2023') AS
SELECT name, amount, created_at FROM some_other_table
```

where the partitioning is `year(created_at)`. If that column is not there, I don't think we can import the data into Iceberg, because we cannot set the source-id of the partition spec. I would also expect the user to pre-create the partition spec prior to the import, since inferring it is tricky.
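The argument above can be sketched in code. This is a minimal, hypothetical illustration (plain Python, not the PyIceberg API): a partition spec is only valid if every partition field's source-id resolves to a column in the table schema, so a Hive table partitioned on a column that does not exist in the data cannot be mapped to an Iceberg spec. The `Field`/`PartitionField` shapes and `validate_spec` helper are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """A column in the table schema."""
    field_id: int
    name: str


@dataclass
class PartitionField:
    """One entry of a partition spec."""
    source_id: int   # must point at a field_id in the table schema
    name: str
    transform: str   # e.g. "year" for hidden partitioning like year(created_at)


def validate_spec(schema: list[Field], spec: list[PartitionField]) -> None:
    """Reject a spec whose source-id does not resolve to a schema column."""
    known_ids = {f.field_id for f in schema}
    for pf in spec:
        if pf.source_id not in known_ids:
            raise ValueError(
                f"partition field {pf.name!r}: source-id {pf.source_id} "
                "does not match any column; cannot import into Iceberg"
            )


# Schema of the first INSERT above: only name and amount, no created_at,
# so a spec meaning year(created_at) has nothing to point its source-id at.
schema = [Field(1, "name"), Field(2, "amount")]
spec = [PartitionField(source_id=3, name="year", transform="year")]

try:
    validate_spec(schema, spec)
except ValueError as err:
    print(err)

# Adding created_at (as in the second INSERT) makes the spec resolvable.
validate_spec(schema + [Field(3, "created_at")], spec)
```

This is also why pre-creating the spec is the safer workflow: the user states the source column explicitly instead of the importer guessing it from arbitrary Hive partition values.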
So both of the approaches have pro's and con's. One thing I would like to avoid is having to rely on Hive directly, this will make sure that we can generalize it to also import generic Parquet files. One problematic thing is that with Iceberg hidden partitioning we actually have the source-id that points to the field where the data is being kept. If the Hive partitioning is just arbitrary, eg: ```sql INSERT INTO transactions PARTITION (year = '2023') AS SELECT name, amount FROM some_other_table ``` In this case there is no relation between the partition and any column in the table. In Iceberg you would expect something like: ```sql INSERT INTO transactions PARTITION (year = '2023') AS SELECT name, amount, created_at FROM some_other_table ``` Where the partitioning is `year(created_at)`. If this column is not in there, I don't think we can import it into Iceberg because we cannot set the source-id of the partition spec. I would also expect the user to pre-create the partition spec prior to the import, because inferring is tricky. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org