syun64 commented on PR #506: URL: https://github.com/apache/iceberg-python/pull/506#issuecomment-1995003905
> So both of the approaches have pros and cons. One thing I would like to avoid is having to rely on Hive directly; this will make sure that we can generalize it to also import generic Parquet files.
>
> One problematic thing is that with Iceberg hidden partitioning we actually have the source-id that points to the field where the data is being kept. If the Hive partitioning is just arbitrary, e.g.:
>
> ```sql
> INSERT INTO transactions PARTITION (year = '2023') AS SELECT name, amount FROM some_other_table
> ```
>
> In this case there is no relation between the partition and any column in the table. In Iceberg you would expect something like:
>
> ```sql
> INSERT INTO transactions PARTITION (year = '2023') AS SELECT name, amount, created_at FROM some_other_table
> ```
>
> Where the partitioning is `year(created_at)`. If this column is not in there, I don't think we can import it into Iceberg because we cannot set the source-id of the partition spec.
>
> I would also expect the user to pre-create the partition spec prior to the import, because inferring is tricky.

Thank you for the context @Fokko. What I meant by partition inference is inferring the partition values, not the Partition Spec itself. So this function only runs after the Iceberg table has been created with its expected PartitionSpec. But because Hive tables keep the partition values in the file paths rather than in the data files themselves, I'm proposing two modes of partition value inference: one from the file paths, and the other based on the upper and lower bound values in the Parquet metadata.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
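To make the two proposed inference modes concrete, here is a minimal sketch. It is not the PyIceberg API: `partition_values_from_path`, `partition_value_from_bounds`, and `year_transform` are hypothetical names, and the sketch assumes Hive-style `key=value` directory segments and that the column's lower and upper bounds have already been read from the Parquet footer.

```python
from datetime import date
from typing import Callable, Dict, TypeVar
from urllib.parse import unquote

T = TypeVar("T")
P = TypeVar("P")


def partition_values_from_path(file_path: str) -> Dict[str, str]:
    """Mode 1 (hypothetical helper): read Hive-style key=value path segments."""
    values: Dict[str, str] = {}
    # Skip the last segment, which is the data file name itself.
    for segment in file_path.split("/")[:-1]:
        key, sep, raw = segment.partition("=")
        if sep and key:
            # Hive percent-encodes special characters in partition values.
            values[key] = unquote(raw)
    return values


def partition_value_from_bounds(lower: T, upper: T, transform: Callable[[T], P]) -> P:
    """Mode 2 (hypothetical helper): derive the value from column bounds.

    The file maps to a single partition only if the transformed lower and
    upper bounds of the source column agree.
    """
    low, high = transform(lower), transform(upper)
    if low != high:
        raise ValueError(f"file spans more than one partition: {low!r} != {high!r}")
    return low


def year_transform(d: date) -> int:
    # Assumption: mirrors Iceberg's year transform, i.e. years since 1970.
    return d.year - 1970


print(partition_values_from_path("s3://bucket/transactions/year=2023/part-0.parquet"))
print(partition_value_from_bounds(date(2023, 1, 5), date(2023, 12, 30), year_transform))
```

The path-based mode returns strings, so the values would still need converting to the partition field's type. The bounds-based mode needs no Hive layout, which fits the goal of importing generic Parquet files, but it has to reject any file whose bounds straddle two partitions.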
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org