Re: [PR] [WIP] Add Data Files from Parquet Files [iceberg-python]

via GitHub Wed, 13 Mar 2024 10:02:12 -0700


syun64 commented on PR #506:
URL: https://github.com/apache/iceberg-python/pull/506#issuecomment-1995003905


   > So both of the approaches have pros and cons. One thing I would like to 
avoid is relying on Hive directly, so that we can 
generalize it to also import generic Parquet files.
   > 
   > One problem is that with Iceberg hidden partitioning, each partition field 
has a source-id that points to the column where the data is kept. If 
the Hive partitioning is arbitrary, e.g.:
   > 
   > ```sql
   > INSERT INTO transactions PARTITION (year = '2023') SELECT name, amount 
FROM some_other_table
   > ```
   > 
   > In this case there is no relation between the partition and any column in 
the table. In Iceberg you would expect something like:
   > 
   > ```sql
   > INSERT INTO transactions PARTITION (year = '2023') SELECT name, amount, 
created_at FROM some_other_table
   > ```
   > 
   > Where the partitioning is `year(created_at)`. If this column is not in 
there, I don't think we can import it into Iceberg because we cannot set the 
source-id of the partition spec.
   > 
   > I would also expect the user to pre-create the partition spec prior to the 
import, because inferring is tricky.
   
   Thank you for the context, @Fokko. By partition inference I meant 
inferring the partition values, not the PartitionSpec itself. 
So this function only runs after the Iceberg Table has been created with its 
expected PartitionSpec.
   
   But because Hive tables keep the partition values in the file paths instead 
of in the actual data files, I'm proposing that we have two modes of 
partition value inference: one from the file paths, and the other based on the 
upper and lower bound values from the Parquet metadata.
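
To make the two modes concrete, here is a minimal sketch in plain Python. It is not the PR's implementation: the function names `partition_values_from_path`, `partition_value_from_bounds`, and `year_transform` are hypothetical, the path mode assumes Hive-style `key=value` directories with percent-encoded values, and the bounds mode assumes an order-preserving transform (such as identity or year), so that equal transformed bounds imply a single partition value for the whole file.

```python
from datetime import date
from typing import Any, Callable, Dict
from urllib.parse import unquote


def partition_values_from_path(file_path: str) -> Dict[str, str]:
    """Mode 1 (hypothetical helper): read `key=value` directory segments."""
    values: Dict[str, str] = {}
    # Skip the last segment: it is the file name, not a partition directory.
    for segment in file_path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            # Assumption: special characters in values are percent-encoded.
            values[key] = unquote(value)
    return values


def year_transform(value: date) -> int:
    """Simplified stand-in for a year transform: years since 1970."""
    return value.year - 1970


def partition_value_from_bounds(
    lower: Any, upper: Any, transform: Callable[[Any], Any]
) -> Any:
    """Mode 2 (hypothetical helper): derive the value from column bounds.

    Only valid for order-preserving transforms: if the transformed lower and
    upper bounds agree, every row in the file maps to that partition value.
    """
    low, high = transform(lower), transform(upper)
    if low != high:
        raise ValueError(
            f"File spans more than one partition value: {low!r} to {high!r}"
        )
    return low


if __name__ == "__main__":
    path = "s3://bucket/warehouse/transactions/year=2023/part-0.parquet"
    print(partition_values_from_path(path))
    print(partition_value_from_bounds(date(2023, 1, 5), date(2023, 12, 30), year_transform))
```

In the path mode the values come back as strings and would still need converting to the partition field's type. In the bounds mode a file whose bounds map to different partition values is rejected rather than guessed at.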


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

