Fokko commented on PR #506:
URL: https://github.com/apache/iceberg-python/pull/506#issuecomment-1994769768

   Both approaches have pros and cons. One thing I would like to avoid is 
relying on Hive directly; that way we can generalize this to also import 
generic Parquet files.
   
   One problem is that with Iceberg hidden partitioning we have a source-id 
that points to the field the partition values are derived from. Hive 
partitioning, in contrast, can be completely arbitrary, e.g.:
   
   ```sql
   INSERT INTO transactions PARTITION (year = '2023') SELECT name, amount 
FROM some_other_table
   ```
   
   Here the partition has no relation to any column in the table. In Iceberg 
you would expect something like:
   
   ```sql
   INSERT INTO transactions PARTITION (year = '2023') SELECT name, amount, 
created_at FROM some_other_table
   ```
   
   where the partitioning is `year(created_at)`. If that column is not 
present, I don't think we can import the table into Iceberg, because we 
cannot set the source-id of the partition spec.
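   The source-id resolution above can be sketched in plain Python (the schema and helper here are illustrative, not pyiceberg APIs):

   ```python
   # Hypothetical sketch: mapping a Hive partition column to an Iceberg
   # source-id requires the column to exist in the table schema.
   def resolve_source_id(schema, partition_column):
       """Return the field id backing partition_column, or None if absent."""
       for field_id, name in schema.items():
           if name == partition_column:
               return field_id
       return None

   # Illustrative schema: field id -> column name.
   schema = {1: "name", 2: "amount", 3: "created_at"}

   # Partitioning derived from a real column resolves to a source-id:
   assert resolve_source_id(schema, "created_at") == 3

   # An arbitrary Hive partition like `year` has no backing column,
   # so there is no source-id to put in the partition spec:
   assert resolve_source_id(schema, "year") is None
   ```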
   
   I would also expect the user to pre-create the partition spec before the 
import, because inferring it is tricky.
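   Pre-creating such a spec could look roughly like this with pyiceberg, assuming the table schema already has a `created_at` timestamp column; the field id `3` and the name `created_at_year` are illustrative:

   ```python
   from pyiceberg.partitioning import PartitionField, PartitionSpec
   from pyiceberg.transforms import YearTransform

   # Sketch of a user-supplied spec: year(created_at), where `created_at`
   # is assumed to have field id 3 in the table schema.
   spec = PartitionSpec(
       PartitionField(
           source_id=3,               # field id of `created_at` in the schema
           field_id=1000,             # partition field ids start at 1000
           transform=YearTransform(),
           name="created_at_year",
       ),
   )
   ```

   The import could then validate incoming Hive partitions against this spec instead of trying to infer one.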
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

