[GitHub] [iceberg] JonasJ-ap opened a new pull request, #6642: WIP: Support Snapshot Copy-On-Write Hudi Table to Iceberg Table

via GitHub Sat, 21 Jan 2023 21:41:34 -0800


JonasJ-ap opened a new pull request, #6642:
URL: https://github.com/apache/iceberg/pull/6642


   This PR is still under construction, but I am opening it now to get early 
feedback and discussion on converting Apache Hudi tables to Apache Iceberg.
   
   ## Overview
   This PR aims to add a module called `iceberg-hudi`, which contains a public 
API and a base implementation for snapshotting a Hudi table as an Iceberg 
table. The base implementation is expected to rely on the `hudi-common` module 
to extract the metadata, timeline, data file locations, and other information 
necessary for the conversion. 
   
   Hudi has two table types: `copy-on-write` (COW) and `merge-on-read` (MOR). 
As an initial implementation, this PR focuses on the conversion logic for 
COW tables. 
   
   The overall structure of the module is expected to be similar to #6449. 
However, things may change because Hudi differs from Delta Lake. Also, given 
the complexity of the conversion, I may write a proposal later for further 
discussion in the community.
   
   ## High-level Ideas
   The base implementation of the snapshot action involves schema conversion 
and timeline replay. The idea is to map every completed `COMMIT` action on 
the timeline to an Iceberg snapshot. The `COW` concept can be mapped to 
the `overwrite` operation in Iceberg. In other words, for every update to a 
data file, we `delete` the previous version of the data file and `add` the 
newly created data file to the Iceberg table. 
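   
   The replay idea above can be sketched roughly as follows. This is a 
simplified, self-contained model in Python, not the code in this PR: 
`WrittenFile`, `SnapshotPlan`, and `replay_timeline` are hypothetical names, 
and it assumes each completed commit is available as a commit time plus the 
data files it wrote, with a stable file group id linking successive versions 
of the same data file. Partitions, schema conversion, and hard deletes are 
left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WrittenFile:
    """Hypothetical stand-in for one data file written by a Hudi commit."""
    file_id: str  # assumed stable across versions of the same data file
    path: str     # location of the data file written by this commit


@dataclass
class SnapshotPlan:
    """Hypothetical description of one Iceberg overwrite snapshot."""
    commit_time: str
    deleted: list = field(default_factory=list)  # paths to delete
    added: list = field(default_factory=list)    # paths to add


def replay_timeline(commits):
    """Map each completed commit to one overwrite plan.

    `commits` is an iterable of (commit_time, [WrittenFile, ...]) pairs.
    Commits are replayed in commit-time order.
    """
    live = {}  # file_id -> path of the version currently in the table
    plans = []
    for commit_time, files in sorted(commits, key=lambda c: c[0]):
        plan = SnapshotPlan(commit_time)
        for f in files:
            previous = live.get(f.file_id)
            if previous is not None:
                # An update: drop the previous version of this data file.
                plan.deleted.append(previous)
            plan.added.append(f.path)
            live[f.file_id] = f.path
        plans.append(plan)
    return plans
```

   In this model, a commit that rewrites an existing data file yields a plan 
with one deleted path and one added path, while a commit that only writes new 
data files yields a plan with additions only. Each plan would then correspond 
to one `overwrite` commit on the Iceberg table.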
   
   ## Needs Further Investigation
   1. We may need some way to handle `Hard Deletes`.
   2. We may be able to take advantage of the column_stats stored in Hudi's 
metadata table rather than using FileIO to extract metrics from data files.
   
   ## Dependency Issue
   1. Hudi recommends `Java 8` only, so we may need further discussion on how 
to integrate the Hudi dependency into the Iceberg project (maybe limit the 
module to being compiled only for Java 8).
   2. `hudi-common` has a [dependency 
conflict](https://github.com/apache/hudi/issues/7409) with 
`hudi-spark-bundle`, which is intended to be used for integration tests. We 
may need to find an alternative way to construct tests later.
   
   
   If you have comments or suggestions on how to convert Hudi to Iceberg, 
please feel free to share them here. Thank you in advance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

