JonasJ-ap opened a new pull request, #6642: URL: https://github.com/apache/iceberg/pull/6642
This PR is under construction, but I want to put it here for initial feedback and discussion about the conversion from Apache Hudi to Apache Iceberg.

## Overview
This PR aims to add a module called `iceberg-hudi`, which contains a public API and a base implementation to snapshot a Hudi table to an Iceberg table. The base implementation is expected to rely on the `hudi-common` module to extract metadata, the timeline, locations of datafiles, and other information necessary for the conversion.

Hudi has two table types: `copy-on-write` (COW) and `merge-on-read` (MOR). As the initial implementation, this PR will focus on the conversion logic for COW tables.

The overall structure of the module is expected to be similar to #6449. However, things may change because Hudi is different from Delta Lake. Also, due to the complexity of the conversion, I may make a proposal later for further discussion in the community.

## High-level Ideas
The base implementation of the snapshot action involves schema conversion and timeline replay. The idea here is to map every completed `COMMIT` action on the timeline to an Iceberg snapshot. The concept of `COW` can be mapped to the `overwrite` operation in Iceberg. In other words, for every update of a datafile, we will `delete` the previous version of the datafile and `add` the newly created datafile to the Iceberg table.

## Need Further Investigations:
1. We may need some way to handle `Hard Deletes`.
2. We may take advantage of the column_stats stored in Hudi's metadata table rather than using FileIO to extract metrics from datafiles.

## Dependency Issue
1. Hudi suggests `Java 8` only, so we may need further discussion on how to integrate the Hudi dependency into the Iceberg project (maybe limit the module to be compiled only for Java 8).
2. `hudi-common` has a [dependency conflict](https://github.com/apache/hudi/issues/7409) with the `hudi-spark-bundle`, which is intended to be used for integration tests.
We may need to find an alternative way to construct tests later.

If you have comments or suggestions on how to convert Hudi to Iceberg, please feel free to share them here. Thank you in advance.
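
To make the timeline replay from the High-level Ideas section more concrete, here is a minimal, self-contained sketch of the intended mapping for COW tables. It does not use the Hudi or Iceberg APIs: `Commit`, `Overwrite`, `replay`, the file group ids, and the file names are all hypothetical stand-ins chosen for illustration. It assumes commits are supplied oldest first and that each commit lists the base file it wrote per file group. Hard deletes are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CowReplaySketch {

  /** Hypothetical stand-in for one completed COMMIT on a Hudi COW timeline. */
  static final class Commit {
    final String instantTime;
    // file group id -> path of the base file this commit wrote for that group
    final Map<String, String> writtenFiles;

    Commit(String instantTime, Map<String, String> writtenFiles) {
      this.instantTime = instantTime;
      this.writtenFiles = writtenFiles;
    }
  }

  /** Hypothetical stand-in for the overwrite that would become one Iceberg snapshot. */
  static final class Overwrite {
    final String instantTime;
    final List<String> deleted = new ArrayList<>();
    final List<String> added = new ArrayList<>();

    Overwrite(String instantTime) {
      this.instantTime = instantTime;
    }
  }

  /** Replays commits in the given order (assumed oldest first), one overwrite per commit. */
  static List<Overwrite> replay(List<Commit> commits) {
    // Tracks the latest base file seen so far for each file group.
    Map<String, String> live = new HashMap<>();
    List<Overwrite> result = new ArrayList<>();
    for (Commit commit : commits) {
      Overwrite op = new Overwrite(commit.instantTime);
      for (Map.Entry<String, String> entry : commit.writtenFiles.entrySet()) {
        String previous = live.put(entry.getKey(), entry.getValue());
        if (previous != null) {
          // The file group was rewritten: drop the older version of the datafile.
          op.deleted.add(previous);
        }
        op.added.add(entry.getValue());
      }
      result.add(op);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> first = new LinkedHashMap<>();
    first.put("fg-1", "a_001.parquet");
    first.put("fg-2", "b_001.parquet");

    Map<String, String> second = new LinkedHashMap<>();
    second.put("fg-1", "a_002.parquet"); // an update rewrites file group fg-1

    List<Overwrite> ops =
        replay(Arrays.asList(new Commit("001", first), new Commit("002", second)));
    for (Overwrite op : ops) {
      System.out.println(op.instantTime + ": deleted=" + op.deleted + " added=" + op.added);
    }
  }
}
```

In an actual implementation, each `Overwrite` would presumably be committed through Iceberg's overwrite operation, with the datafile metrics attached. How those metrics are obtained (FileIO versus Hudi's column_stats) is still open, as noted above.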