GitHub user MisterRaindrop edited a comment on the discussion: [Proposal] Iceberg subsystem for datalake_fdw — design proposal
Thank you very much. First, let me respond to several questions:

1. **Why use Java Iceberg instead of iceberg-cpp**

   Currently, iceberg-cpp cannot meet our requirements. Although we have made some efforts, iceberg-cpp is still far from mature. By using the Java implementation of Iceberg, we can support the latest features such as Iceberg V3, V4, etc., in later stages. The Java-side Iceberg always maintains the latest version.

2. **datalake_agent will be integrated into Cloudberry**

   datalake_agent will include the Java Iceberg JAR package. It will be mainly responsible for parsing Iceberg metadata on the QD node, then dispatching and passing the metadata information to the segments. The segments will only be in charge of loading data. The advantages of this approach are:

   - The Java Iceberg JAR package is always up-to-date, allowing us to easily follow the latest code to implement features and support Iceberg V3, V4.
   - It reduces the pressure of metadata access.

3. **Optimal performance**

   We plan to use QE to perform unified data reading, which is faster than parsing by a single PXF process alone. For further performance optimization, we can refer more to optimizations for Parquet in projects such as Apache Arrow or DataFusion. I believe pure performance optimization is not an issue; the higher priority is to ensure complete functionality.

4. **Caching for object storage and Hadoop**

   Caching does significantly impact overall performance. However, we plan to reserve a dedicated read/write IO layer for users to implement their own best practices. This depends on how users define their own file IO. We will provide basic methods for accessing object storage and HDFS. Users can also implement their own optimized IO methods if needed.

5. **Regarding Polaris**

   This is a good question. However, I would like to clarify what integrating Polaris into Cloudberry specifically means. Does it mean hosting the Polaris service directly on Cloudberry?
Or hosting Polaris metadata on Cloudberry? @leborchuk

GitHub link: https://github.com/apache/cloudberry/discussions/1683#discussioncomment-16645685
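
As a rough illustration of the split described in point 2 (metadata is parsed once on the QD, and segments only load data), the sketch below shows one way a planned list of data files could be divided among segments. It is not the proposed implementation: in the proposal the planning would be done by the Java Iceberg library inside datalake_agent, and the names `DataFile` and `assign_files_to_segments`, as well as the greedy size-balancing strategy, are assumptions made only for this example.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one data file found while planning a scan."""
    path: str
    size_bytes: int


def assign_files_to_segments(files, num_segments):
    """Assign each file (largest first) to the currently least-loaded segment.

    Returns a dict mapping segment id -> list of DataFile. This greedy
    balancing is an assumption for illustration, not the proposal's policy.
    """
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")

    # Min-heap of (bytes assigned so far, segment id).
    heap = [(0, seg) for seg in range(num_segments)]
    heapq.heapify(heap)
    plan = {seg: [] for seg in range(num_segments)}

    for f in sorted(files, key=lambda f: f.size_bytes, reverse=True):
        load, seg = heapq.heappop(heap)
        plan[seg].append(f)
        heapq.heappush(heap, (load + f.size_bytes, seg))
    return plan


if __name__ == "__main__":
    # Made-up file list standing in for the result of metadata planning.
    files = [
        DataFile("data/a.parquet", 100),
        DataFile("data/b.parquet", 60),
        DataFile("data/c.parquet", 50),
        DataFile("data/d.parquet", 10),
    ]
    for seg, assigned in assign_files_to_segments(files, 2).items():
        print(seg, [f.path for f in assigned])
```

In this model only the coordinator touches table metadata. Each segment receives a plain list of file paths to read, which is the property the "reduces the pressure of metadata access" point relies on.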
