Re: [D] [Proposal] Iceberg subsystem for datalake_fdw — design proposal [cloudberry]

via GitHub Wed, 22 Apr 2026 04:15:53 -0700


GitHub user MisterRaindrop edited a comment on the discussion: [Proposal] 
Iceberg subsystem for datalake_fdw — design proposal


Thank you very much. First, let me respond to several questions:

1. **Why use Java Iceberg instead of iceberg-cpp**
iceberg-cpp cannot meet our requirements today. Although we have put some 
effort into it, iceberg-cpp is still far from mature. The Java implementation 
of Iceberg always tracks the latest version, so building on it lets us support 
newer features such as Iceberg V3, V4, etc. in later stages.

2. **datalake_agent will be integrated into Cloudberry**
datalake_agent will bundle the Java Iceberg JAR. Its main job is to parse 
Iceberg metadata on the QD node and then dispatch that metadata to the 
segments. The segments are only responsible for loading data.
The advantages of this approach are:
- The Java Iceberg JAR can always be kept up to date, so we can easily follow 
the latest upstream code to implement features and support Iceberg V3 and V4.
- It reduces metadata access pressure, because metadata is parsed on the QD 
rather than separately on every segment.
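
To illustrate the QD-plans, segments-load split described above, here is a minimal sketch of how a planner could spread data files across segments. It is not the datalake_agent protocol: `FileTask`, `assign_tasks`, and the greedy size-balancing rule are hypothetical names and choices made only for this example.

```python
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class FileTask:
    """One data file to scan, as the QD might list it after reading metadata.

    Hypothetical structure for illustration only.
    """
    path: str
    size_bytes: int


def assign_tasks(tasks, num_segments):
    """Give each file, largest first, to the currently least-loaded segment."""
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    # Heap entries are (bytes assigned so far, segment id).
    heap = [(0, seg) for seg in range(num_segments)]
    heapq.heapify(heap)
    plan = {seg: [] for seg in range(num_segments)}
    for task in sorted(tasks, key=lambda t: t.size_bytes, reverse=True):
        load, seg = heapq.heappop(heap)
        plan[seg].append(task)
        heapq.heappush(heap, (load + task.size_bytes, seg))
    return plan
```

Calling `assign_tasks(tasks, num_segments)` on the QD side returns a per-segment list of files. In this sketch each segment would then only need its own list, not the table metadata itself.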

3. **Optimal performance**
We plan to have the QEs read the data themselves, which is faster than having 
a single PXF process parse it alone. For further performance optimization, we 
can draw on the Parquet optimizations in projects such as Apache Arrow or 
DataFusion.
I do not think raw performance will be a problem; the higher priority is to 
ensure complete functionality.

4. **Caching for object storage and Hadoop**
Caching does have a significant impact on overall performance. However, we 
plan to reserve a dedicated read/write IO layer so that users can implement 
their own best practices; caching behavior then depends on how users define 
their own file IO.
We will provide basic methods for accessing object storage and HDFS. Users can 
also implement their own optimized IO methods if needed.
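
As a rough sketch of what such a pluggable IO layer could look like, the example below defines a small read interface, a built-in backend, and a user-supplied caching wrapper. All names here (`FileIO`, `InMemoryFileIO`, `CachingFileIO`, `read_range`) are assumptions for illustration; the proposal does not specify the actual interface.

```python
from abc import ABC, abstractmethod


class FileIO(ABC):
    """Hypothetical read interface; the real layer's shape is not specified."""

    @abstractmethod
    def read_range(self, path: str, offset: int, length: int) -> bytes:
        ...


class InMemoryFileIO(FileIO):
    """Stand-in for a basic built-in backend (object storage or HDFS)."""

    def __init__(self, files):
        self.files = dict(files)

    def read_range(self, path, offset, length):
        return self.files[path][offset:offset + length]


class CachingFileIO(FileIO):
    """User-defined wrapper that keeps previously read byte ranges in memory."""

    def __init__(self, inner: FileIO):
        self.inner = inner
        self.cache = {}
        self.misses = 0  # number of reads that reached the inner backend

    def read_range(self, path, offset, length):
        key = (path, offset, length)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.inner.read_range(path, offset, length)
        return self.cache[key]
```

The point of the wrapper pattern is that caching stays outside the built-in backends: a user who wants a different policy (eviction, local disk, prefetch) would swap in their own `FileIO` implementation without touching the readers.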

5. **Regarding Polaris**
This is a good question, but I would first like to clarify what integrating 
Polaris into Cloudberry specifically means.
Does it mean hosting the Polaris service directly on Cloudberry, or hosting 
Polaris metadata on Cloudberry? @leborchuk 

GitHub link: 
https://github.com/apache/cloudberry/discussions/1683#discussioncomment-16645685

