GitHub user leborchuk added a comment to the discussion: [Proposal] Iceberg 
subsystem for datalake_fdw — design proposal

> Thank you very much. First, let me respond to several questions:
> 
> 1. **Why use Java Iceberg instead of iceberg-cpp**
>    Currently, iceberg-cpp cannot meet our requirements. Although we have made 
> some efforts, iceberg-cpp is still far from mature. By using the Java 
> implementation of Iceberg, we can support the latest features such as Iceberg 
> V3, V4, etc., in later stages. The Java-side Iceberg always maintains the 
> latest version.
> 2. **datalake_agent will be integrated into Cloudberry**
>    datalake_agent will include the Java Iceberg JAR package. It will be 
> mainly responsible for parsing Iceberg metadata on the QD node, then 
> dispatching and passing the metadata information to the segments. The 
> segments will only be in charge of loading data.
>    The advantages of this approach are:
> 
> * The Java Iceberg JAR package is always up-to-date, allowing us to easily 
> follow the latest code to implement features and support Iceberg V3, V4.
> * It reduces the pressure of metadata access.
> 
> 3. **Optimal performance**
>    We plan to use QE to perform unified data reading, which is faster than 
> parsing by a single PXF process alone. For further performance optimization, 
> we can refer more to optimizations for Parquet in projects such as Apache 
> Arrow or DataFusion.
>    I believe pure performance optimization is not an issue; the higher 
> priority is to ensure complete functionality.
> 4. **Caching for object storage and Hadoop**
>    Caching does significantly impact overall performance. However, we plan to 
> reserve a dedicated read/write IO layer for users to implement their own best 
> practices. This depends on how users define their own file IO.
>    We will provide basic methods for accessing object storage and HDFS. Users 
> can also implement their own optimized IO methods if needed.
> 5. **Regarding Polaris**
>    This is a good question. However, I would like to clarify what integrating 
> Polaris into Cloudberry specifically means.
>    Does it mean hosting the Polaris service directly on Cloudberry? Or 
> hosting Polaris metadata on Cloudberry? @leborchuk

1. Yes, it sounds wise to use a mature project. Iceberg Java is great, so there 
is no need to rewrite all of that functionality from scratch just to make it 
run inside the main process.

2. Yes, datalake_agent sounds good. But is it possible to define a stable, 
serializable RPC interface for interacting with datalake_agent? What should it 
be, protobuf + gRPC?

3. I cannot say whether optimal performance is crucial, but I'm afraid there 
will be strong demand for performance: not optimal, just fast enough to make 
using the extension worthwhile.

What is the primary purpose for which you are considering using Iceberg?

Our scenario is as follows.
### (1) Sharing data
There is a lot of data that does not fit into a single Greenplum cluster, so we 
need to create several smaller clusters, say up to 10, each around 1-2 racks in 
size. The problem is how to get the data onto these clusters. Copying the same 
data across 10 different clusters is impractical: it is time-consuming and 
inflates the clusters. Instead, we can load the data into an Iceberg table and 
then use extensions to read it from the different clusters. We need to make 
sure that this reading is no slower than reading from local files. No writing 
is required for this scenario, since the data can be produced by other engines, 
such as Spark/Trino/StarRocks.

You can see the code for the GP6 extension in the tea project 
(https://github.com/lithium-tech/tea).
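
Separate from tea, the read-only side of this scenario is cheap with the 
Iceberg Java API. A minimal planning sketch (the warehouse path and table name 
are invented for the example) of the work a QD-side agent would do before 
handing file slices to segments:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.io.CloseableIterable;

public class SharedReadPlanner {
    public static void main(String[] args) throws Exception {
        // Any number of clusters can open the same warehouse read-only;
        // the location below is made up for the example.
        HadoopCatalog catalog =
                new HadoopCatalog(new Configuration(), "s3a://shared-warehouse/");
        Table table = catalog.loadTable(TableIdentifier.of("db", "events"));

        // Plan file tasks once; each task is a slice of a data file that
        // a segment could read independently, with no metadata access.
        try (CloseableIterable<FileScanTask> tasks =
                     table.newScan().planFiles()) {
            for (FileScanTask task : tasks) {
                System.out.printf("%s offset=%d len=%d%n",
                        task.file().path(), task.start(), task.length());
            }
        }
    }
}
```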

### (2) Archive data
Write data from GP to S3 and store the catalog info so it can be re-read later. 
This allows you to reduce cluster size. Right now there is no write 
functionality in GP extensions. Performance here is not so crucial; you could 
write data to the archive in the background. Still, CPU should not be spent 
aimlessly, since GP clusters usually have little free CPU and memory.

I'd like to participate in all activities, but I want to assess my capabilities 
soberly. For now I will be able to focus primarily on scenario **(1) Sharing 
data**. I think I can test this code on a production-like installation, and 
only if it succeeds there would it be wise to move further. If not, we will 
need to continue working on the architecture.

Yes, the current approach is `fdw`, but a `TableAM` approach looks more 
promising.

There is also an interesting aspect: how exactly do we work with metadata? 
First, it would be great if we could import a schema so we don't have to 
create the objects ourselves. Second, we need to figure out how to handle 
columns and their data types. Ideally, I would like something like a view: you 
create an Iceberg table without saying which columns you want and just select 
everything, and then, depending on the (Iceberg) transaction, you see a 
different column set and different types in the table (see the sketch below).
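
The Iceberg Java API already tracks what such a view would need: each snapshot 
records the id of the schema it was written with, so the column set can be 
resolved per snapshot. A minimal sketch, assuming the `Table` handle is loaded 
from a catalog elsewhere:

```java
import java.util.Map;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.types.Types;

public class SnapshotSchemas {
    // Print the column set as it existed at a given snapshot. This is the
    // information a "select everything" relation would need to resolve
    // its columns per Iceberg transaction.
    static void describe(Table table, long snapshotId) {
        Snapshot snap = table.snapshot(snapshotId);
        Integer schemaId = snap.schemaId();   // may be null on old metadata
        Map<Integer, Schema> schemas = table.schemas();
        Schema schema = (schemaId == null)
                ? table.schema()              // fall back to the current schema
                : schemas.get(schemaId);
        for (Types.NestedField col : schema.columns()) {
            System.out.printf("%s : %s%n", col.name(), col.type());
        }
    }
}
```

Mapping those Iceberg types onto PostgreSQL column types is the open part of 
the question.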

4. Sorry for the direct question, but do you have any evidence? We tried 
caching data in the yezzey project (https://github.com/open-gpdb/yezzey) and 
saw no performance benefits. While testing StarRocks (where Iceberg caching is 
enabled via a setting), we again saw no significant differences in TPC-H 
queries: DataCache gives only about a 10% improvement compared to reading 
directly from S3.

We use yproxy (https://github.com/open-gpdb/yproxy) mainly to limit 
input/output, memory, and CPU consumption. That turned out to be more important 
than caching.

5. Polaris

I am not sure; we're still discussing it. Should it be Polaris, or maybe 
https://github.com/apache/gravitino? Is Cloudberry really good at the OLTP 
workload a catalog generates, or should something else be used? No answers 
right now.

GitHub link: 
https://github.com/apache/cloudberry/discussions/1683#discussioncomment-16687131
