GitHub user leborchuk added a comment to the discussion: [Proposal] Iceberg subsystem for datalake_fdw — design proposal
> Thank you very much. First, let me respond to several questions:
>
> 1. **Why use Java Iceberg instead of iceberg-cpp**
>    Currently, iceberg-cpp cannot meet our requirements. Although we have made some efforts, iceberg-cpp is still far from mature. By using the Java implementation of Iceberg, we can support the latest features such as Iceberg V3, V4, etc., in later stages. The Java-side Iceberg always maintains the latest version.
> 2. **datalake_agent will be integrated into Cloudberry**
>    datalake_agent will include the Java Iceberg JAR package. It will be mainly responsible for parsing Iceberg metadata on the QD node, then dispatching and passing the metadata information to the segments. The segments will only be in charge of loading data.
>    The advantages of this approach are:
>    * The Java Iceberg JAR package is always up-to-date, allowing us to easily follow the latest code to implement features and support Iceberg V3, V4.
>    * It reduces the pressure of metadata access.
> 3. **Optimal performance**
>    We plan to use QE to perform unified data reading, which is faster than parsing by a single PXF process alone. For further performance optimization, we can refer more to optimizations for Parquet in projects such as Apache Arrow or DataFusion.
>    I believe pure performance optimization is not an issue; the higher priority is to ensure complete functionality.
> 4. **Caching for object storage and Hadoop**
>    Caching does significantly impact overall performance. However, we plan to reserve a dedicated read/write IO layer for users to implement their own best practices. This depends on how users define their own file IO.
>    We will provide basic methods for accessing object storage and HDFS. Users can also implement their own optimized IO methods if needed.
> 5. **Regarding Polaris**
>    This is a good question. However, I would like to clarify what integrating Polaris into Cloudberry specifically means.
>    Does it mean hosting the Polaris service directly on Cloudberry? Or hosting Polaris metadata on Cloudberry?

@leborchuk

1. Yes, it sounds wise to use a mature project. Iceberg Java is great, so there is no need to reimplement all of its functionality just to make it run inside the main process.

2. Yes, datalake_agent sounds good. But is it possible to define a stable, serializable RPC interface for interacting with the datalake_agent? What should it be: protobuf + gRPC?

3. I cannot say whether optimal performance is crucial or not, but I am afraid we will see strong demand for performance. Not optimal, but fast enough that using the extension makes sense. What is the primary purpose for which you are considering using Iceberg? Our scenario is as follows.

### (1) Sharing data

There is a lot of data that does not fit into a single Greenplum cluster, so we need to create several smaller clusters, say up to 10, each around 1-2 racks in size. The problem is how to load the data into these clusters. Copying the same data across 10 different clusters is impractical, time-consuming, and leads to cluster growth. Instead, we can load the data into an Iceberg table and then use extensions to read it from the different clusters. We need to make sure that reading it is no slower than reading from local files. No write support is required for this scenario, as the data can be produced by other engines such as Spark/Trino/StarRocks. You can see code for a GP6 extension in the tea project (https://github.com/lithium-tech/tea).

### (2) Archive data

Write data from GP to S3 and store the catalog info so the data can be re-read later. This allows you to reduce the cluster size. Right now there is no write functionality in the GP extensions. But performance here is not so crucial: you could write data to the archive in the background. Still, you shouldn't spend CPU aimlessly; GP clusters usually have little free CPU and memory.

I'd like to participate in all activities.
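On the RPC question in point 2 above: a minimal sketch of what such an interface could look like, assuming protobuf + gRPC is chosen. All service, message, and field names here are hypothetical, not an agreed interface; the shape just follows the proposal's split (QD-side agent plans the scan, segments execute file-level tasks):

```proto
// Hypothetical sketch of a QD <-> datalake_agent contract. Names are
// illustrative only; nothing here is a committed interface.
syntax = "proto3";

package datalake.agent.v1;

// QD asks the agent (which embeds the Java Iceberg JAR) to plan a scan.
service DatalakeAgent {
  rpc PlanScan(PlanScanRequest) returns (PlanScanResponse);
}

message PlanScanRequest {
  string catalog = 1;           // catalog name/URI the agent should use
  string table_identifier = 2;  // e.g. "db.table"
  int64 snapshot_id = 3;        // 0 = current snapshot
  repeated string columns = 4;  // projection pushed down from the planner
  string filter_expr = 5;       // serialized predicate for pushdown
}

// One unit of work a segment can execute independently.
message FileScanTask {
  string data_file_path = 1;
  int64 start = 2;              // byte-range split within the file
  int64 length = 3;
  repeated string delete_file_paths = 4;  // Iceberg V2 delete files
}

message PlanScanResponse {
  string schema_json = 1;            // Iceberg schema for QD-side type mapping
  repeated FileScanTask tasks = 2;   // dispatched across segments
}
```

Versioning the package (`datalake.agent.v1`) and only ever adding optional fields would keep the wire contract stable across releases; the same messages would also work over a plain pipe if running a full gRPC server inside the agent is unwanted.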
But I want to assess my capabilities soberly. For now I will be able to focus primarily on scenario **(1) Sharing data**. I think I can test this code on a production-like installation, and only if that succeeds would it be wise to move further. If not, we will need to continue working on the architecture.

Yes, the current approach is `fdw`, but the `TableAm` approach looks more promising.

There is also an interesting aspect: how exactly to work with metadata? First, it would be great if we could import a schema so that we don't have to create the objects ourselves. Secondly, we need to figure out how to handle columns and their data types. Ideally, I would like something like a view: you create an Iceberg table without specifying which columns you want and just select everything, and then, depending on the (Iceberg) transaction, you see a different column set and different types in the table.

4. Sorry for the direct question, but do you have any evidence? We tried to cache data in the yezzey project (https://github.com/open-gpdb/yezzey) and saw no performance benefits. While testing StarRocks (where Iceberg caching is enabled via a setting), again no significant differences on TPC-H queries: DataCache gave about a 10% improvement over reading directly from S3. We use yproxy (https://github.com/open-gpdb/yproxy) mainly to limit I/O, memory, and CPU consumption. That turned out to be more important than caching.

5. Polaris: I am not sure, we're still discussing it. Should it be Polaris, or maybe https://github.com/apache/gravitino? Is Cloudberry really good at the OLTP-style workload a catalog generates, or should something else be used? No answers right now.

GitHub link: https://github.com/apache/cloudberry/discussions/1683#discussioncomment-16687131

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
