GitHub user leborchuk added a comment to the discussion: [Proposal] Iceberg subsystem for datalake_fdw — design proposal
Hi! Thank you very much for sharing your thoughts and ideas! We here in Moscow have struggled with the same issue. I must say that everyone is crazy about lakehouse architecture, especially since no one fully understands what it actually means. Still, I have summarized the wishes I hear as "using various databases to work with well-structured transactional data". I really appreciate this effort and am willing to participate in the development process. That's why it's very important for me to understand why we are doing this, which parts matter most, and what should be done first versus postponed to later stages. I'd like to focus on:

### The best SELECT performance

Where is Cloudberry's place in the lakehouse world? Everyone knows about Trino and Apache Doris/StarRocks, and most probably the kernel of a future lakehouse system will be one of them, not Cloudberry. We could try to catch up with them and achieve feature parity, matching these products for a start and preferably surpassing them. That is realistic, but it takes a lot of effort and time, and may still fall short. If we accept that other databases do this job better and are more likely to be chosen for it, we can set priorities and start by doing something better than everyone else. As an MPP system, we are used to everything being properly distributed, and we have one of the best cost-based optimizers for producing execution plans. Let's:

A. Define the amount of work for each QE on the QD at planning time, using statistics and cost models.
B. Replan the query / reassign the list of parquet files to read per worker if our selectivity estimate was off.
C. Apply QE-side optimizations such as threads, SIMD instructions, etc. We have [iceberg-cxx](https://github.com/lithium-tech/iceberg-cxx), which covers the same ground as iceberg-cpp but currently has better performance.
D. Use a special proxy to get data from S3.
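As a rough illustration of point A, planning-time work assignment could be a greedy bin-packing over per-file cost estimates: give the next most expensive file to the currently least-loaded worker. This is only a minimal sketch under assumed inputs (a list of `(path, estimated_bytes)` pairs derived from table statistics, and a worker count); the names `assign_files` and the cost model are hypothetical, not anything in Cloudberry today.

```python
import heapq

def assign_files(files, n_workers):
    """Greedily balance estimated scan cost across QE workers:
    sort files by descending estimated cost, then always hand the
    next file to the least-loaded worker (classic LPT scheduling).

    files: list of (path, estimated_bytes) pairs from statistics.
    Returns {worker_id: (total_estimated_bytes, [paths])}.
    """
    # Min-heap of (current_load, worker_id, assigned_paths).
    heap = [(0, w, []) for w in range(n_workers)]
    heapq.heapify(heap)
    for path, est in sorted(files, key=lambda f: -f[1]):
        load, w, assigned = heapq.heappop(heap)
        assigned.append(path)
        heapq.heappush(heap, (load + est, w, assigned))
    return {w: (load, assigned) for load, w, assigned in heap}
```

Point B would then amount to re-running this assignment mid-query with corrected estimates for the files not yet scanned.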
The proxy could be used for IO control and as a caching layer; see the [simplified workflow used in yezzey](https://github.com/open-gpdb/yezzey#simplified-workflow).

### Native Polaris integration

Why place the metadata catalog outside the Cloudberry cluster? Let's make it a first-class citizen. One could configure an Apache Cloudberry cluster with the Polaris catalog: Cloudberry can store the data, and it can also store the Polaris catalog's own data. And so Cloudberry is once again the central element of the lakehouse.

GitHub link: https://github.com/apache/cloudberry/discussions/1683#discussioncomment-16637882

This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
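The read-through caching behavior the S3 proxy in point D could provide might look roughly like the following sketch. It assumes a hypothetical `fetch(key, offset, length)` backend callable standing in for a real S3 GET with a byte range; a production proxy (as in yezzey's workflow) would be a separate process with real IO throttling, not an in-process dictionary.

```python
class CachingS3Proxy:
    """Minimal read-through cache keyed by (object_key, offset, length),
    with a crude LRU eviction policy. `fetch` is a hypothetical backend
    callable that performs the actual S3 range read on a cache miss."""

    def __init__(self, fetch, capacity=128):
        self._fetch = fetch
        self._cache = {}    # (key, offset, length) -> bytes
        self._order = []    # least recently used first
        self._capacity = capacity

    def read(self, key, offset, length):
        ck = (key, offset, length)
        if ck in self._cache:
            # Cache hit: refresh recency, skip the S3 round trip.
            self._order.remove(ck)
            self._order.append(ck)
            return self._cache[ck]
        data = self._fetch(key, offset, length)  # cache miss: go to S3
        self._cache[ck] = data
        self._order.append(ck)
        if len(self._order) > self._capacity:    # evict the LRU entry
            old = self._order.pop(0)
            del self._cache[old]
        return data
```

The same layer is a natural choke point for IO control: the proxy sees every backend read, so rate limiting or prioritization can be applied on the miss path.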
