GitHub user leborchuk added a comment to the discussion: [Proposal] Iceberg 
subsystem for datalake_fdw — design proposal

Hi!

Thank you very much for sharing your thoughts and ideas! We here in Moscow have 
struggled with the same issue. I must say that everyone is crazy about 
lakehouse architecture, especially since no one fully understands what it 
actually means. But anyway, I have summarized people's wishes for myself as 
"using various databases to work with well-structured transactional data".

I really appreciate your efforts and am willing to take part in the development 
process.

That's why it's very important for me to understand why we are doing this, 
which parts are important, what types of work should be done first, and what 
can be postponed until later stages. I'd like to focus on:

### The best SELECT performance

Where is Cloudberry's place in the lakehouse world? Everyone knows about Trino 
and Apache Doris/StarRocks, and most probably the kernel of a future lakehouse 
system will be one of them, not Cloudberry. We could try to catch up with them 
and achieve feature parity, doing the same as these products, no worse for a 
start and preferably better. That is realistic, but it would take a lot of 
effort and time, and may still not succeed.

If we accept that there are other databases that do a better job here and are 
more likely to be used for it, we can set priorities and start by doing 
something better than everyone else. I mean that we in the MPP world are used 
to everything being properly distributed, and to using one of the best 
cost-based optimizers to produce the execution plan.

Let's:
A. Define the amount of work for each QE on the QD at the planning phase, using 
statistics and cost models.
B. Replan the query / reassign the list of parquet files to read per worker if 
the selectivity estimate turns out to be off.
C. Use various optimizations on the QE side, like threads/SIMD instructions, 
etc. We have [iceberg-cxx](https://github.com/lithium-tech/iceberg-cxx) - the 
same idea as iceberg-cpp, but currently with better performance.
D. Use a special proxy to get data from S3. The proxy could be used for IO 
control and as a caching layer, see [Simplified workflow used in 
yezzey](https://github.com/open-gpdb/yezzey#simplified-workflow)
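To make step A concrete, here is a minimal sketch (all names are hypothetical, 
not actual datalake_fdw code): the QD greedily assigns parquet files to QEs, 
largest estimated size first, always to the currently least-loaded executor, so 
the scan work per segment stays roughly balanced.

```python
import heapq

def assign_files_to_qes(files, num_qes):
    """Greedy LPT (longest-processing-time-first) assignment sketch.

    files   -- list of (path, estimated_bytes) taken from table statistics
    num_qes -- number of query executors in the cluster
    Returns a dict: qe_index -> list of parquet file paths to scan.
    """
    # Min-heap of (assigned_bytes, qe_index); the least-loaded QE is on top.
    heap = [(0, qe) for qe in range(num_qes)]
    heapq.heapify(heap)
    assignment = {qe: [] for qe in range(num_qes)}

    # Largest files first: placing big files early keeps loads even.
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        load, qe = heapq.heappop(heap)
        assignment[qe].append(path)
        heapq.heappush(heap, (load + size, qe))
    return assignment

files = [("a.parquet", 900), ("b.parquet", 500),
         ("c.parquet", 400), ("d.parquet", 100)]
plan = assign_files_to_qes(files, 2)
```

Step B would then amount to rerunning this assignment with corrected size/row 
estimates once the QD notices the actuals diverge from the plan.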

### Native Polaris integration

Why place the metadata catalog outside the Cloudberry cluster? Let's make it a 
first-class citizen. One could configure an Apache Cloudberry cluster with the 
Polaris catalog. Cloudberry can store the data, and it can also be used to 
store the Polaris catalog data itself. And so, Cloudberry is once again the 
central element of the lakehouse.

GitHub link: 
https://github.com/apache/cloudberry/discussions/1683#discussioncomment-16637882

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

