Re: [D] [Proposal] Iceberg subsystem for datalake_fdw — design proposal [cloudberry]

via GitHub Fri, 08 May 2026 12:31:56 -0700


GitHub user yjhjstz added a comment to the discussion: [Proposal] Iceberg 
subsystem for datalake_fdw — design proposal


I'd like to advocate for using https://github.com/apache/iceberg-cpp on the 
metadata path instead of the Java agent, and for Cloudberry to invest in that 
community together.

Why I think iceberg-cpp is the right long-term bet:

  1. No JVM sidecar. The datalake_proxy bgworker that forks and supervises a
     Java process is real operational complexity: JVM heap tuning, GC pauses,
     two processes to monitor, and a gRPC hop on every metadata operation. A
     native C++ library eliminates all of this.
  2. Architectural coherence. Cloudberry's core is C/C++. A native library
     fits naturally into the postmaster/backend process model; a Java
     subprocess is an alien runtime that complicates crash handling, signal
     propagation, and resource control.
  3. iceberg-cpp is early, but that's precisely the opportunity. The concern
     about maturity is valid today, but iceberg-cpp is an official Apache
     project on an active growth trajectory. Rather than working around its
     gaps by routing through iceberg-java, Cloudberry can close those gaps.
     Contributing the missing spec coverage (partition evolution, equality
     deletes, CAS commit logic, catalog backends) would benefit the entire
     ecosystem, not just Cloudberry.
  4. Shared investment with the community. leborchuk mentioned
     https://github.com/lithium-tech/iceberg-cxx as a performance reference.
     There's also TEA. There are clearly multiple teams working on C++
     Iceberg I/O. If we converge on Apache iceberg-cpp as the shared
     foundation, the effort compounds instead of fragmenting.
  5. The metadata path will not stay "not hot" forever. Fragment planning
     (/fragments) is called on every SELECT. As table file counts grow into
     the millions, the gRPC round-trip and Java deserialization will show up
     in query latency.
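
To make point 5 concrete, here is a rough back-of-envelope sketch in Python. It is not a measurement: the function name, the 200 bytes per fragment entry, the 500 MB/s combined serialize/transfer/deserialize throughput, and the 0.5 ms round-trip are all placeholder assumptions chosen only to show how the per-query cost scales with file count. It also assumes no pruning, i.e. every data file is returned as a fragment.

```python
def planning_overhead_ms(num_files, bytes_per_entry, throughput_mb_s, rtt_ms):
    """Estimate per-query metadata overhead in milliseconds.

    Model (an assumption, not Cloudberry's actual behavior): one RPC round
    trip plus the time to move and decode one entry per data file.
    """
    payload_bytes = num_files * bytes_per_entry
    transfer_ms = payload_bytes / (throughput_mb_s * 1_000_000) * 1000
    return rtt_ms + transfer_ms


if __name__ == "__main__":
    # All parameter values below are illustrative placeholders.
    for n in (1_000, 100_000, 1_000_000):
        ms = planning_overhead_ms(n, bytes_per_entry=200,
                                  throughput_mb_s=500, rtt_ms=0.5)
        print(f"{n:>9} files -> ~{ms:.1f} ms per SELECT")
```

Under these assumed numbers the fixed round-trip dominates for small tables, while the per-file term dominates once the file count reaches the hundreds of thousands. Real figures would need to be measured against the actual /fragments path.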

GitHub link: 
https://github.com/apache/cloudberry/discussions/1683#discussioncomment-16856371
