Re: [D] [Proposal] Iceberg subsystem for datalake_fdw — design proposal [cloudberry]

via GitHub Fri, 08 May 2026 12:48:16 -0700


GitHub user yjhjstz edited a comment on the discussion: [Proposal] Iceberg 
subsystem for datalake_fdw — design proposal


The Java agent concern goes beyond "one extra hop". Fragment planning 
(`/fragments`) is called on **every SELECT**. At scale (millions of files, 
concurrent workloads), the gRPC + JVM deserialization cost shows up directly in 
query latency. More fundamentally, the JVM process model conflicts with 
PostgreSQL's fork/signal/crash-recovery model: memory limits are split across 
two runtimes (resource groups can't unify them), GC pauses can cause gRPC 
timeouts on the QD, and the agent restart window leaves all Iceberg tables 
temporarily unwritable. These are permanent architectural constraints, not 
things we can optimize away later.

| | Java agent | iceberg-cpp |
|--|--|--|
| Short-term delivery | Fast (iceberg-java ready to use) | Slower (gaps to fill) |
| Long-term operational cost | High | Low |
| Query performance ceiling | Bounded by gRPC + JVM | No extra overhead |
| Architectural consistency | Poor fit for a C/C++ database | Native |
| Format compatibility risk | Very low | Medium (must track spec carefully) |
| Reversibility | Nearly impossible once shipped | Continuously evolvable |
                                                            
The right answer is 
**[Apache iceberg-cpp](https://github.com/apache/iceberg-cpp)**. The gaps (CAS 
commit, catalog backends, snapshot/manifest writing) are well-specified 
engineering work, and closing them is a one-time investment. The Java agent's 
architectural debt is paid forever.
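
To illustrate why the CAS commit gap is bounded engineering work, here is a minimal sketch of an optimistic compare-and-swap commit loop. Everything in it is hypothetical: `ToyCatalog`, `CompareAndSwap`, and `CommitWithRetry` are illustrative names, not the iceberg-cpp API, and the in-memory string stands in for whatever atomic pointer swap a real catalog backend provides.

```cpp
#include <mutex>
#include <string>
#include <utility>

// Hypothetical in-memory stand-in for a catalog's "current metadata location".
// A real catalog would do this swap atomically in its own backing store.
class ToyCatalog {
 public:
  explicit ToyCatalog(std::string initial) : current_(std::move(initial)) {}

  // Returns the metadata location the table currently points at.
  std::string Current() const {
    std::lock_guard<std::mutex> lock(mu_);
    return current_;
  }

  // Swaps the pointer to `next` only if it still equals `expected`.
  // Returns false when another writer committed first.
  bool CompareAndSwap(const std::string& expected, const std::string& next) {
    std::lock_guard<std::mutex> lock(mu_);
    if (current_ != expected) return false;
    current_ = next;
    return true;
  }

 private:
  mutable std::mutex mu_;
  std::string current_;
};

// Optimistic commit: read the base, build new metadata on top of it,
// attempt the swap, and rebuild from the fresh base on conflict.
template <typename BuildFn>
bool CommitWithRetry(ToyCatalog& catalog, BuildFn build_next, int max_attempts) {
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    const std::string base = catalog.Current();
    const std::string next = build_next(base);
    if (catalog.CompareAndSwap(base, next)) return true;
  }
  return false;  // Caller decides whether to surface a conflict error.
}
```

The point of the sketch is that the commit protocol itself is small; the real work is implementing the atomic swap per catalog backend and writing the snapshot and manifest files that `build_next` would produce.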
                                                                                
                                                                                
                           
StarRocks and Doris, currently the strongest Iceberg MPP readers, are pure 
C++, no Java metadata sidecar. The TPC-H numbers shared above already show 
Cloudberry behind. Adding a Java agent makes catching up harder, not easier.
                                                                                
                                                                                
                           
 Cloudberry is an Apache incubating project. Co-investing in Apache iceberg-cpp 
is a better community story and a better technical foundation than wrapping 
iceberg-java behind gRPC. I'd strongly advocate for **not shipping a Java agent 
as part of Cloudberry core**, and instead contributing the missing pieces 
upstream to Apache iceberg-cpp together.

GitHub link: 
https://github.com/apache/cloudberry/discussions/1683#discussioncomment-16856371

