NoahKusaba opened a new pull request, #2613:
URL: https://github.com/apache/iceberg-rust/pull/2613

   ## Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and 
enhancements and this helps us generate change logs for our releases. You can 
link an issue to this PR using the GitHub syntax. For example `Closes #123` 
indicates that this PR will close issue #123.
   -->
   
   - Closes https://github.com/apache/datafusion-ballista/issues/1241
   
   ## What changes are included in this PR?
   
   <!--
   Provide a summary of the modifications in this PR. List the main changes 
such as new features, bug fixes, refactoring, or any other updates.
   -->
   
   Adds a new iceberg-ballista crate, which provides a distributed-query driver 
for Apache Iceberg for a distributed datafusion engine Apache 
Datafusion-Ballista + the targeted changes to iceberg-datafusion that make 
Iceberg's existing plan nodes serializable so they can cross node boundaries.
   
   The core problem it solves
   
   Iceberg's DataFusion integration already produces complete physical read and 
write plans, but every Iceberg plan node holds live, non-serializable state 
(Arc<dyn Catalog>, an open Table/FileIO). Ballista ships logical and physical 
plans to remote schedulers/executors, so those nodes couldn't travel. This 
branch closes that gap with one consistent idea: serialize a minimal 
self-contained recipe (IcebergCatalogConfig + identifiers), rebuild the live 
objects on the receiving node.
   
   - IcebergLogicalCodec:  serializes the catalog-backed table provider (config 
+ table ident, plus snapshot/metadata variants) so the scheduler can rebuild it 
and do physical planning, including INSERT.
   - IcebergPhysicalCodec:  serializes the four Iceberg execution nodes 
(IcebergTableScan, IcebergWriteExec, IcebergCommitExec, IcebergMetadataScan) 
and the PartitionExpr physical expression.
   - Tagged-envelope wire framing (TAG_ICEBERG / TAG_DELEGATED) — every blob 
carries a leading tag; non-Iceberg nodes are delegated to Ballista's own codec 
rather than byte-sniffed, so shuffles/sorts/etc. keep working and an unknown 
tag is a hard error, not a silent misparse. --> Based off comments from 
https://github.com/milenkovicm/ballista_delta 
   - serde.rs runtime bridge: block_on that adapts to whatever runtime the sync 
codec entry point is on (multi-thread → block_in_place; current-thread → 
dedicated thread; none → temp runtime), plus a process-wide catalog cache 
(durable-runtime-only) and a loader-based build_catalog that supports every 
catalog type (rest, sql, glue, hms, s3tables) and storage backend via 
OpenDalResolvingStorageFactory.
   Public API: register_iceberg_codecs(SessionConfig) and 
register_iceberg_table(...); a runnable examples/standalone-iceberg-write.rs.
   
   ## Are these changes tested?
   
   Ballista tests:
   - Distributed reads / writes tested against dockerized minio iceberg catalog 
( Standalone + multi-executor cluster). Also tests partitioned files writes + 
iceberg table registration
   - roundtrips testing that configurations for nodes are maintained through 
serialization -> deserialization 
   
   Datafusion tests:
   - Unit tests for config propagation and snapshot pinning.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to