GitHub user MisterRaindrop edited a discussion: [Proposal] Iceberg subsystem
for datalake_fdw — design proposal
### Proposers
@MisterRaindrop
### Proposal Status
Under Discussion
### Abstract
## 1. Abstract
Cloudberry does not yet have a complete set of plug-in tools for accessing various external data sources.
This proposal designs a data-lake access path to those sources and evolves Cloudberry toward a data lake–enabled architecture.
`datalake_fdw` extends Cloudberry with two complementary ways of accessing
data-lake storage:
1. **FDW foreign-table read / append**: direct read / append of Parquet / ORC / Avro / Text / CSV files on S3 / HDFS / OSS.
2. **Native Iceberg tables** (added by this design): `CREATE ICEBERG TABLE`
inside CB to create and manage Apache Iceberg tables with full **SELECT /
INSERT / UPDATE / DELETE / VACUUM**, Schema Evolution, and snapshot-based Read
Committed isolation.
This document focuses on the second part — the design, the key decisions, and
the open questions — and is meant for community review.
### Motivation
## 2. Motivation & Goals
### 2.1 Why we need this
As an MPP data warehouse, Cloudberry has long lacked a **transactional read /
write** entry point for data-lake formats, Iceberg in particular:
- The existing PXF-based FDW foreign tables are limited in capability;
- There is no Catalog concept, so CB cannot share metadata with the wider
Iceberg ecosystem (Spark / Trino / Flink);
- There is no snapshot isolation or ACID, which makes lakehouse scenarios
diverge from CB's native-table semantics.
The Iceberg subsystem aims to introduce Iceberg tables as **first-class "lake
tables"** in CB without breaking PostgreSQL / Cloudberry transactional
semantics:
- Same SQL entry point as native tables (`CREATE ICEBERG TABLE ...`, `INSERT`,
`UPDATE`, `DELETE`, `VACUUM`);
- Metadata format fully compatible with the Iceberg community — snapshots
written by CB must be directly readable by Spark / Trino;
- Write path is MPP-parallel; segments talk to object storage directly;
- Transactional semantics aligned with PG: a single transaction is either fully
visible or fully rolled back, and `SAVEPOINT` is supported.
### 2.2 Goals
The first release of this design aims to deliver:
- **Catalog support**: Polaris / Hive Metastore / Builtin (CB-internal);
- **Storage support**: S3 (including MinIO / OSS) and HDFS (including HA +
Kerberos);
- Read Committed isolation with concurrent commits;
- Reuse of the Iceberg community Java implementation to keep metadata-semantics
maintenance cost low;
- MPP-parallel file-level execution; QEs write directly to object storage.
### 2.3 Non-goals (outside the first release)
- No explicit Serializable isolation;
- No partition-spec evolution; bucket / truncate / hour transforms not
supported;
- No Branch / Tag / Time Travel queries;
- Does not replace the FDW raw-file path — both paths coexist.
### Implementation
## 3. Overall Architecture
The proposed design has four layers, split into a **metadata path** and a
**data path**:
```
┌─────────────────────────────────────────────────────────────┐
│  SQL: CREATE / SELECT / INSERT / UPDATE / DELETE / VACUUM   │
└─────────────────────────────────────────────────────────────┘
                               │
┌─────────────────────────────────────────────────────────────┐
│                      Iceberg Table AM                       │
│      (makes Iceberg tables look like ordinary tables;       │
│          tableam callbacks + transaction Tracker)           │
└─────────────────────────────────────────────────────────────┘
             │                                 │
             │ metadata                        │ data
             ▼                                 ▼
┌──────────────────────────┐      ┌──────────────────────────┐
│       Catalog FDW        │      │        Volume FDW        │
│     Polaris / Hive /     │      │        S3 / HDFS         │
│   Builtin (CB sys tbl)   │      │                          │
└──────────────────────────┘      └──────────────────────────┘
             │                                 │
             │ gRPC                            │
             ▼                                 ▼
┌──────────────────────────┐      ┌──────────────────────────┐
│   datalake_agent (jar)   │      │      Provider (C++)      │
│   Java / iceberg-java    │      │  Parquet reader/writer   │
│   ↑ launched by          │      │    position/eq delete    │
│     datalake_proxy       │      │                          │
│     bgworker             │      │                          │
└──────────────────────────┘      └──────────────────────────┘
             │                                 │
             ▼                                 ▼
    Catalog service / HMS             Object store / HDFS
```
- All **metadata** operations (CREATE TABLE, plan files, commit snapshot,
VACUUM rewrite) go through the agent and are handled by iceberg-java;
- All **data** operations (Parquet / position-delete file read / write) go
through FDW → Provider; segments talk to storage directly, bypassing the agent;
- `datalake_agent` is a Java jar; it is launched and supervised by the PG
bgworker `datalake_proxy` at postmaster startup (see §5.4);
- The RPC channel between PG and the agent is gRPC.
## 4. The Core Abstraction: Catalog × Volume × Table
The design splits an Iceberg table into three independently configurable,
freely composable pieces:
| Abstraction | Responsibility | Supported |
|-------------|----------------|-----------|
| **Catalog** | Iceberg metadata directory: namespace / table listing, `metadata.json` location, schema-evolution history | Polaris REST / Hive Metastore / Builtin |
| **Volume** | Where data files physically live: data files / delete files / manifest / metadata json | S3 (incl. MinIO / OSS) / HDFS (incl. HA + Kerberos) |
| **Table** | The above two + column definitions + partition keys + table options | — |
A Volume can be shared by multiple tables (different paths under the same
bucket); a Catalog can reference multiple Volumes (different tables on
different storage). Polaris is a special case — the storage configuration is
dispatched by the Polaris service, so a user-side Volume is optional.
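To make the composition concrete, the sketch below shows the intended DDL surface. Only the FDW names and the default-catalog / default-volume GUCs come from this proposal; the server / table option names (`uri`, `endpoint`, `bucket`, credential keys) and the `PARTITION BY` syntax are illustrative placeholders, not final syntax.
```sql
-- Catalog: where Iceberg metadata lives
CREATE SERVER hive_catalog
    FOREIGN DATA WRAPPER iceberg_catalog_fdw
    OPTIONS (type 'hive', uri 'thrift://hms.example.com:9083');

-- Volume: where data files live, with its own credentials
CREATE SERVER s3_hot
    FOREIGN DATA WRAPPER iceberg_volume_fdw
    OPTIONS (type 's3', endpoint 'https://s3.example.com', bucket 'lake-hot');

CREATE USER MAPPING FOR CURRENT_USER SERVER s3_hot
    OPTIONS (accesskey '...', secretkey '...');

-- Table: columns + partition keys, bound to the two servers above via the
-- planned default-catalog / default-volume GUCs (§12.1)
SET iceberg_default_catalog = 'hive_catalog';
SET iceberg_default_volume  = 's3_hot';

CREATE ICEBERG TABLE sales (
    id      bigint,
    region  text,
    amount  numeric,
    sold_at date
) PARTITION BY (region);    -- identity partitioning only, per §2.3
```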
### Why Catalog and Volume are separated
In real deployments they are **orthogonal**:
- Some users already run a Hive Metastore and want data files on S3;
- Some users use Polaris as the catalog but keep two buckets — hot and cold —
for different tables;
- Some users have no external catalog at all, only object storage.
Making Catalog and Volume two separate FDWs, each with its own Server /
UserMapping, lets us cover every combination without inventing a new FDW for
each.
### Builtin Catalog
For users with no external Catalog (Polaris / Hive) available, the design
offers a **Builtin** option: the `metadata.json` location is stored directly in
a CB system table. Data files still live on the Volume, and other engines can
open the table through Iceberg's HadoopCatalog / FileIO using that path.
**Why we need it**: it removes the hard dependency on a Catalog service and
lowers the barrier to entry. It also gives a zero-dependency option for the "CB
is the only writer" single-writer scenario.
## 5. Components & Design Decisions
The following subsections give the design choice — and the reasoning behind it — for each key component.
### 5.1 Iceberg Table AM: why not a pure FDW
The most direct approach would be to keep using FDW, but two hard limitations
get in the way:
- PG's FDW has limited support for UPDATE / DELETE, which does not fit
Iceberg's requirement for full DML;
- FDW foreign tables are treated as second-class citizens in many parts of the
planner / analyzer / resource group.
Table AM (`TableAmRoutine`), introduced in PostgreSQL 12 and present in Cloudberry's PG 14-based kernel, is a first-class storage
abstraction: from the SQL side an Iceberg table looks like an ordinary table,
and UPDATE / DELETE / `ctid` semantics, transactional callbacks, and ANALYZE
all come for free from the kernel.
The proposed approach is therefore: register Iceberg tables as a dedicated
Table AM, and **have the AM delegate data I/O to the Volume FDW internally**
(reusing the existing S3 / HDFS read / write code). We get the SQL consistency
of tableam and avoid reimplementing the storage layer.
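For reference, the snippet below shows how a table AM surfaces in stock PostgreSQL SQL; the proposal hides the same mechanism behind `CREATE ICEBERG TABLE`, and the handler / AM names here are illustrative.
```sql
-- Minimal sketch of registering a table AM, as it would appear in the extension
-- script (stock PostgreSQL 12+ syntax; handler and AM names are illustrative)
CREATE FUNCTION iceberg_am_handler(internal)
    RETURNS table_am_handler
    AS 'MODULE_PATHNAME', 'iceberg_am_handler'
    LANGUAGE C;

CREATE ACCESS METHOD iceberg TYPE TABLE HANDLER iceberg_am_handler;

-- Once registered, an ordinary CREATE TABLE can pick the AM explicitly
CREATE TABLE demo (id int, payload text) USING iceberg;
```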
The core code will live in `src/am_iceberg/`. The AM handler itself is very
thin; the main logic is planned to be organized as follows:
- `pg_iceberg_ddl.c` — `OAT_POST_CREATE / OAT_DROP` hook; creates/drops Iceberg
tables via the Catalog on DDL;
- `pg_iceberg_catalog.c` — unified wrapper for all Catalog calls;
- `pg_iceberg_metadata.c` — manages the `iceberg.pg_iceberg_metadata` system
table;
- `pg_iceberg_metadata_tracker.c` — transaction-scoped metadata tracker (see
§5.6);
- `pg_iceberg_rewrite_plan.c` — QD ↔ QE JSON contract for VACUUM compaction.
### 5.2 Catalog FDW: abstracting three backends
`iceberg_catalog_fdw` abstracts metadata operations into a set of
`IcebergCatalogOperation`s (create_table / load_table / drop_table / append /
update / delete / get_fragment / get_statistics / plan_file_groups / commit_*
and so on).
The Server's `type` option decides the backend:
| `type` | Backend |
|--------|---------|
| `polaris` | Polaris REST Catalog |
| `hive` | Hive Metastore (Kerberos supported) |
| unset | Builtin (CB system table) |
Upwards, the AM only sees `pg_iceberg_*_with_catalog()` functions. Downwards,
`agent_cli` talks to the agent over RPC (Builtin is the exception — it
short-circuits to the CB system table).
**Why FDW instead of a plain C function**: it lets us reuse PG's `CREATE SERVER
/ USER MAPPING` for credentials and permissions, and unifies the configuration
entry point across multiple Catalog types.
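A minimal sketch of what that buys us — the `type` option selects the backend, and credentials ride on the user mapping rather than appearing in table DDL (option names are illustrative):
```sql
-- The type option picks the backend; credentials live on USER MAPPING
CREATE SERVER polaris_cat FOREIGN DATA WRAPPER iceberg_catalog_fdw
    OPTIONS (type 'polaris', uri 'https://polaris.example.com/api/catalog');
CREATE USER MAPPING FOR analyst SERVER polaris_cat
    OPTIONS (client_id '...', client_secret '...');

-- Builtin backend: no type option, no external service; metadata.json locations
-- are kept in the CB system table
CREATE SERVER builtin_cat FOREIGN DATA WRAPPER iceberg_catalog_fdw;
```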
### 5.3 Volume FDW: the data-file I/O abstraction
`iceberg_volume_fdw` is planned to handle the actual read / write of **data
files and delete files** (manifest / metadata json are managed by the Catalog
side). It implements the full FDW interface: `GetForeignRelSize /
GetForeignPaths / BeginForeignScan / BeginForeignModify / ...`.
Its responsibilities:
1. Take the fragment list from the Catalog FDW (passed into `fdw_private` at
plan time);
2. On the QE side, filter out the fragments assigned to this segment by
`segindex`;
3. Call the Provider to read / write Parquet;
4. After writing, serialize the file metadata produced by this QE (path /
record_count / size / partition / whether it is a position-delete) into JSON
and return it to the QD.
The Server's `type` option decides storage: `s3` / `s3b` (OSS / MinIO / OBS /
…) / `hdfs`.
### 5.4 datalake_agent: why a separate Java service
This is the single most important design trade-off.
Iceberg's metadata semantics are complex: manifest lists, snapshot logs,
partition-spec evolution, schema field-id mapping, optimistic CAS commit, and
so on. The community's most invested, most mature implementation is
`iceberg-java`.
Reimplementing all of this on the C / C++ side would cost us:
- A large initial implementation;
- Repeated effort on every Iceberg version upgrade;
- Format-compatibility risk (some defaults are hard-coded in the reference
implementation and not fully documented).
Therefore the design delegates all metadata operations to a dedicated
`datalake_agent` (Java Spring Boot, wrapping iceberg-java + hive-jdbc +
hadoop-client). The interface is planned to cover:
- `/iceberg/tables` — create / load / drop;
- `/fragments` — plan files (with predicate pushdown);
- `/modify` — incremental snapshot generation;
- `/commit` — CAS commit;
- `/plan-rewrite` + `/commit-rewrite` — VACUUM.
**Upside**:
- Compatibility: snapshots written by CB are byte-identical to the community
format;
- Easy upgrades: picking up a new Iceberg version is just a jar swap on the
agent;
- Stateless: every request carries the full configuration, making horizontal
scaling easy.
**Cost**: one extra network hop — but **only on the metadata path**; data I/O
still goes straight from C++ to storage, so throughput is unaffected.
#### Process lifecycle: managed by the `datalake_proxy` bgworker
To tie the agent's lifecycle to the CB cluster and spare users from babysitting
a Java process, the design introduces `datalake_proxy`
(`contrib/datalake_proxy/`), a PG background worker (bgworker):
- `datalake_proxy` is registered in `shared_preload_libraries` and starts with
the postmaster;
- In `_PG_init`, a bgworker is registered that `fork`s a child process to run
the agent jar on startup;
- If the child crashes, `datalake_proxy` restarts it;
- The GUC `datalake_proxy.register_datalake_proxy` toggles the feature;
`datalake_proxy.dlagent_memory_limit` (default 2 GB) caps the agent's JVM heap;
- When the postmaster exits, the signal propagates through `datalake_proxy` to
the agent for a clean shutdown.
From the user's perspective this means "CB is up → Iceberg is available" — no extra deployment, no extra supervisor.
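As a sanity check, the supervising bgworker should be observable from SQL like any other background worker; the `backend_type` pattern below is only a guess at the name the worker would register.
```sql
-- Illustrative: background workers appear in pg_stat_activity with a backend_type;
-- the exact string datalake_proxy registers is an assumption here
SELECT pid, backend_type, state
FROM pg_stat_activity
WHERE backend_type ILIKE '%datalake%';
```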
#### RPC protocol: gRPC
JSON / REST has two pain points at scale — large fragment lists cost CPU to
encode / decode, and plan-file results are slow to deserialize when they get
big. The plan is to expose the same interface over protobuf + gRPC:
- Bidirectional streaming interfaces (e.g. `get_fragments` can be
server-streamed) reduce QD memory pressure;
- protobuf saves bandwidth and CPU;
- gRPC's built-in health checks / load balancing pave the way for a
multi-instance agent deployment in the future.
### 5.5 Provider layer: the data plane
`src/provider/iceberg/` is planned to be a C++ implementation covering
Iceberg's data plane:
- Parquet / ORC row readers and writers;
- Position-delete file I/O (schema = `file_path string, pos long`);
- Delete-index construction (data file → deleted-positions bitmap);
- Equality-delete read (read-only for now);
- Translation from Iceberg `FileScanTask` into a row reader.
**Why Provider does not go through the agent**: data I/O is the system's
throughput bottleneck. Only by having each segment read / write storage
independently and in parallel can we sustain MPP-scale writes. Meanwhile,
mature C++ libraries already exist for Parquet (arrow-cpp / orc) — reusing them
is far more efficient than routing through an agent.
### 5.6 Metadata Tracker: the heart of transactional semantics
**The problem**: Iceberg uses optimistic CAS (via the metadata.json version
chain) for concurrency, while PG uses MVCC. How do we fit Iceberg's snapshot
semantics inside a PG transaction?
**The design**: a transaction-scoped `Metadata Tracker`. Its shape is inspired
by Rust iceberg-rs's `MetadataLocationTracker` and pg_lake's
`IcebergSnapshotBuilder`.
Under this design, modifications to an Iceberg table within a transaction flow
as follows:
```
BEGIN
│
├─ First access to t: read current metadata_location from Catalog
│            as initial_base
│
├─ DML-1 ──→ QE writes data files
│            QD calls agent /modify to produce an "intermediate"
│            metadata.json (NOT committed to Catalog)
│            tracker records: current_metadata, accumulated data_files
│
├─ DML-2 ──→ Read latest metadata from Catalog (rebase check)
│            If it changed (someone else committed), re-plan against it
│            Produce a new "intermediate" metadata
│
├─ SELECT ─→ Uses tracker.current (Read-Your-Own-Writes)
│            For already-modified tables, triggers one more rebase
│            (to see concurrent commits)
│
├─ SAVEPOINT / ROLLBACK TO ─→ stack-style restore of accumulated files
│
COMMIT
│
└─ tracker_commit_all(): per modified table, CAS to Catalog
   On conflict, rebase and retry; up to 10 retries then PG-level abort
```
**Three rebase trigger points**:
| Scenario | When | Purpose |
|----------|------|---------|
| per-statement | End of each DML | Read-Your-Own-Writes + early concurrent-conflict detection |
| at-scan | SELECT on an already-modified table | Let SELECT see concurrent committed data |
| at-commit | PRE_COMMIT | Final merge, reduces CAS failure probability |
The resulting semantics:
- **Read Committed**: every statement sees committed concurrent transactions;
- **Read-Your-Own-Writes**: a SELECT within the transaction sees its own prior
INSERTs;
- **ACID**: the CAS to Catalog happens only at COMMIT. On rollback,
intermediate metadata.json files and the data files already written become
orphans and are reclaimed by the background cleanup queue;
- **SAVEPOINT**: the tracker maintains an internal `level_history` stack,
recording the metadata and file counts before each nested-transaction
modification.
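The intended behavior, written out as a SQL session (illustrative; `t` is any Iceberg table with an `id` column):
```sql
BEGIN;
INSERT INTO t VALUES (1);      -- QE writes a data file; the tracker records it
SELECT count(*) FROM t;        -- Read-Your-Own-Writes: already counts the new row
SAVEPOINT s1;
DELETE FROM t WHERE id = 1;    -- position-delete file, recorded under s1
ROLLBACK TO SAVEPOINT s1;      -- tracker pops back to the pre-s1 file set
COMMIT;                        -- one CAS per modified table; on conflict,
                               -- rebase and retry (up to 10 times)
```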
### 5.7 Deletion Queue: why asynchronous cleanup
DROPping an Iceberg table, replacing old files during VACUUM, orphans left
behind by a rolled-back transaction — all of these need deletions against
object storage.
**Why not delete synchronously**: a single Iceberg table can reference tens of
thousands to millions of files. Synchronous deletion inside the transaction
would make DDL block for a long time, and a mid-way failure would leave the
system in a "metadata gone, files stranded" inconsistent state.
**The design**: an `iceberg.pg_iceberg_deletion_queue` system table plus a
background task.
- DROP: just enqueue the metadata_location (`DELETION_TYPE_METADATA`);
- VACUUM: enqueue the paths of old data files that were replaced
(`DELETION_TYPE_FILE`);
- The background task polls the queue, expands the referenced files from
metadata, and deletes them in batches;
- Failed entries get `retry_count++` and are retried later, giving idempotency.
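Since the queue is an ordinary system table, pending cleanup work can be inspected with plain SQL (columns per the planned schema in §12.2):
```sql
-- Pending cleanup work, oldest first
SELECT path, deletion_type, retry_count, orphaned_at
FROM iceberg.pg_iceberg_deletion_queue
ORDER BY orphaned_at
LIMIT 10;
```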
## 6. End-to-End Flows
Execution paths for each key SQL under this design.
### CREATE ICEBERG TABLE
1. PG core performs the CREATE, inserting into `pg_class / pg_attribute /
pg_lake_table`;
2. An `OAT_POST_CREATE` hook on the QD calls the agent's `/iceberg/tables` to
produce the initial metadata.json;
3. The returned metadata_location is written into `iceberg.pg_iceberg_metadata`.
### SELECT
1. The planner calls AM's `scan_get_am_private` and obtains the
metadata_location "that this scan should see" (an already-modified table
triggers one rebase);
2. The QD calls the agent's `/fragments` (with pushdown predicates) and
receives `List<FileScanTask>`;
3. The fragment list is passed through ForeignScan plan; QEs pick up their
share by `segindex`;
4. Each QE calls the Provider to read Parquet, applying the delete index to
skip marked-deleted rows.
### INSERT / UPDATE / DELETE
1. QE calls Volume FDW + Provider to write data files (and, for UPDATE /
DELETE, position-delete files);
2. QE returns file-metadata JSON to the QD;
3. QD calls `tracker.apply_updates_with_rebase`:
- Read latest metadata from Catalog; decide whether rebase is needed;
- Accumulate into the tracker's `data_files / delete_files`;
- Call the agent's `/modify` to generate a new intermediate metadata.json.
4. At COMMIT, `tracker_commit_all` performs the CAS for every modified table.
### VACUUM
1. QD calls the agent's `/plan-rewrite` and receives a rewrite plan (groups
built from min-input-files + target-file-size);
2. QEs each process one group: read old files + write one larger file;
3. QD collects results and calls the agent's `/commit-rewrite` to commit a
RewriteFiles snapshot;
4. The paths of the replaced old files are enqueued into the deletion queue.
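An illustrative session, assuming the compaction GUCs from §12.1 are settable per session and `sales` is an Iceberg table:
```sql
-- Tune compaction for this session, then compact the table
SET datalake.iceberg_vacuum_compact_min_input_files = 20;
SET datalake.iceberg_vacuum_rewrite_target_file_size_mb = 256;
VACUUM sales;   -- plan-rewrite → QEs rewrite groups → commit-rewrite → enqueue old files
```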
### DROP
1. The `OAT_DROP` hook enqueues the metadata_location into the deletion queue;
2. The row in `pg_iceberg_metadata` is removed;
3. The background cleanup task expands all files referenced by the metadata and
deletes them in batches.
## 7. MPP Execution Model
The responsibilities are divided as follows under MPP.
### 7.1 QD vs QE responsibilities
| Responsibility | QD | QE |
|----------------|:--:|:--:|
| Call the agent (create / plan / commit) | ✓ | |
| Metadata Tracker | ✓ | |
| Fragment dispatch | ✓ | |
| Data-file read / write | | ✓ |
| Position-delete read / write | | ✓ |
| Writes to the deletion queue | ✓ | |
**Principle**: only the QD talks to the agent. Letting N QEs hit the agent in
parallel would both make the agent a bottleneck and introduce concurrent writes
to Iceberg snapshot state, which brings its own complexity. The parallel part
is the data I/O.
### 7.2 Fragment dispatch
The QD places `List<FileScanTask>` into the plan tree; it is serialized and
dispatched to QEs. Each QE picks its fragments round-robin by `segindex %
segcount`.
The GUC `datalake.external_table_limit_segment_num` can cap the number of
segments that participate in a scan — useful when joining with small tables to
reduce dispatch overhead.
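Illustrative usage (`sales` is an Iceberg table and `dim_region` a hypothetical small dimension table):
```sql
-- Limit the Iceberg scan to 4 segments for a join against a small table
SET datalake.external_table_limit_segment_num = 4;
SELECT s.region, count(*)
FROM sales s
JOIN dim_region d ON s.region = d.region
GROUP BY s.region;
RESET datalake.external_table_limit_segment_num;
```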
### 7.3 Global file-id consistency
`UPDATE / DELETE` plans may include a **Redistribute Motion** that ships a row
from QE-i to QE-j. QE-j, when it later dereferences the ctid, must still be
able to resolve it back to its original file.
Under this design, ctids are encoded as `<file_id, row_pos>`. To let any QE
resolve a ctid from any origin, `BeginForeignModify` pre-populates a **global**
file-id map using the **full** fragment list (not just the subset assigned to
the current QE).
## 8. Pushdown & Optimization
WHERE clauses are translated through `deparse.c` into the agent's FilterNode
tree; the agent then converts that into an Iceberg `Expression`, applying
**partition pruning + manifest min/max filtering** at `planFiles` time.
Operators planned for pushdown: `=, !=, >, <, >=, <=, IS [NOT] NULL, LIKE, IN,
AND, OR`.
The Provider C++ layer then applies **row-group filtering + residual predicates
+ column projection**.
A fragment cache (GUC `datalake.enable_iceberg_fragment_cache`, default `on`)
caches `metadata_location + filter` → plan result within a single backend,
avoiding repeated trips to the agent.
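An illustrative query whose predicates all fall within the pushdown-eligible operator set, plus the debugging switch:
```sql
-- All three predicates use pushdown-eligible operators, so partition pruning and
-- manifest min/max filtering can apply at planFiles time (columns are illustrative)
SELECT id, amount
FROM sales
WHERE sold_at >= DATE '2024-01-01'
  AND region IN ('north', 'south')
  AND amount IS NOT NULL;

-- Turn pushdown off per session when debugging plan-file results
SET datalake.disable_filter_pushdown = on;
```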
## 9. Concurrency with External Engines
Community Iceberg engines (Spark / Trino / …) may write the same table
concurrently. Under this design:
- When an external engine commits, it changes the Catalog's metadata_location;
- The next CB DML's rebase will notice `global != last_base` and replan
(accumulated files are reapplied on top of the new global);
- If replay hits an incompatible evolution (e.g. column-type conflict) → the
agent raises an error → PG aborts the transaction and asks the user to retry.
## 10. Extensibility
**New Catalog type** (Nessie / Glue / in-house):
- Add the corresponding Iceberg `Catalog` construction on the agent side;
- Add a new `type` branch on the PG side.
Because all Iceberg semantics live in the agent, the PG-side change is minimal.
**New storage backend**:
- Add a FileSystem implementation inside libgopher;
- Have the Volume FDW recognize the new `type` and handle its connection
parameters.
**New DML shapes** (MERGE / UPSERT): mostly planner work; the underlying "write
data file + write position-delete" primitives can be reused.
## 11. Outside the First Release (follow-up work)
Items the first release will not cover and that will be discussed in later
iterations:
- Only identity partitioning is planned; bucket / truncate / hour transforms
are not supported;
- No partition-spec evolution;
- No Branch / Tag / Time Travel queries;
- Equality deletes are read-only;
- Concurrency only at Read Committed;
- The agent is single-instance by design; production deployments that need redundancy are expected to run multiple agent instances behind a reverse proxy themselves;
- ANALYZE relies on record_count / bytes returned by the agent and is not
deeply integrated with PG's column statistics;
- When an entire data file is deleted, the first release still writes a
position-delete file and relies on a later VACUUM for cleanup — there is room
for optimization here.
## 12. Appendix
### 12.1 Key GUCs (planned)
| GUC | Default | Description |
|-----|---------|-------------|
| `iceberg_default_catalog` | `''` | default Catalog |
| `iceberg_default_volume` | `''` | default Volume |
| `datalake_agent_server_url` | — | agent endpoint |
| `datalake.enable_iceberg_fragment_cache` | `on` | enable fragment cache |
| `datalake.iceberg_vacuum_compact_min_input_files` | `10` | min input files to trigger VACUUM compaction |
| `datalake.iceberg_vacuum_rewrite_target_file_size_mb` | `512` | VACUUM target file size (MB) |
| `datalake.iceberg_postion_deletes_threshold` | `100000` | position-delete threshold |
| `datalake.external_table_limit_segment_num` | `0` | cap on segments participating in a scan (0 = no cap) |
| `datalake.disable_filter_pushdown` | `off` | disable predicate pushdown (for debugging) |
| `datalake.iceberg_autovacuum` | `off` | enable autovacuum (requires restart) |
| `datalake.iceberg_autovacuum_naptime` | `600` | autovacuum interval (seconds) |
### 12.2 New system tables (planned)
**`iceberg.pg_iceberg_metadata`** — current metadata location for each Iceberg
table
| Column | Type | Description |
|--------|------|-------------|
| `relid` | oid | LakeTable OID (primary key) |
| `metadata_location` | text | current metadata.json path |
| `previous_metadata_location` | text | previous version (used for CAS) |
| `is_internal` | bool | whether this is a Builtin Catalog table |
| `default_spec_id` | int4 | default partition spec |
**`iceberg.pg_iceberg_deletion_queue`** — queue of files to be cleaned up
| Column | Type | Description |
|--------|------|-------------|
| `path` | text | path to delete (primary key) |
| `table_name` | oid | originating table OID |
| `orphaned_at` | timestamptz | time enqueued |
| `retry_count` | int4 | retry count |
| `deletion_type` | int4 | `0 = FILE` / `1 = METADATA` |
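A DDL sketch of the two tables (the authoritative definitions will live in the extension script; constraints and defaults below are illustrative):
```sql
CREATE TABLE iceberg.pg_iceberg_metadata (
    relid                      oid  PRIMARY KEY,  -- LakeTable OID
    metadata_location          text NOT NULL,     -- current metadata.json path
    previous_metadata_location text,              -- previous version, used for CAS
    is_internal                bool NOT NULL,     -- Builtin Catalog table?
    default_spec_id            int4 NOT NULL      -- default partition spec
);

CREATE TABLE iceberg.pg_iceberg_deletion_queue (
    path          text PRIMARY KEY,               -- file or metadata.json to delete
    table_name    oid  NOT NULL,                  -- originating table OID
    orphaned_at   timestamptz NOT NULL DEFAULT now(),
    retry_count   int4 NOT NULL DEFAULT 0,
    deletion_type int4 NOT NULL                   -- 0 = FILE, 1 = METADATA
);
```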
### 12.3 Planned code layout
```
contrib/datalake_fdw/
├── src/am_iceberg/            Iceberg Table AM + Metadata Tracker + DDL hook
├── src/iceberg_catalog_fdw/   Catalog FDW (Polaris / Hive / Builtin)
├── src/iceberg_volume_fdw/    Volume FDW (S3 / HDFS)
├── src/provider/iceberg/      Provider C++ (Parquet I/O, delete handling)
├── src/components/agent_cli/  agent gRPC client
└── docs/                      this document
contrib/datalake_proxy/        PG bgworker that launches and supervises the agent jar
contrib/datalake_agent/        Java Spring Boot, wraps iceberg-java
```
---
**Suggested review focus**:
1. Whether the four-layer split (AM / Catalog FDW / Volume FDW / Agent) is
sound;
2. The trade-off of a dedicated Java service for metadata vs. a pure C
implementation;
3. Whether the `datalake_proxy` bgworker process model is the right way to host
the Java agent;
4. The evolution path and compatibility story of the RPC protocol (REST first,
gRPC later);
5. Correctness of the Metadata Tracker's rebase + CAS strategy under Read
Committed and SAVEPOINT;
6. The MPP division of responsibilities: "agent is only talked to by the QD;
data I/O is parallelized on QEs";
7. The necessity of the Builtin Catalog as a metadata fallback;
8. Whether splitting Catalog and Volume into two FDWs is over-abstraction;
9. The extension path for partition evolution / Branch / equality deletes.
### Rollout/Adoption Plan
_No response_
### Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
GitHub link: https://github.com/apache/cloudberry/discussions/1683