flyrain commented on code in PR #3804:
URL: https://github.com/apache/polaris/pull/3804#discussion_r2825082926
##########
blog-hudi-polaris.md:
##########
@@ -0,0 +1,280 @@
+# Using Apache Polaris as a Catalog for Apache Hudi Tables
+
+## What is Apache Hudi?
+
+[Apache Hudi](https://hudi.apache.org/) (Hadoop Upserts Deletes and Incrementals) is an open-source
+data lakehouse platform that brings database-like capabilities to the data lake. Hudi was originally
+created at Uber to solve large-scale streaming ingestion and is widely adopted across the industry.
+
+Key differentiators:
+
+- **Incremental processing** — first-class support for incremental reads and writes, enabling
+  efficient pipelines that only process changed data.
+- **Table types** — Merge-on-Read (MOR) tables optimise for write-heavy workloads; Copy-on-Write
+  (COW) tables optimise for read-heavy workloads.
+- **Built-in table services** — clustering, compaction, and cleaning run as managed services,
+  reducing operational overhead.
+- **Record-level change data capture (CDC)** — track inserts, updates, and deletes at the record
+  level for downstream consumers.
+- **Near-real-time upserts** — streaming ingestion with sub-minute latency for mutable datasets.
+
+## What is Apache Polaris (and why catalogs matter)?
+
+[Apache Polaris](https://polaris.apache.org/) is an open catalog service that implements the
+Apache Iceberg REST Catalog protocol and also provides a **Generic Tables API** for other popular
+table formats such as Apache Hudi and Delta Lake. A catalog provides centralised metadata
+management, engine-agnostic table discovery, and a single place to enforce role-based access
+control (RBAC). Instead of every engine maintaining its own pointer to every table, engines ask
+the catalog: "where is table X, and am I allowed to read the underlying metadata and data files
+that belong to the table?"
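To make the "engines ask the catalog" idea concrete, here is a minimal sketch of building (not sending) a lookup request against the Generic Tables endpoint. The base URL, catalog name, namespace, table name, and token are all placeholders, and the exact path prefix should be verified against the Polaris version in use:

```python
import urllib.request

# Hypothetical local Polaris endpoint; adjust to your deployment.
POLARIS_BASE = "http://localhost:8181/api/catalog"

def generic_table_request(catalog: str, namespace: str,
                          table: str, token: str) -> urllib.request.Request:
    """Build a GET request for a generic table's catalog entry.

    The path shape follows the Generic Tables API convention; treat the
    exact prefix as an assumption to check against your Polaris release.
    """
    url = (f"{POLARIS_BASE}/polaris/v1/{catalog}"
           f"/namespaces/{namespace}/generic-tables/{table}")
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"})

req = generic_table_request("my_catalog", "analytics", "trips_hudi", "dummy-token")
print(req.full_url)
```

Sending the request (and parsing the JSON response that describes where the table's metadata lives) is left out here; the point is that every engine resolves the table through the same catalog endpoint rather than through its own hard-coded path.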
+
+Under the hood, the Polaris Spark plugin (`SparkCatalog`) detects when a table uses the Hudi
+provider and delegates Hudi-specific operations — for example, creating the `.hoodie` metadata
+directory — to Hudi's Spark catalog implementation (`HoodieCatalog`), while persisting the
+table's catalog entry through the Polaris REST API.
+
+Below we will run through a simple example of how users can leverage these technologies to
+build a governed lakehouse.
+
+## Prerequisites
+
+| Requirement | Version | Notes |
+|---|---|---|
+| Java | 17+ | Required by Polaris server |

Review Comment:
   I think Polaris requires 21+
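For orientation, a Spark session wired to Polaris as described above might be launched roughly like this. This is a configuration sketch only: the package coordinates, catalog name (`polaris`), URI, warehouse name, and credential values are placeholders to be replaced with the versions and settings of the actual deployment:

```shell
# Illustrative only: package versions, URIs, and credentials are placeholders.
spark-sql \
  --packages org.apache.polaris:polaris-spark-3.5_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
  --conf spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog \
  --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
  --conf spark.sql.catalog.polaris.warehouse=my_catalog \
  --conf spark.sql.catalog.polaris.credential=CLIENT_ID:CLIENT_SECRET
```

With a session like this, `CREATE TABLE ... USING hudi` statements against the `polaris` catalog would follow the delegation path the paragraph above describes: Hudi-specific filesystem work goes through `HoodieCatalog`, while the table entry itself is registered with Polaris.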
