flyrain commented on code in PR #3804:
URL: https://github.com/apache/polaris/pull/3804#discussion_r2825082926
##########
blog-hudi-polaris.md:
##########
@@ -0,0 +1,280 @@
+# Using Apache Polaris as a Catalog for Apache Hudi Tables
+
+## What is Apache Hudi?
+
+[Apache Hudi](https://hudi.apache.org/) (Hadoop Upserts Deletes and Incrementals) is an open-source
+data lakehouse platform that brings database-like capabilities to the data lake. Hudi was originally
+created at Uber to solve large-scale streaming ingestion and is widely adopted across the industry.
+
+Key differentiators:
+
+- **Incremental processing** — first-class support for incremental reads and writes, enabling
+  efficient pipelines that only process changed data.
+- **Table types** — Merge-on-Read (MOR) tables optimise for write-heavy workloads; Copy-on-Write
+  (COW) tables optimise for read-heavy workloads.
+- **Built-in table services** — clustering, compaction, and cleaning run as managed services,
+  reducing operational overhead.
+- **Record-level change data capture (CDC)** — track inserts, updates, and deletes at the record
+  level for downstream consumers.
+- **Near-real-time upserts** — streaming ingestion with sub-minute latency for mutable datasets.
+
+## What is Apache Polaris (and why catalogs matter)?
+
+[Apache Polaris](https://polaris.apache.org/) is an open catalog service that implements the
+Apache Iceberg REST Catalog protocol and also provides a **Generic Tables API** for other popular
+table formats such as Apache Hudi and Delta Lake. A catalog provides centralised metadata
+management, engine-agnostic table discovery, and a single place to enforce role-based access
+control (RBAC). Instead of every engine maintaining its own pointer to every table, engines ask
+the catalog: "where is table X, and am I allowed to read the underlying metadata and data files
+that belong to the table?"
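To make the "engines ask the catalog" idea concrete, here is a minimal sketch of building (not sending) a lookup request against the Generic Tables endpoint. The base URL, catalog name, namespace, table name, and token are all placeholders, and the exact path prefix should be verified against the Polaris version in use:

```python
import urllib.request

# Hypothetical local Polaris endpoint; adjust to your deployment.
POLARIS_BASE = "http://localhost:8181/api/catalog"

def generic_table_request(catalog: str, namespace: str,
                          table: str, token: str) -> urllib.request.Request:
    """Build a GET request for a generic table's catalog entry.

    The path shape follows the Generic Tables API convention; treat the
    exact prefix as an assumption to check against your Polaris release.
    """
    url = (f"{POLARIS_BASE}/polaris/v1/{catalog}"
           f"/namespaces/{namespace}/generic-tables/{table}")
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"})

req = generic_table_request("my_catalog", "analytics", "trips_hudi", "dummy-token")
print(req.full_url)
```

Sending the request (and parsing the JSON response that describes where the table's metadata lives) is left out here; the point is that every engine resolves the table through the same catalog endpoint rather than through its own hard-coded path.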
+
+Under the hood, the Polaris Spark plugin (`SparkCatalog`) detects when a table uses the Hudi
+provider and delegates Hudi-specific operations — for example, creating the `.hoodie` metadata
+directory — to Hudi's Spark catalog implementation (`HoodieCatalog`), while persisting the
+table's catalog entry through the Polaris REST API.
+
+Below we will run through a simple example of how users can leverage these technologies to
+build a governed lakehouse.
+
+## Prerequisites
+
+| Requirement | Version | Notes |
+|---|---|---|
+| Java | 17+ | Required by Polaris server |

Review Comment:
   I think Polaris requires 21+
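For orientation, a Spark session wired to Polaris as described above might be launched roughly like this. This is a configuration sketch only: the package coordinates, catalog name (`polaris`), URI, warehouse name, and credential values are placeholders to be replaced with the versions and settings of the actual deployment:

```shell
# Illustrative only: package versions, URIs, and credentials are placeholders.
spark-sql \
  --packages org.apache.polaris:polaris-spark-3.5_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
  --conf spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog \
  --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
  --conf spark.sql.catalog.polaris.warehouse=my_catalog \
  --conf spark.sql.catalog.polaris.credential=CLIENT_ID:CLIENT_SECRET
```

With a session like this, `CREATE TABLE ... USING hudi` statements against the `polaris` catalog would follow the delegation path the paragraph above describes: Hudi-specific filesystem work goes through `HoodieCatalog`, while the table entry itself is registered with Polaris.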
