Re: [PR] Design: AWS Glue Data Catalog connector design document [gravitino]

via GitHub Thu, 02 Apr 2026 01:08:52 -0700


jerryshao commented on code in PR #10539:
URL: https://github.com/apache/gravitino/pull/10539#discussion_r3026560004



##########
design/gravitino-glue-catalog.md:
##########
@@ -0,0 +1,660 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Design: AWS Glue Data Catalog Support for Apache Gravitino (Alternative A: Unified Catalog)
+
+## 1. Problem Statement and Goals
+
+### 1.1 Problem
+
+**Gravitino currently cannot federate AWS Glue Data Catalog.** This is a 
significant gap because:
+
+1. **Large user base on AWS**: The majority of cloud-native data lakes run on 
AWS with Glue Data Catalog as the central metadata service (default for Athena, 
Redshift Spectrum, EMR, Lake Formation). These organizations cannot bring their 
Glue metadata into Gravitino's unified management layer.
+2. **No native integration path**: The only workaround is pointing Gravitino's Hive catalog at Glue's HMS-compatible Thrift endpoint (`metastore.uris = thrift://...`), which is undocumented, region-limited, and cannot leverage Glue-native features (catalog ID, cross-account access, VPC endpoints).
+3. **Competitive landscape**: Dremio, Trino, Spark, and Athena all support 
Glue with a single unified connection that presents all table types. Users 
expect the same from Gravitino.
+
+### 1.2 Goals
+
+After this feature is implemented:
+
+1. **Register AWS Glue Data Catalog in Gravitino with a single catalog**:
+   ```bash
+   gcli catalog create --name my_glue --provider glue \
+     --properties aws-region=us-east-1,aws-glue-catalog-id=123456789012
+   ```
+
+2. **All table types in the Glue catalog are visible through a single 
Gravitino catalog**:
+   ```bash
+   gcli schema list --catalog my_glue
+   # Returns all Glue databases
+
+   gcli table list --catalog my_glue --schema my_database
+   # Returns Hive tables, Iceberg tables, Delta tables, Parquet tables (everything)
+   ```
+
+3. **AWS-native authentication**: static credentials, or default credential 
chain (environment variables, instance profile, container credentials).
+
+4. **Metadata preservation**: Glue table parameters (`table_type`, 
`metadata_location`, `spark.sql.sources.provider`, etc.) pass through 
Gravitino's API layer intact, so downstream tools can correctly identify table 
formats.
+
+---
+
+## 2. Background
+
+### 2.1 AWS Glue Data Catalog
+
+AWS Glue Data Catalog is a managed metadata repository storing:
+
+- **Databases** — logical groupings, equivalent to Gravitino schemas.
+- **Tables** — metadata records containing column definitions, storage 
descriptors, partition keys, and user-defined parameters.
+- **Views** — virtual tables defined by SQL. Glue stores them like tables with 
a special `TableType=VIRTUAL_VIEW` field.
+
+Tables in a single Glue catalog are heterogeneous — they coexist in the same 
database regardless of format:
+
+| Format | How Glue Stores It |
+|---|---|
+| **Hive** | Full metadata in `StorageDescriptor` (columns, SerDe, InputFormat, OutputFormat, location). The majority of tables in most Glue catalogs. |
+| **Iceberg** | `Parameters["table_type"] = "ICEBERG"` and `Parameters["metadata_location"]` pointing to Iceberg metadata JSON. `StorageDescriptor.Columns` is typically empty. |
+| **Delta Lake** | `Parameters["spark.sql.sources.provider"] = "delta"` and `Parameters["location"]`. |
+| **Parquet/CSV/ORC** | `StorageDescriptor` with appropriate `InputFormat` / `OutputFormat`. |
+
+A complete Glue integration must list all of these table types without 
filtering.
+
+### 2.2 How Query Engines Use Glue
+
+Both Trino and Spark have native Glue support — they call the AWS Glue SDK 
directly, not via HMS Thrift:
+
+| Engine | Hive Tables on Glue | Iceberg Tables on Glue |
+|---|---|---|
+| **Trino** | Hive connector with `hive.metastore=glue` | Iceberg connector with `iceberg.catalog.type=glue` |
+| **Spark** | Hive catalog with `AWSGlueDataCatalogHiveClientFactory` | Iceberg catalog with `catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog` |
+
+### 2.3 Gravitino's Catalog Plugin Architecture
+
+Gravitino loads catalogs as plugins at runtime. Each catalog is a 
self-contained Gradle module that provides:
+
+- A `BaseCatalog<T>` subclass registered via 
`META-INF/services/org.apache.gravitino.CatalogProvider`.
+- A `CatalogOperations` implementation that handles the actual metadata calls.
+- Property metadata classes describing the catalog's configuration schema.
+
+The `BaseCatalog.newOps(config)` factory method creates the 
`CatalogOperations` instance. Once initialized, upstream code interacts only 
with the `SupportsSchemas` and `TableCatalog` interfaces — the underlying 
implementation is completely opaque to the caller.
+
+---
+
+## 3. Design Alternatives
+
+### Alternative A: New `catalog-glue` Module (This Document)
+
+Create a standalone `catalogs/catalog-glue/` that calls the AWS Glue SDK 
directly. A single Gravitino catalog backed by `provider=glue` exposes all 
table types from a Glue Data Catalog.
+
+**Pros**:
+- Single catalog per Glue Data Catalog — matches how Dremio, Athena, and other 
tools work.
+- Full control over Glue-specific behavior (pagination, catalog ID, VPC 
endpoints).
+- No filtering: all table types are visible; `table_type` and 
`metadata_location` pass through intact.
+- Clean foundation for Phase 2 query engine integration.
+
+**Cons**:
+- More new code than Alternative B.
+- Phase 2 query engine mixed-table support requires additional work (see 
Section 6).
+
+### Alternative B: Glue as a Metastore Type (Rejected)
+
+Extend existing Hive and Iceberg catalogs with `metastore-type=glue` / 
`catalog-backend=glue`. Users create two separate Gravitino catalogs to cover 
Hive and Iceberg tables from the same Glue Data Catalog.
+
+**Why rejected**: Industry standard (Dremio, Athena, AWS console) is one 
connection = all table types. Requiring two catalogs confuses users and 
diverges from the expected experience.
+
+| Dimension | Alternative A | Alternative B |
+|---|---|---|
+| Catalogs to register | **1** per Glue Data Catalog | 2+ (one for Hive, one for Iceberg) |
+| Table visibility | **All formats in one view** | Filtered per catalog type |
+| User experience | Matches Dremio, Athena, AWS console | Diverges from industry standard |
+| Implementation scope | New module (~10 new files) | Modify existing files (~15 modifications) |
+| Metadata passthrough | **Explicit design goal** | Not addressed |
+| Long-term extensibility | Clean foundation for mixed-type routing | Requires two separate engine catalogs permanently |
+
+---
+
+## 4. Configuration Properties
+
+Glue is a separate AWS service from S3. The Glue region and credentials may 
differ from S3 storage credentials, so Glue properties use their own `aws-*` 
namespace:
+
+| Property | Required | Default | Description |
+|---|---|---|---|
+| `aws-region` | Yes | — | AWS region for the Glue Data Catalog |
+| `aws-access-key-id` | No | Default credential chain | AWS access key for Glue API authentication |
+| `aws-secret-access-key` | No | Default credential chain | AWS secret key for Glue API authentication |
+| `aws-glue-catalog-id` | Yes | — | Glue catalog ID. Required because an AWS account can have multiple Glue catalogs (e.g., default catalog and federated S3 Tables catalog). |
+| `aws-glue-endpoint` | No | AWS default regional endpoint | Custom Glue endpoint URL (for VPC endpoints or LocalStack testing). |
+| `default-table-format` | No | `iceberg` | Default format for tables created via Gravitino's `createTable()` API. Accepted values: `iceberg`, `hive`. |
+| `table-type-filter` | No | `all` | Comma-separated list of table types exposed by `listTables()` and `loadTable()`. Accepted values: `all`, `hive`, `iceberg`, `delta`, `parquet`. Use to restrict visible table types for backwards compatibility with existing systems that cannot handle mixed-format catalogs. |
+

Review Comment:
   Are these catalog properties? `aws-access-key-id` and `aws-secret-access-key` are sensitive and should not be visible to readers.
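
   One way to picture the concern, as a minimal sketch only: the property names below come from the quoted Section 4 table, while `PropertySpec`, its `sensitive` flag, and `visible_properties` are hypothetical names introduced for illustration and are not Gravitino's actual property-metadata API. The idea is that credentials are marked sensitive when the catalog's properties are declared, and are dropped from whatever is returned to a reader.

```python
from dataclasses import dataclass


# Hypothetical descriptor for one catalog property (an assumption for this
# sketch, not a Gravitino class).
@dataclass(frozen=True)
class PropertySpec:
    name: str
    required: bool = False
    sensitive: bool = False


# Names are taken from the quoted configuration table; which ones are
# flagged as sensitive follows the review comment.
GLUE_PROPERTIES = (
    PropertySpec("aws-region", required=True),
    PropertySpec("aws-glue-catalog-id", required=True),
    PropertySpec("aws-access-key-id", sensitive=True),
    PropertySpec("aws-secret-access-key", sensitive=True),
    PropertySpec("aws-glue-endpoint"),
)

SENSITIVE_KEYS = frozenset(p.name for p in GLUE_PROPERTIES if p.sensitive)


def visible_properties(props: dict[str, str]) -> dict[str, str]:
    """Return a copy of props with every sensitive entry dropped."""
    return {k: v for k, v in props.items() if k not in SENSITIVE_KEYS}


# Usage: the stored configuration keeps the credentials, the view shown to
# a reader does not.
stored = {
    "aws-region": "us-east-1",
    "aws-glue-catalog-id": "123456789012",
    "aws-access-key-id": "EXAMPLE_KEY_ID",
    "aws-secret-access-key": "EXAMPLE_SECRET",
}
shown = visible_properties(stored)
```

   Dropping the keys rather than masking their values is a design choice of this sketch; a masked placeholder would work equally well if readers need to know that static credentials are configured.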



