jerryshao commented on code in PR #10539: URL: https://github.com/apache/gravitino/pull/10539#discussion_r3026571457
########## design/gravitino-glue-catalog.md: ########## @@ -0,0 +1,660 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# Design: AWS Glue Data Catalog Support for Apache Gravitino (Alternative A — Unified Catalog) + +## 1. Problem Statement and Goals + +### 1.1 Problem + +**Gravitino currently cannot federate AWS Glue Data Catalog.** This is a significant gap because: + +1. **Large user base on AWS**: The majority of cloud-native data lakes run on AWS with Glue Data Catalog as the central metadata service (default for Athena, Redshift Spectrum, EMR, Lake Formation). These organizations cannot bring their Glue metadata into Gravitino's unified management layer. +2. **No native integration path**: The only workaround is pointing Gravitino's Hive catalog at Glue's HMS-compatible Thrift endpoint (`metastore.uris = thrift://...`), which is undocumented, region-limited, and cannot leverage Glue-native features (catalog ID, cross-account access, VPC endpoints). +3. **Competitive landscape**: Dremio, Trino, Spark, and Athena all support Glue with a single unified connection that presents all table types. Users expect the same from Gravitino. + +### 1.2 Goals + +After this feature is implemented: + +1. 
**Register AWS Glue Data Catalog in Gravitino with a single catalog**: + ```bash + gcli catalog create --name my_glue --provider glue \ + --properties aws-region=us-east-1,aws-glue-catalog-id=123456789012 + ``` + +2. **All table types in the Glue catalog are visible through a single Gravitino catalog**: + ```bash + gcli schema list --catalog my_glue + # Returns all Glue databases + + gcli table list --catalog my_glue --schema my_database + # Returns Hive tables, Iceberg tables, Delta tables, Parquet tables — everything + ``` + +3. **AWS-native authentication**: static credentials, or default credential chain (environment variables, instance profile, container credentials). + +4. **Metadata preservation**: Glue table parameters (`table_type`, `metadata_location`, `spark.sql.sources.provider`, etc.) pass through Gravitino's API layer intact, so downstream tools can correctly identify table formats. + +--- + +## 2. Background + +### 2.1 AWS Glue Data Catalog + +AWS Glue Data Catalog is a managed metadata repository storing: + +- **Databases** — logical groupings, equivalent to Gravitino schemas. +- **Tables** — metadata records containing column definitions, storage descriptors, partition keys, and user-defined parameters. +- **Views** — virtual tables defined by SQL. Glue stores them like tables with a special `TableType=VIRTUAL_VIEW` field. + +Tables in a single Glue catalog are heterogeneous — they coexist in the same database regardless of format: + +| Format | How Glue Stores It | +|---|---| +| **Hive** | Full metadata in `StorageDescriptor` (columns, SerDe, InputFormat, OutputFormat, location). The majority of tables in most Glue catalogs. | +| **Iceberg** | `Parameters["table_type"] = "ICEBERG"` and `Parameters["metadata_location"]` pointing to Iceberg metadata JSON. `StorageDescriptor.Columns` is typically empty. | +| **Delta Lake** | `Parameters["spark.sql.sources.provider"] = "delta"` and `Parameters["location"]`. 
| +| **Parquet/CSV/ORC** | `StorageDescriptor` with appropriate `InputFormat` / `OutputFormat`. | + +A complete Glue integration must list all of these table types without filtering. + +### 2.2 How Query Engines Use Glue + +Both Trino and Spark have native Glue support — they call the AWS Glue SDK directly, not via HMS Thrift: + +| Engine | Hive Tables on Glue | Iceberg Tables on Glue | +|---|---|---| +| **Trino** | Hive connector with `hive.metastore=glue` | Iceberg connector with `iceberg.catalog.type=glue` | +| **Spark** | Hive catalog with `AWSGlueDataCatalogHiveClientFactory` | Iceberg catalog with `catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog` | + +### 2.3 Gravitino's Catalog Plugin Architecture + +Gravitino loads catalogs as plugins at runtime. Each catalog is a self-contained Gradle module that provides: + +- A `BaseCatalog<T>` subclass registered via `META-INF/services/org.apache.gravitino.CatalogProvider`. +- A `CatalogOperations` implementation that handles the actual metadata calls. +- Property metadata classes describing the catalog's configuration schema. + +The `BaseCatalog.newOps(config)` factory method creates the `CatalogOperations` instance. Once initialized, upstream code interacts only with the `SupportsSchemas` and `TableCatalog` interfaces — the underlying implementation is completely opaque to the caller. + +--- + +## 3. Design Alternatives + +### Alternative A: New `catalog-glue` Module (This Document) + +Create a standalone `catalogs/catalog-glue/` that calls the AWS Glue SDK directly. A single Gravitino catalog backed by `provider=glue` exposes all table types from a Glue Data Catalog. + +**Pros**: +- Single catalog per Glue Data Catalog — matches how Dremio, Athena, and other tools work. +- Full control over Glue-specific behavior (pagination, catalog ID, VPC endpoints). +- No filtering: all table types are visible; `table_type` and `metadata_location` pass through intact. +- Clean foundation for Phase 2 query engine integration. 
+ +**Cons**: +- More new code than Alternative B. +- Phase 2 query engine mixed-table support requires additional work (see Section 6). + +### Alternative B: Glue as a Metastore Type (Rejected) + +Extend existing Hive and Iceberg catalogs with `metastore-type=glue` / `catalog-backend=glue`. Users create two separate Gravitino catalogs to cover Hive and Iceberg tables from the same Glue Data Catalog. + +**Why rejected**: Industry standard (Dremio, Athena, AWS console) is one connection = all table types. Requiring two catalogs confuses users and diverges from the expected experience. + +| Dimension | Alternative A | Alternative B | +|---|---|---| +| Catalogs to register | **1** per Glue Data Catalog | 2+ (one for Hive, one for Iceberg) | +| Table visibility | **All formats in one view** | Filtered per catalog type | +| User experience | Matches Dremio, Athena, AWS console | Diverges from industry standard | +| Implementation scope | New module (~10 new files) | Modify existing files (~15 modifications) | +| Metadata passthrough | **Explicit design goal** | Not addressed | +| Long-term extensibility | Clean foundation for mixed-type routing | Requires two separate engine catalogs permanently | + +--- + +## 4. Configuration Properties + +Glue is a separate AWS service from S3. The Glue region and credentials may differ from S3 storage credentials, so Glue properties use their own `aws-*` namespace: + +| Property | Required | Default | Description | +|---|---|---|---| +| `aws-region` | Yes | — | AWS region for the Glue Data Catalog | +| `aws-access-key-id` | No | Default credential chain | AWS access key for Glue API authentication | +| `aws-secret-access-key` | No | Default credential chain | AWS secret key for Glue API authentication | +| `aws-glue-catalog-id` | Yes | — | Glue catalog ID. Required because an AWS account can have multiple Glue catalogs (e.g., default catalog and federated S3 Tables catalog). 
| +| `aws-glue-endpoint` | No | AWS default regional endpoint | Custom Glue endpoint URL (for VPC endpoints or LocalStack testing). | +| `default-table-format` | No | `iceberg` | Default format for tables created via Gravitino's `createTable()` API. Accepted values: `iceberg`, `hive`. | +| `table-type-filter` | No | `all` | Comma-separated list of table types exposed by `listTables()` and `loadTable()`. Accepted values: `all`, `hive`, `iceberg`, `delta`, `parquet`. Use to restrict visible table types for backwards compatibility with existing systems that cannot handle mixed-format catalogs. | + +**Authentication priority**: Static credentials (`aws-access-key-id` + `aws-secret-access-key`) → Default credential chain (environment variables, instance profile, container credentials). STS AssumeRole (`aws-role-arn`) is a future enhancement — static credentials are sufficient for the initial release, including cross-account access. + +--- + +## 5. Server-Side Design: `catalog-glue` Module + +### 5.1 Module Structure + +``` +catalogs/catalog-glue/ +├── build.gradle.kts +└── src/ + └── main/ + ├── java/org/apache/gravitino/catalog/glue/ + │ ├── GlueCatalog.java # extends BaseCatalog<GlueCatalog> + │ ├── GlueCatalogOperations.java # CatalogOperations, SupportsSchemas, TableCatalog + │ ├── GlueCatalogPropertiesMetadata.java + │ ├── GlueSchemaPropertiesMetadata.java + │ ├── GlueTablePropertiesMetadata.java + │ ├── GlueCatalogCapability.java + │ ├── GlueSchema.java # Gravitino Schema implementation + │ ├── GlueTable.java # Gravitino Table implementation + │ └── GlueClientProvider.java # AWS SDK v2 GlueClient factory + └── resources/ + └── META-INF/services/ + └── org.apache.gravitino.CatalogProvider # = o.a.g.catalog.glue.GlueCatalog +``` + +### 5.2 AWS SDK Dependency + +Use **AWS SDK v2** (`software.amazon.awssdk:glue`) — consistent with existing S3, STS, IAM, and KMS dependencies in `gradle/libs.versions.toml`, all pinned to `awssdk = "2.29.52"`. 
+ +Add to `gradle/libs.versions.toml`: +```toml +aws-glue = { group = "software.amazon.awssdk", name = "glue", version.ref = "awssdk" } +``` + +`build.gradle.kts` dependencies: +```kotlin +dependencies { + implementation(libs.aws.glue) + implementation(libs.aws.sts) // For credential chain (already in version catalog) + compileOnly(project(":api")) + compileOnly(project(":core")) + compileOnly(project(":common")) +} +``` + +### 5.3 GlueClientProvider + +`GlueClientProvider` builds an authenticated `GlueClient` (AWS SDK v2) from catalog configuration: + +```java +public class GlueClientProvider { + + public static GlueClient buildClient(Map<String, String> config) { + GlueClientBuilder builder = GlueClient.builder() + .region(Region.of(config.get("aws-region"))); + + // Custom endpoint (VPC endpoint or LocalStack) + String endpoint = config.get("aws-glue-endpoint"); + if (endpoint != null) { + builder.endpointOverride(URI.create(endpoint)); + } + + // Static credentials (if provided), otherwise default credential chain + String accessKey = config.get("aws-access-key-id"); + String secretKey = config.get("aws-secret-access-key"); + if (accessKey != null && secretKey != null) { + builder.credentialsProvider( + StaticCredentialsProvider.create( + AwsBasicCredentials.create(accessKey, secretKey))); + } else { + builder.credentialsProvider(DefaultCredentialsProvider.create()); + } + + return builder.build(); + } +} +``` + +### 5.4 Schema Operations + +`GlueCatalogOperations` implements `SupportsSchemas` by mapping to Glue Database API: + +| Gravitino Operation | Glue API | +|---|---| +| `listSchemas(namespace)` | `GlueClient.getDatabases()` (paginated) | +| `createSchema(ident, comment, properties)` | `GlueClient.createDatabase(DatabaseInput)` | +| `loadSchema(ident)` | `GlueClient.getDatabase(name)` → `GlueSchema` | +| `alterSchema(ident, changes)` | `GlueClient.updateDatabase(name, DatabaseInput)` | +| `dropSchema(ident, cascade)` | If cascade: delete all tables first; 
then `GlueClient.deleteDatabase(name)` | +| `schemaExists(ident)` | `getDatabase()` + catch `EntityNotFoundException` | + +**Glue Database → Gravitino Schema mapping**: + +| Glue Field | Gravitino Field | +|---|---| +| `Database.name` | schema name | +| `Database.description` | schema comment | +| `Database.parameters` | schema properties | +| `Database.locationUri` | `"location"` property | + +### 5.5 Table Operations + +**Key design principle: present all table types by default, with opt-in filtering.** + +Unlike `HiveCatalogOperations` (which filters out Iceberg/Paimon/Hudi tables), `GlueCatalogOperations` returns every table in a Glue database by default. The `table-type-filter` catalog property allows restricting the visible table types for backwards compatibility with existing systems or older query engine versions that cannot handle mixed-format catalogs. + +#### Table Listing + +```java +@Override +public NameIdentifier[] listTables(Namespace namespace) throws NoSuchSchemaException { + String databaseName = namespace.level(namespace.length() - 1); + // Paginate through all tables — no type filter + List<Table> glueTables = new ArrayList<>(); + String nextToken = null; + do { + GetTablesResponse response = glueClient.getTables( + GetTablesRequest.builder() + .databaseName(databaseName) + .catalogId(catalogId) + .nextToken(nextToken) + .build()); + glueTables.addAll(response.tableList()); + nextToken = response.nextToken(); + } while (nextToken != null); + + return glueTables.stream() + .map(t -> NameIdentifier.of(namespace, t.name())) + .toArray(NameIdentifier[]::new); +} +``` + +#### Table Loading and Type Detection + +When loading a table, detect the format from Glue table parameters and map accordingly: + +```java +@Override +public Table loadTable(NameIdentifier ident) throws NoSuchTableException { + software.amazon.awssdk.services.glue.model.Table glueTable = + getGlueTable(ident); // calls GlueClient.getTable() + + String tableType = 
glueTable.parameters().getOrDefault("table_type", "").toUpperCase(); + + switch (tableType) { + case "ICEBERG": + return buildIcebergProxyTable(glueTable); // preserves metadata_location + default: + return buildHiveFormatTable(glueTable); // maps StorageDescriptor → columns + } +} +``` + +`buildHiveFormatTable()` maps `StorageDescriptor.columns()` to Gravitino `Column[]` and storage properties. `buildIcebergProxyTable()` returns a `GlueTable` with the full `parameters()` map in `properties()` — `table_type` and `metadata_location` survive intact. + +#### Glue Table → Gravitino Table Mapping + +| Glue Field | Gravitino Field | Notes | +|---|---|---| +| `Table.name` | table name | | +| `Table.description` | table comment | | +| `StorageDescriptor.columns` | `Column[]` | For Hive-format tables | +| `Table.partitionKeys` | partition columns | | +| `StorageDescriptor.location` | `"location"` property | | +| `StorageDescriptor.serdeInfo.serializationLibrary` | `"serde-lib"` property | | +| `StorageDescriptor.inputFormat` | `"input-format"` property | | +| `StorageDescriptor.outputFormat` | `"output-format"` property | | +| `Table.parameters` | `properties()` (merged) | Includes `table_type`, `metadata_location`, etc. — all pass through | +| `Table.tableType` | `"external-table"` / `"managed-table"` property | `EXTERNAL_TABLE` vs `MANAGED_TABLE` | + +**Metadata passthrough guarantee**: Every key-value pair in `Table.parameters` is included in the Gravitino `Table.properties()` map unchanged. This ensures `table_type=ICEBERG`, `metadata_location=s3://...`, `spark.sql.sources.provider=delta`, and any other format indicators survive Gravitino's metadata proxy layer. 
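The format indicators described in Section 2.1 (`table_type` for Iceberg, `spark.sql.sources.provider` for Delta) can be exercised in isolation. Below is a minimal, self-contained sketch of that classification logic using a plain `Map` in place of the AWS SDK `Table` model; `detectFormat` is an illustrative helper, not a method of the proposed module:

```java
import java.util.HashMap;
import java.util.Map;

public class FormatDetection {

  /** Illustrative helper: classify a Glue table from its Parameters map. */
  static String detectFormat(Map<String, String> parameters) {
    // Iceberg marker: Parameters["table_type"] = "ICEBERG" (compared case-insensitively)
    if ("ICEBERG".equalsIgnoreCase(parameters.getOrDefault("table_type", ""))) {
      return "iceberg";
    }
    // Delta Lake marker written by Spark: Parameters["spark.sql.sources.provider"] = "delta"
    if ("delta".equalsIgnoreCase(
        parameters.getOrDefault("spark.sql.sources.provider", ""))) {
      return "delta";
    }
    // Everything else falls back to Hive / StorageDescriptor-based handling
    return "hive";
  }

  public static void main(String[] args) {
    Map<String, String> iceberg = new HashMap<>();
    iceberg.put("table_type", "ICEBERG");
    iceberg.put("metadata_location", "s3://bucket/events/metadata/v3.json");
    System.out.println(detectFormat(iceberg)); // iceberg

    Map<String, String> delta = new HashMap<>();
    delta.put("spark.sql.sources.provider", "delta");
    System.out.println(detectFormat(delta)); // delta

    System.out.println(detectFormat(new HashMap<>())); // hive
  }
}
```

Because the full `parameters()` map also rides along in `properties()`, this detection is purely a routing concern — downstream consumers can always re-derive the format themselves.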
+ +#### Table CRUD + +| Gravitino Operation | Glue API | Notes | +|---|---|---| +| `createTable(ident, columns, comment, properties, ...)` | `GlueClient.createTable(TableInput)` | Format determined by `default-table-format` (default: `iceberg`) | +| `alterTable(ident, changes)` | `GlueClient.updateTable(TableInput)` | | +| `dropTable(ident)` | `GlueClient.deleteTable(name)` | | +| `purgeTable(ident)` | `GlueClient.deleteTable(name)` | Phase 1: same as drop (no data deletion from S3 — query engine responsibility) | +| `tableExists(ident)` | `getTable()` + catch `EntityNotFoundException` | | + +**Default table format**: When `createTable()` is called without an explicit `table_type` property, `GlueCatalogOperations` uses the `default-table-format` catalog property to determine the format. The default is `iceberg` — `createTable()` builds an Iceberg-compatible `TableInput` (setting `Parameters["table_type"]="ICEBERG"` and `Parameters["metadata_location"]` to the initial metadata path). If `default-table-format=hive`, a standard Hive `StorageDescriptor`-based `TableInput` is produced instead. Users can also override per-table by setting `table_type` explicitly in the table properties passed to `createTable()`. + +### 5.6 Views + +Glue stores views as tables with `TableType=VIRTUAL_VIEW` and `ViewOriginalText` / `ViewExpandedText` fields. Gravitino's `ViewCatalog` interface is not yet fully implemented; how `catalog-glue` should expose Glue views will be determined once the Gravitino View API is in place. + +In Phase 1, views are hidden — `listTables()` and `loadTable()` filter out entries where `TableType=VIRTUAL_VIEW`, making them invisible through the Gravitino API. 
+ +### 5.7 Architecture Overview + +``` + Gravitino Server + | + provider=glue + | + GlueCatalogOperations + | + GlueClientProvider + | + AWS SDK v2 GlueClient + | + AWS Glue Data Catalog (us-east-1) + | + +-------+-------+-------+-------+ + | | | | | + Hive Iceberg Delta Parquet Views + tables tables tables tables + (StorageDescriptor) (table_type=ICEBERG) (all parameters pass through) +``` + +**End-to-end data flow**: + +``` +# Registration +gcli catalog create --name my_glue --provider glue \ + --properties aws-region=us-east-1,aws-glue-catalog-id=123456789012 + → GravitinoCatalogManager.createCatalog() + → CatalogPluginLoader loads catalog-glue plugin + → GlueCatalog.newOps(config) + → GlueCatalogOperations.initialize(config) + → GlueClientProvider.buildClient(config) → GlueClient + +# Metadata query +gcli table list --catalog my_glue --schema analytics + → GlueCatalogOperations.listTables(Namespace.of("my_glue", "analytics")) + → GlueClient.getTables(databaseName="analytics", catalogId="123456789012") + → Returns ALL non-view tables: [orders (Hive), events (Iceberg), sessions (Parquet), ...] + // VIRTUAL_VIEW entries are filtered out at this layer (see Section 5.6) + +gcli table details --catalog my_glue --schema analytics --table events + → GlueCatalogOperations.loadTable(NameIdentifier...) + → GlueClient.getTable("analytics", "events") + → Parameters["table_type"] = "ICEBERG" + → Returns GlueTable with properties(): + {"table_type": "ICEBERG", "metadata_location": "s3://bucket/events/metadata/...", ...} +``` + +--- + +## 6. Query Engine Integration (Phase 2) + +Phase 1 delivers full metadata API support — all table types are visible and queryable through Gravitino. Phase 2 extends query engine connectors so that engines can actually execute queries against mixed-format Glue databases through a single Gravitino catalog. + +### 6.1 Trino + +A Glue database typically contains Hive, Iceberg, Delta, and Parquet tables coexisting. 
Gravitino's Trino connector must route queries to the correct internal Trino connector implementation per table. Three approaches were evaluated. + +#### Approach 1: Per-Table Dispatch Inside GravitinoConnector + +Extend `GravitinoConnector` to hold two internal connector instances simultaneously (Hive and Iceberg). A new `GlueTableHandle` carries the table's format, and every connector method (`getMetadata`, `getSplitManager`, `getPageSourceProvider`, etc.) routes to the correct internal connector. + +**Pros**: Works on all Trino versions (435+); single Gravitino catalog maps to a single Trino catalog. + +**Cons**: High implementation cost (~9 new classes); `GlueMetadata` alone must override 20+ methods with identical routing boilerplate. Transaction lifecycle is complex — `beginTransaction` must open transactions on both connectors simultaneously. Session properties and table properties from Hive and Iceberg connectors may conflict. Every Gravitino Trino version subproject (435–478+) must carry its own copy. Essentially reimplements what Trino already provides natively in its Lakehouse connector. + +#### Approach 2: Trino Lakehouse Connector (Recommended) + +The Trino [Lakehouse connector](https://trino.io/docs/current/connector/lakehouse.html) (`connector.name=lakehouse`) natively handles all table formats (Hive, Iceberg, Delta Lake, Hudi) through a single connector instance, with AWS Glue as a supported metastore backend. 
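For context, configuring the Lakehouse connector against Glue directly in Trino (without Gravitino) takes only a short catalog properties file. The sketch below uses the property names already cited in this document; the file name, region, and catalog ID are example values:

```properties
# etc/catalog/glue.properties — standalone Trino Lakehouse catalog on Glue
connector.name=lakehouse
hive.metastore=glue
hive.metastore.glue.region=us-east-1
hive.metastore.glue.catalogid=123456789012
# Omit the two keys below to fall back to the default AWS credential chain
hive.metastore.glue.aws-access-key=<aws-access-key-id>
hive.metastore.glue.aws-secret-key=<aws-secret-access-key>
```

This is exactly the configuration the Gravitino integration needs to generate from the `glue` catalog's properties.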
A new `GlueConnectorAdapter` maps the Gravitino `glue` catalog to a Trino Lakehouse catalog: + +``` +Gravitino catalog (provider=glue) + → GlueConnectorAdapter.buildInternalConnectorConfig() + → { "connector.name": "lakehouse", + "hive.metastore": "glue", + "hive.metastore.glue.region": "us-east-1", + "hive.metastore.glue.catalogid": "123456789012", + "hive.metastore.glue.aws-access-key": "<aws-access-key-id>", + "hive.metastore.glue.aws-secret-key": "<aws-secret-access-key>" } + // aws-access-key-id / aws-secret-access-key omitted when using default credential chain + // aws-glue-endpoint maps to hive.metastore.glue.endpoint-url (VPC / LocalStack) + → Trino Lakehouse connector handles all table formats natively +``` + +**Pros**: Minimal Gravitino code (one new `GlueConnectorAdapter`, ~30 lines); no routing logic, no transaction coordination, no property conflicts. Single Gravitino catalog = single Trino catalog. Future table format additions are automatically supported as Trino extends the Lakehouse connector. + +**Cons**: Requires **Trino ≥ 477**. Gravitino currently supports Trino 435–478; versions 435–476 cannot use this approach. Review Comment: I think we can make this feature supported only in Trino 477 and above.
