jerryshao commented on code in PR #10724: URL: https://github.com/apache/gravitino/pull/10724#discussion_r3079443742
########## design-docs/gravitino-logical-view-management.md: ########## @@ -0,0 +1,1051 @@ +# Design of Logical View Management in Gravitino + +## Background + +In modern data lakehouse architectures, views serve as a fundamental abstraction for data access, security enforcement, and query simplification. Organizations leverage multiple query engines (Trino, Spark, Hive) to access the same underlying data, but view management across these heterogeneous systems presents significant challenges: + +- **Portability Gap**: A view created in Trino cannot be read by Spark, and vice versa, due to differences in SQL dialects and metadata storage formats. +- **Fragmented Governance**: Views are scattered across different metastores (HMS, Iceberg REST Catalog, engine-specific stores), making unified access control and auditing difficult. +- **Inconsistent Security**: Each engine implements its own security model (definer/invoker), leading to inconsistent access control behavior across the data platform. + +Apache Gravitino, as a unified metadata management system, is well-positioned to address these challenges by providing centralized view management with multi-engine compatibility. + +--- + +## Goals + +1. **Multi-Engine Compatibility**: Views managed by Gravitino are visible and manageable across engines. Multi-dialect SQL representation storage enables cross-engine view sharing. + +2. **Unified View Management**: Provide standard CRUD operations for views: + - Create view + - Get/List views + - Alter view (update SQL, add representations, modify properties) + - Drop view + +3. **Capability-Driven Storage Strategy**: Automatically select the optimal storage strategy based on each catalog's capabilities — no user-facing storage mode configuration needed. Gravitino transparently handles delegation, extension, and full management per catalog type. + +4. **Access Control Integration**: Integrate with Gravitino's existing access control framework to provide metadata-level privileges (CREATE_VIEW, SELECT_VIEW, DROP_VIEW). Data-level access control remains the responsibility of the underlying compute engines. + +5. **Audit Support**: View operations should be auditable with complete audit information. + +6. **Event System Integration**: View operations should emit events for users to hook into. + +--- + +## Non-Goals + +1. **Materialized Views**: This design focuses on logical views only. Materialized views with physical storage are out of scope. IRC-based materialized views are a planned follow-on that builds on the logical view infrastructure established here; they represent a meaningful product differentiator that no other open metadata catalog currently offers. + +2. **Temporary Views**: Session-scoped temporary views are managed by engines themselves and don't require persistent management. + +3. **SQL Transpilation**: Gravitino will not automatically convert SQL between dialects. Users are responsible for providing correct SQL representations for each target dialect. + +4. **Query Execution**: Gravitino manages view metadata only. Actual query execution is handled by the compute engines. + +--- + +## Proposal + +### Namespace + +Views are registered under a specified schema in relational catalogs, following the three-level namespace hierarchy: + +``` +metalake + └── catalog (relational) + └── schema + └── view +``` + +This is consistent with Gravitino's existing namespace design for tables and functions. **Views and tables share the same namespace within a schema** — a view and a table cannot have the same name under the same schema. This follows the standard behavior of most relational databases (MySQL, PostgreSQL, Hive, etc.). + +--- + +### View Metadata Model + +#### Core View Structure + +``` +View +├── name: string # View name (unique within schema, shared namespace with tables) +├── comment: string # Optional description +├── columns: array<ViewColumn> # View schema definition +│ └── ViewColumn +│ ├── name: string +│ ├── type: DataType +│ └── comment: string (optional) +├── representations: array<Representation> # Multi-dialect view definitions (one per dialect) +│ └── Representation +│ ├── type: string # Representation type, currently only "sql" +│ └── SQLRepresentation (type="sql") +│ ├── dialect: string # e.g., "trino", "spark", "hive" (unique within a view) +│ ├── sql: string # The view definition SQL +│ ├── defaultCatalog: string # Default catalog for unqualified refs +│ └── defaultSchema: string # Default schema for unqualified refs +├── securityConfig: SecurityConfig +│ └── securityMode: enum # DEFINER | INVOKER +├── properties: map<string, string> # Extensible key-value properties +└── auditInfo: AuditInfo # Creation/modification timestamps and users +``` + +**Field Details:** + +- **columns**: The view's output schema definition. Required. + - Typically provided by the engine connector, not by users directly — when a user executes `CREATE VIEW AS SELECT ...`, the engine infers the output schema and passes it to Gravitino. + - Gravitino does not parse SQL or validate column-SQL consistency. Mismatches surface at query time in the engine. + - Not auto-updated when underlying table schemas change — users must explicitly alter the view. + +- **representations**: The multi-dialect definitions of a view. At least one required; each dialect unique per view. + - Designed for extensibility via a `type` field. Currently only `type="sql"` (`SQLRepresentation`) is defined. + - Future versions may introduce other representation types (e.g., dataframe-based definitions) without breaking the existing model. + +- **dialect**: A free-form string identifying the SQL dialect (e.g., `"trino"`, `"spark"`, `"hive"`, `"flink"`). + - Not a fixed enum — new dialects can be added without API changes. + - Gravitino provides a set of standard dialect constants (e.g., `Dialects.TRINO`, `Dialects.SPARK`) for engine connectors to use, reducing the risk of typos while preserving extensibility. + - Engine connectors use this value to locate the appropriate representation when loading a view. + +- **defaultCatalog / defaultSchema**: The catalog and schema context in which the SQL was authored. Optional, per-representation. + - Used by engines to resolve unqualified table references (e.g., `FROM orders` → `FROM defaultCatalog.defaultSchema.orders`). + - View SQL may contain cross-catalog references (e.g., `catalog_a.schema.table JOIN catalog_b.schema.table`). The SQL is stored as-is; neither Gravitino, the IRC, nor the HMS validates, rewrites, or transforms view SQL at any point. The compute engine is responsible for resolving and executing cross-catalog queries at runtime. + +- **securityConfig**: Declares the security execution model of the view. This is a metadata property stored by Gravitino and **passed through to the compute engine** — Gravitino does not enforce it. Whether it takes effect depends on the engine's capability (e.g., MySQL natively supports DEFINER/INVOKER; Iceberg and Hive do not). + - `DEFINER`: the engine should execute the view query with the view owner's privileges. + - `INVOKER`: the engine should execute the view query with the querying user's privileges. + +### Capability-Driven Storage Strategy + +The storage strategy is driven by what users actually care about, rather than exposing implementation details: + +1. **Usability** — Can I create and manage views through Gravitino, regardless of the underlying catalog? +2. **Multi-engine** — Can my views be used by Spark, Trino, and Flink simultaneously? +3. **Interoperability** — Can I see pre-existing views in the underlying catalog? Can views created via Gravitino be seen by other tools that access the underlying catalog directly? + +Rather than exposing catalog differences to users via a storage mode configuration, Gravitino automatically selects the optimal strategy based on each catalog's native view capability. Catalogs fall into three tiers: + +- **Full view support** (Iceberg, Paimon) — Natively support multi-dialect view storage and CRUD operations. +- **Single-dialect view support** (HMS) — Can store views, but each engine uses its own private format. Gravitino provides format-aware delegation to normalize these differences. +- **Read-only view support** (JDBC catalogs) — Can store views, but engine dialect mismatch makes write operations impractical. Gravitino provides read-only discovery in v1. +- **No view support** (Hudi, Generic/Delta/Lance, etc.) — Do not support view operations at all. + +The table below maps each tier to the corresponding storage strategy: + +| Catalog Type | Strategy | Description | +|-------------|----------|-------------| +| **Iceberg** | Complete delegation | All CRUD delegated to Iceberg catalog (native multi-dialect support) | +| **Paimon** | Complete delegation | All CRUD delegated to Paimon catalog (native multi-dialect support, Hive/REST backends) | +| **HMS** | Format-aware delegation | Gravitino detects engine-specific formats, normalizes view definitions, and writes new views in Gravitino standard format (see [HMS View Behavior](#hms-view-behavior) for details) | +| **JDBC** (MySQL, PostgreSQL, Doris, StarRocks, etc.) | Read-only discovery (v1) | Pre-existing database views are auto-discovered and exposed through Gravitino; CREATE VIEW not supported in v1 (future enhancement) | +| **Catalogs without native view support** (Hudi, Generic/Delta/Lance, etc.) | Fully Gravitino-managed | All view metadata stored in Gravitino DB | Review Comment: You need to add Glue into consideration. One thing is that the current Glue catalog implementation is a multiple format catalog. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
