mchades commented on code in PR #10724:
URL: https://github.com/apache/gravitino/pull/10724#discussion_r3085046492

##########
design-docs/gravitino-logical-view-management.md:
##########
@@ -0,0 +1,1051 @@
+# Design of Logical View Management in Gravitino
+
+## Background
+
+In modern data lakehouse architectures, views serve as a fundamental abstraction for data access, security enforcement, and query simplification. Organizations leverage multiple query engines (Trino, Spark, Hive) to access the same underlying data, but view management across these heterogeneous systems presents significant challenges:
+
+- **Portability Gap**: A view created in Trino cannot be read by Spark, and vice versa, due to differences in SQL dialects and metadata storage formats.
+- **Fragmented Governance**: Views are scattered across different metastores (HMS, Iceberg REST Catalog, engine-specific stores), making unified access control and auditing difficult.
+- **Inconsistent Security**: Each engine implements its own security model (definer/invoker), leading to inconsistent access control behavior across the data platform.
+
+Apache Gravitino, as a unified metadata management system, is well-positioned to address these challenges by providing centralized view management with multi-engine compatibility.
+
+---
+
+## Goals
+
+1. **Multi-Engine Compatibility**: Views managed by Gravitino are visible and manageable across engines. Multi-dialect SQL representation storage enables cross-engine view sharing.
+
+2. **Unified View Management**: Provide standard CRUD operations for views:
+   - Create view
+   - Get/List views
+   - Alter view (update SQL, add representations, modify properties)
+   - Drop view
+
+3. **Capability-Driven Storage Strategy**: Automatically select the optimal storage strategy based on each catalog's capabilities — no user-facing storage mode configuration needed. Gravitino transparently handles delegation, extension, and full management per catalog type.
+
+4. **Access Control Integration**: Integrate with Gravitino's existing access control framework to provide metadata-level privileges (CREATE_VIEW, SELECT_VIEW, DROP_VIEW). Data-level access control remains the responsibility of the underlying compute engines.
+
+5. **Audit Support**: View operations should be auditable with complete audit information.
+
+6. **Event System Integration**: View operations should emit events for users to hook into.
+
+---
+
+## Non-Goals
+
+1. **Materialized Views**: This design focuses on logical views only. Materialized views with physical storage are out of scope. IRC-based materialized views are a planned follow-on that builds on the logical view infrastructure established here; they represent a meaningful product differentiator that no other open metadata catalog currently offers.
+
+2. **Temporary Views**: Session-scoped temporary views are managed by engines themselves and don't require persistent management.
+
+3. **SQL Transpilation**: Gravitino will not automatically convert SQL between dialects. Users are responsible for providing correct SQL representations for each target dialect.
+
+4. **Query Execution**: Gravitino manages view metadata only. Actual query execution is handled by the compute engines.
+
+---
+
+## Proposal
+
+### Namespace
+
+Views are registered under a specified schema in relational catalogs, following the three-level namespace hierarchy:
+
+```
+metalake
+ └── catalog (relational)
+     └── schema
+         └── view
+```
+
+This is consistent with Gravitino's existing namespace design for tables and functions. **Views and tables share the same namespace within a schema** — a view and a table cannot have the same name under the same schema. This follows the standard behavior of most relational databases (MySQL, PostgreSQL, Hive, etc.).
+
+---
+
+### View Metadata Model
+
+#### Core View Structure
+
+```
+View
+├── name: string                              # View name (unique within schema, shared namespace with tables)
+├── comment: string                           # Optional description
+├── columns: array<ViewColumn>                # View schema definition
+│   └── ViewColumn
+│       ├── name: string
+│       ├── type: DataType
+│       └── comment: string (optional)
+├── representations: array<Representation>    # Multi-dialect view definitions (one per dialect)
+│   └── Representation
+│       ├── type: string                      # Representation type, currently only "sql"
+│       └── SQLRepresentation (type="sql")
+│           ├── dialect: string               # e.g., "trino", "spark", "hive" (unique within a view)
+│           ├── sql: string                   # The view definition SQL
+│           ├── defaultCatalog: string        # Default catalog for unqualified refs
+│           └── defaultSchema: string         # Default schema for unqualified refs
+├── securityConfig: SecurityConfig

Review Comment:
   Makes sense. I've flattened `securityConfig`: `SecurityConfig` to a top-level `securityMode` field directly on the View model, since it currently only contains a single field. Updated all references throughout the document (data model, REST API, Java/Python APIs, Trino integration code, and the security description).


##########
design-docs/gravitino-logical-view-management.md:
##########
@@ -0,0 +1,1051 @@
+# Design of Logical View Management in Gravitino
+
+## Background
+
+In modern data lakehouse architectures, views serve as a fundamental abstraction for data access, security enforcement, and query simplification. Organizations leverage multiple query engines (Trino, Spark, Hive) to access the same underlying data, but view management across these heterogeneous systems presents significant challenges:
+
+- **Portability Gap**: A view created in Trino cannot be read by Spark, and vice versa, due to differences in SQL dialects and metadata storage formats.
+- **Fragmented Governance**: Views are scattered across different metastores (HMS, Iceberg REST Catalog, engine-specific stores), making unified access control and auditing difficult.
+- **Inconsistent Security**: Each engine implements its own security model (definer/invoker), leading to inconsistent access control behavior across the data platform.
+
+Apache Gravitino, as a unified metadata management system, is well-positioned to address these challenges by providing centralized view management with multi-engine compatibility.
+
+---
+
+## Goals
+
+1. **Multi-Engine Compatibility**: Views managed by Gravitino are visible and manageable across engines. Multi-dialect SQL representation storage enables cross-engine view sharing.

Review Comment:
   I've updated the Non-Goals section to acknowledge the ongoing Iceberg Materialized View effort. The current logical view infrastructure (multi-dialect representations, unified metadata model, catalog capability detection) is designed to serve as the foundation for future MV support. We'll take the Iceberg MV work into consideration when we start the MV design.


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
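
For reference, the flattened model described in the first review comment could look roughly like the sketch below. This is illustrative Python only, not Gravitino's actual API. The class names, the `security_mode` spelling, the `DEFINER`/`INVOKER` values (inferred from the "definer/invoker" wording in the design doc), and the plain string standing in for `DataType` are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class SecurityMode(Enum):
    # Assumed values, inferred from the "definer/invoker" wording in the doc.
    DEFINER = "definer"
    INVOKER = "invoker"


@dataclass(frozen=True)
class ViewColumn:
    name: str
    type: str  # hypothetical stand-in for the doc's DataType
    comment: Optional[str] = None


@dataclass(frozen=True)
class SQLRepresentation:
    dialect: str  # e.g. "trino", "spark", "hive"
    sql: str
    default_catalog: Optional[str] = None
    default_schema: Optional[str] = None
    type: str = "sql"  # the doc lists "sql" as the only representation type


@dataclass
class View:
    name: str
    columns: List[ViewColumn]
    representations: List[SQLRepresentation]
    # Top-level field replacing the former nested securityConfig object.
    security_mode: SecurityMode
    comment: Optional[str] = None

    def __post_init__(self) -> None:
        # The doc requires each dialect to be unique within a view.
        dialects = [rep.dialect for rep in self.representations]
        duplicates = sorted({d for d in dialects if dialects.count(d) > 1})
        if duplicates:
            raise ValueError(
                f"duplicate dialects in view {self.name!r}: {duplicates}"
            )

    def sql_for(self, dialect: str) -> Optional[str]:
        # Return the SQL text for one dialect, or None if it is not stored.
        for rep in self.representations:
            if rep.dialect == dialect:
                return rep.sql
        return None
```

With this shape, an engine connector would call something like `view.sql_for("trino")` and treat `None` as "no representation for my dialect", which matches the non-goal that Gravitino does not transpile SQL between dialects.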
