betodealmeida opened a new issue, #35003:
URL: https://github.com/apache/superset/issues/35003

   # SIP: Semantic Layer Support in Apache Superset
   
   ## Abstract
   
   This proposal introduces changes to Apache Superset to better support 
semantic layers and external data modeling systems. The changes include (1) the 
definition of an “Explorable” interface (a Python protocol), and (2) the 
introduction of a new class of connections for semantic layers and similar 
systems.
   
   ## Motivation
   
   Semantic layers are a powerful way to expose well-curated metrics and related dimensions to users, allowing for an improved user experience when presenting data: instead of focusing on datasets, semantic layers generally operate at a higher level of abstraction, exposing curated **metrics** as first-class citizens. Once a user has selected one or more metrics they're interested in, most modern semantic layers allow them to slice and dice those metrics by presenting associated **dimensions**, automatically performing joins between the underlying data sources. This workflow, where metrics are curated and their dimensions are freely available, allows users to focus on the metrics that matter to them, while providing confidence that the underlying data is correct and well defined.
   
   Because Superset is fundamentally dataset-centric, integrations with semantic layers have been timid so far. When they exist, **they usually represent the semantic layer as a pseudo-database**, exposing one or more **pseudo-datasets** that represent models in the semantic layer. [Cube](https://cube.dev/), for example, uses the Postgres wire protocol to expose cubes as pseudo-tables that can be queried in Superset. Minerva, an in-house semantic layer from Airbnb, uses a different approach, exposing all metrics and dimensions as a single dataset, [with custom overrides to indicate to the users which metrics and dimensions are compatible](https://medium.com/airbnb-engineering/supercharging-apache-superset-b1a2393278bd). Other experimental integrations (with MetricFlow and DataJunction) used a similar approach, though they haven't been contributed to OSS yet.
   
   There are a couple of limitations in Superset's architecture that create friction when integrating with semantic layers and modern data platforms:
   
   ### Datasets *are* the Superset Semantic Layer
   
   In order to explore data in Superset, we need some kind of semantic layer that tells us which columns are available, which ones can be filtered, which ones are groupable, and so on. In Superset, that semantic layer is the **dataset**, a thin abstraction that allows users to define metrics, declare derived columns, and add metadata that informs the UI and prevents expensive queries from running (group-bys on high-cardinality columns, for example).
   
   Because the dataset is the native semantic layer, adding a source from an external semantic layer as a dataset to Superset is unlikely to work as expected, since we'd be stacking a second semantic layer on top of the first one. For example, the semantic layer might not allow adhoc metrics or computed columns, making it incompatible with the Superset dataset editor, and making the experience in Explore unintuitive and error-prone.
   
   For these “semantic datasets” (or “semantic models”) we likely want to 
disable the Superset semantic layer, since the metadata is already defined 
externally:
   
   1. Users shouldn't be able to add metrics or calculated columns when editing them, unless the semantic layer supports adhoc expressions.
   2. Similar to the existing "Sync columns from source" button, there should be a "Sync metrics from source" button to fetch new or updated metrics from the semantic layer.
   3. When explored, users should not be allowed to add adhoc metrics or derived columns, or to use "Custom SQL", unless the semantic layer supports adhoc expressions.
   4. When explored, not all metrics and dimensions may be compatible with each other, requiring a mechanism for disabling metrics/columns as the user selects metrics/columns, similar to the in-house approach used by Minerva.
   
   This suggests that these **semantic models should not be represented as 
standard datasets in Superset**, given that in order to make it work we need to 
remove all the value that datasets provide — the semantics.
   
   ### Query generation
   
   The current flow for chart creation is tightly coupled not only to SQL but to SQLAlchemy. The frontend sends a "semantic request" to the backend, via the `QueryObject`, indicating which columns, metrics, filters, and other parameters the user has selected. The backend then generates a SQLAlchemy query from this payload. Query generation is highly dependent on the specific database engine, so the `get_sqla_query` method inspects several attributes of the DB engine spec. Once the query is generated, it is compiled to the target dialect using SQLAlchemy and passed to the DB engine spec for execution.
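
   To make the semantic request concrete, it is roughly shaped like the example below; this is a simplified illustration, not the full `QueryObject` schema, which also carries time grain, ordering, post-processing operations, and more.

   ```python
   # Simplified illustration of the "semantic request" sent from Explore.
   # Field names follow the chart data API payload; the values are made up.
   query = {
       "columns": ["country", "platform"],
       "metrics": ["count", "sum__sales"],
       "filters": [{"col": "year", "op": "==", "val": 2024}],
       "row_limit": 10000,
   }
   ```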
   
   This approach works reasonably well for traditional databases, but it 
creates friction when integrating with semantic layers that do not use SQL or 
SQLAlchemy. For example, if a semantic layer uses GraphQL or REST APIs, the 
current flow requires parsing the generated SQL and building a custom request. 
This is the case for the experimental MetricFlow DB engine spec, which has this 
flow:
   
   ```
   Explore controls
          ↓
     QueryObject
          ↓
   SQLAlchemy query
          ↓
      pseudo-SQL
          ↓
   Shillelagh handler
          ↓
    GraphQL request
   ```
   
   For Snowflake, even though it exposes a SQL interface to semantic views, the 
flow looks like this:
   
   ```
   Explore controls
          ↓
     QueryObject
          ↓
   SQLAlchemy query
          ↓
      pseudo-SQL
          ↓
     sqlglot AST
          ↓
      actual SQL
          ↓
       execute
   ```
   
   The SQL parsing step is necessary for Snowflake because Explore builds the query on top of either a table or a subquery, while in Snowflake it should be a call to a UDTF (user-defined table function):
   
   ```sql
   -- SQL generated by Explore
   SELECT "Item.Brand", "Store.State", "StoreSales.TotalSalesQuantity"
   FROM pseudo_table
   GROUP BY "Item.Brand", "Store.State";
   
   -- final SQL after parsing and manipulating the AST
   SELECT * FROM SEMANTIC_VIEW(
       TPCDS_SEMANTIC_VIEW_SM
       DIMENSIONS Item.Brand, Store.State
       METRICS StoreSales.TotalSalesQuantity
   );
   ```
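
   To illustrate what that parsing and AST manipulation entails, here is a rough sketch using `sqlglot`; the function name is hypothetical and this is not the actual engine spec code.

   ```python
   import sqlglot


   def pseudo_sql_to_semantic_view(sql: str, semantic_view: str) -> str:
       """Rewrite Explore's pseudo-SQL into a Snowflake SEMANTIC_VIEW() query."""
       query = sqlglot.parse_one(sql, read="snowflake")

       # columns in the GROUP BY are dimensions; the remaining projections are metrics
       group = query.args.get("group")
       dimensions = [e.name for e in group.expressions] if group else []
       metrics = [
           e.alias_or_name for e in query.selects if e.alias_or_name not in dimensions
       ]

       return (
           "SELECT * FROM SEMANTIC_VIEW(\n"
           f"    {semantic_view}\n"
           f"    DIMENSIONS {', '.join(dimensions)}\n"
           f"    METRICS {', '.join(metrics)}\n"
           ");"
       )
   ```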
   
   This is not only inefficient and brittle but also limits the flexibility of 
how queries can be executed against different data sources. 
   
   ## Proposed Change
   
   In order to properly support semantic layers we need to move away from the current solutions based on pseudo-databases and custom DB engine specs. Instead, we should implement first-class support for semantic layers, bypassing the need for a dataset when exploring data. This will allow users to choose between the semantics provided by Superset datasets and those provided by an external system.
   
   The first change will be **the introduction of an `Explorable` interface**, defined as a Python protocol. This interface is in some ways similar to the existing `ExploreMixin`, which was added when we decided to support `Query` objects in Explore. The problem with the current `ExploreMixin` is that it's too tightly coupled with SQLAlchemy and datasets, and it has no clear separation between the functionality needed for data exploration and that needed for query generation.
   
   The new `Explorable`, on the other hand, is concerned only with chart 
building:
   
   ```python
   from __future__ import annotations

   from typing import Any, Protocol, runtime_checkable

   from pandas import DataFrame

   from superset.common.query_object import QueryObject

   # NOTE: ColumnInfo, MetricInfo, and CompatibilityResult are new metadata types
   # referenced by this proposal; a sketch of possible definitions follows the
   # protocol.


   @runtime_checkable
   class Explorable(Protocol):
   
       @property
       def name(self) -> str:
           """Display name for this explorable."""
           ...
   
       @property
       def uid(self) -> str:
           """Unique identifier for this explorable."""
           ...
   
       def get_columns(self) -> set[ColumnInfo]:
           """
           Get available columns/dimensions for exploration.
   
           Returns:
               Set of column metadata objects that can be used as dimensions.
           """
           ...
   
       def get_metrics(self) -> set[MetricInfo]:
           """
           Get available metrics for exploration.
   
           Returns:
               Set of metric metadata objects that can be aggregated.
           """
           ...
           
       def get_values(
           self,
           dimension: str,
           top_n: int | None,
           filters: list[dict[str, Any]] | None,
       ) -> list[Any]:
           """
           Get distinct values for a dimension/column.

           Useful for populating filter dropdowns and understanding data cardinality.

           Args:
               dimension: Name of the dimension/column to get values for
               top_n: Optional limit on number of values to return (e.g., top 100 by frequency)
               filters: Optional filters to apply before getting values

           Returns:
               List of distinct values for the dimension, optionally limited to top_n
           """
           ...
   
       def get_dataframe(self, query_object: QueryObject) -> DataFrame:
           """
           Execute a query and return results as a pandas DataFrame.
   
           This method abstracts away SQL generation, allowing semantic layers
           and other non-SQL data sources to process queries directly.
   
           Args:
               query_object: The query specification from the Explore interface
   
           Returns:
               DataFrame containing the query results
   
           Raises:
               QueryObjectValidationError: If the query object is invalid
               DataSourceError: If there's an error fetching data
           """
           ...
   
       def check_compatibility(
           self,
           selected_metrics: set[str],
           selected_dimensions: set[str],
       ) -> CompatibilityResult:
           """
           Check compatibility of metrics and dimensions for semantic layers.
   
           For traditional SQL databases, this can return all metrics/dimensions
           as compatible. Semantic layers can implement logic to determine which
           combinations are valid based on their metadata.
   
           Args:
               selected_metrics: Set of metric names currently selected
               selected_dimensions: Set of dimension names currently selected
   
           Returns:
               CompatibilityResult indicating which other metrics/dimensions
               are compatible with the current selection, and which are not
               (with optional reasons).
           """
           ...
   
       @property
       def cache_timeout(self) -> int | None:
           """
           Cache timeout in seconds for query results.
   
           Returns:
               Number of seconds to cache results, or None for default behavior.
           """
           ...
   
       @property
       def default_time_column(self) -> str | None:
           """
           Default time column for time series analysis.
   
           Returns:
               Name of the default time column, or None if not applicable.
           """
           ...
   
       def can_explore(self) -> bool:
           """
           Check if this explorable can currently be used for exploration.
   
           This can perform permission checks, validate configuration,
           or check connectivity for semantic layers.
   
           Returns:
               True if exploration is currently possible, False otherwise.
           """
           ...
   ```
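
   The supporting types referenced above (`ColumnInfo`, `MetricInfo`, `CompatibilityResult`) are intentionally not pinned down by this proposal; a minimal sketch of what they could carry, purely for illustration:

   ```python
   from __future__ import annotations

   from dataclasses import dataclass, field


   @dataclass(frozen=True)
   class ColumnInfo:
       """Metadata about a column/dimension exposed by an Explorable."""

       name: str
       type: str | None = None
       is_temporal: bool = False
       groupable: bool = True
       filterable: bool = True
       description: str | None = None


   @dataclass(frozen=True)
   class MetricInfo:
       """Metadata about a metric exposed by an Explorable."""

       name: str
       expression: str | None = None  # may be opaque for external semantic layers
       description: str | None = None


   @dataclass
   class CompatibilityResult:
       """Which metrics/dimensions remain valid given the current selection."""

       compatible_metrics: set[str] = field(default_factory=set)
       compatible_dimensions: set[str] = field(default_factory=set)
       reasons: dict[str, str] = field(default_factory=dict)  # optional explanations
   ```

   Keeping `ColumnInfo` and `MetricInfo` frozen makes them hashable, which is what allows `get_columns` and `get_metrics` to return sets.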
   
   Note the `get_dataframe` method, which returns a pandas DataFrame from a `QueryObject`. This allows exploring data from sources that are not SQL-based, as well as decoupling SQL generation from SQLAlchemy. This gives a much simpler flow:
   
   ```
   Explore controls
          ↓
     QueryObject
          ↓
     Explorable
          ↓
      DataFrame
   ```
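
   For example, a semantic layer exposed over a plain HTTP API could implement `get_dataframe` directly, with no SQL involved. The sketch below is partial and hypothetical: only the query path is shown, and the endpoint, payload shape, and class name are invented.

   ```python
   from __future__ import annotations

   from typing import Any

   import pandas as pd
   import requests

   from superset.common.query_object import QueryObject


   class RestSemanticLayerModel:  # would implement the Explorable protocol
       def __init__(self, base_url: str, model: str, api_key: str) -> None:
           self.base_url = base_url
           self.model = model
           self.api_key = api_key

       @property
       def name(self) -> str:
           return self.model

       @property
       def uid(self) -> str:
           return f"rest-semantic-layer:{self.model}"

       def get_dataframe(self, query_object: QueryObject) -> pd.DataFrame:
           """Translate the QueryObject directly into an API request; no SQL involved."""
           payload: dict[str, Any] = {
               "model": self.model,
               "metrics": query_object.metrics or [],
               "dimensions": query_object.columns or [],
               "filters": query_object.filter or [],
               "limit": query_object.row_limit,
           }
           response = requests.post(
               f"{self.base_url}/query",  # hypothetical endpoint
               json=payload,
               headers={"Authorization": f"Bearer {self.api_key}"},
               timeout=60,
           )
           response.raise_for_status()
           return pd.DataFrame(response.json()["rows"])
   ```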
   
   Even though it's not strictly required for this proposal, we should also move the current SQL generation logic (`superset.models.helpers.get_sqla_query`) into the base DB engine spec; this makes sense since the method is tightly coupled with DB engine spec attributes anyway. It would also allow DB engine specs to implement their own SQL generation logic, for example using `sqlglot` instead, as sketched below.
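
   As a sketch of what that could look like once SQL generation lives on the engine spec (the method name and placement are hypothetical, not existing Superset API):

   ```python
   import sqlglot
   from sqlglot import exp


   class SqlglotEngineSpecMixin:  # hypothetical: SQL generation owned by the engine spec
       engine = "snowflake"

       @classmethod
       def build_select(
           cls, table: str, columns: list[str], metrics: dict[str, str]
       ) -> str:
           """Build a grouped SELECT with sqlglot instead of SQLAlchemy."""
           projections = [exp.column(c, quoted=True) for c in columns] + [
               sqlglot.parse_one(expression, read=cls.engine).as_(alias)
               for alias, expression in metrics.items()
           ]
           query = (
               sqlglot.select(*projections)
               .from_(table)
               .group_by(*(exp.column(c, quoted=True) for c in columns))
           )
           return query.sql(dialect=cls.engine)
   ```

   For instance, `build_select("sales", ["country"], {"total_sales": "SUM(amount)"})` would produce `SELECT "country", SUM(amount) AS total_sales FROM sales GROUP BY "country"`.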
   
   In addition to the `Explorable` we would also **add models for a new class 
of connections for semantic layers**. This would be similar to existing 
database connections, but with a few key differences:
   
   1. There will be no concept of a SQLAlchemy URI. The connection info should have a well-defined schema, which could vary wildly between different semantic layers. For example, Malloy could point to a GitHub repository and a database; Snowflake would require parameters to build a SQLAlchemy engine, just like today; and MetricFlow would require an API key and an optional customer URL. This is similar to how some DB engine specs today use the `BasicParametersMixin` for a better experience when defining the connection parameters (see the sketch after this list).
   2. The association between a given instance of a semantic layer and its 
implementation will be explicit. Today, there’s only an implicit mapping 
between a `Database` instance and the engine spec that should be used with it; 
we infer the DB engine spec based on the scheme of the SQLAlchemy URI, which 
has caused problems in the past, especially because early implementations used 
only the backend name, and not the driver.
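
   For illustration, the per-layer connection parameter schemas hinted at above could look something like the sketch below; the class and field names are guesses based on the examples, not a final design.

   ```python
   from __future__ import annotations

   from dataclasses import dataclass


   @dataclass
   class MetricFlowConnectionParameters:
       api_key: str
       url: str | None = None  # optional customer-specific URL


   @dataclass
   class MalloyConnectionParameters:
       repository: str     # GitHub repository holding the Malloy models
       database_uuid: str  # existing Superset database to run the queries against


   @dataclass
   class SnowflakeSemanticViewParameters:
       account: str
       username: str
       password: str
       warehouse: str | None = None
       role: str | None = None
   ```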
   
   An important note is that this change would not be very different from the 
initial versions of Superset, where we had different connectors for Druid (via 
REST and JSON, before its SQL API was introduced) and SQLAlchemy. Having 
different connectors offered some challenges, mostly when defining adhoc 
metrics (since for Druid the user would have to provide JSON). I hope that we 
can avoid these problems by ensuring a consistent interface in Explore that 
works across all types of `Explorable`, while still allowing for flexibility in 
how queries are executed.
   
   Finally, we would also need a lightweight model for explorables, containing 
metadata about them: UUID, name, parent connection, timeout, default metric 
when exploring.
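
   A minimal sketch of such a model, assuming SQLAlchemy/Flask-AppBuilder declarative models like the rest of the Superset metadata database; the table, column, and relationship names are placeholders:

   ```python
   import uuid

   from flask_appbuilder import Model
   from sqlalchemy import Column, ForeignKey, Integer, String
   from sqlalchemy.orm import relationship

   from superset.models.helpers import AuditMixinNullable


   class SemanticModel(Model, AuditMixinNullable):
       """Lightweight metadata record for an externally defined explorable."""

       __tablename__ = "semantic_models"

       id = Column(Integer, primary_key=True)
       uuid = Column(String(36), default=lambda: str(uuid.uuid4()), unique=True)
       name = Column(String(250), nullable=False)
       # hypothetical table for the new class of semantic layer connections
       connection_id = Column(Integer, ForeignKey("semantic_layer_connections.id"))
       connection = relationship("SemanticLayerConnection", backref="models")
       cache_timeout = Column(Integer, nullable=True)
       default_metric = Column(String(250), nullable=True)
   ```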
   
   The introduction of semantic layers should open up interesting new workflows. A deployment of Superset could have a traditional connection to Snowflake, allowing power users to run SQL directly, as well as to define semantic views in SQL Lab. These semantic views could then be exposed to other users via the semantic layer connection, providing a curated collection of metrics and dimensions that non-power users would have access to. Both connections to Snowflake, via DB engine spec and via semantic layer, would have their own purposes and target audiences.
   
   ## Implementation Plan
   
   ### Phase 1: implement the `Explorable`
   
   - Define the `Explorable` protocol.
   - Rename `ExploreMixin` to `SupersetExplorable` and have it implement the protocol.
   - Hook up Explore to the generic `Explorable`.
   
   ### Phase 2: introduce semantic layers
   
   - Add feature flag for enabling semantic layer support.
   - Add new models for semantic layer connections and semantic models.
   - CRUDIE (create, read, update, delete, import, export) for semantic layers.
   - DAR for semantic layers and semantic models.
   - Allow users to explore semantic models.
   
   ### Phase 3: UI/UX Updates
   
   - Update Explore to support semantic models:
       - No adhoc metrics, derived columns, or custom SQL when not supported.
       - Implement reactive UI for metric/dimension compatibility matrix.
   
   All of these phases require considerable work, and should be done with 
constant feedback from the community when it comes to terminology, UI, and UX.
   
   ## Alternative Approaches Considered
   
   We considered using the plugin architecture for semantic layers. While 
potentially valuable, the proposed approach provides the necessary flexibility 
without the complexity of a full plugin system.

