gada121982 commented on PR #10826: URL: https://github.com/apache/gravitino/pull/10826#issuecomment-4301716606
Thanks for the careful review, @FANNG1. I traced through the Spark source and the Gravitino source again to confirm your questions, and the picture is slightly different from what I wrote in the PR description. Let me clarify.

## 1. Trigger condition — confirmed

You are correct. The bug only reproduces when the current/default catalog is a Gravitino catalog. Specifically:

**Case 1 — `spark_catalog` is the current catalog, plugin registered but no `USE <gravitino_cat>`**: ``SELECT * FROM parquet.`path` `` works normally.

- `ResolveRelations` → `expandIdentifier([parquet, path])` → `CatalogAndIdentifier(spark_catalog, Identifier([parquet], path))`.
- `CatalogV2Util.loadTable` calls `V2SessionCatalog.loadTable` → the V1 `SessionCatalog` throws `NoSuchDatabaseException`, which `CatalogV2Util.loadTable` swallows ([`CatalogV2Util.scala#L331-L342`](https://github.com/apache/spark/blob/v3.5.3/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L331-L342)).
- `ResolveRelations` returns `u` unresolved → `ResolveSQLOnFile` picks it up and builds a `HadoopFsRelation`.

**Case 2 — current catalog is a Gravitino catalog** (via `USE` or `spark.sql.defaultCatalog`): the same query fails.

- `expandIdentifier` resolves to `CatalogAndIdentifier(gravitino_cat, Identifier([parquet], path))`.
- `BaseCatalog.loadTable` → `loadGravitinoTable` → REST call to the Gravitino server.
- The server throws `IllegalArgumentException` (details in §2), which is **not** one of the three exceptions `CatalogV2Util.loadTable` catches → it propagates out → the query fails.

So the bug is real, but only on the Case 2 path. In practice every production deployment that uses `GravitinoSparkPlugin` sets `spark.sql.defaultCatalog` to a Gravitino catalog, so users hit this consistently — but you are right that the PR description overstates the scope by not distinguishing the two cases.

## 2. Root cause — narrower than the PR description

The PR description attributes the failure to `MetadataObjects.of(TABLE, names)` in the authorization filter. After re-reading the code, the more precise description is: [`MetadataObjects.parse(fullName, type)`](https://github.com/apache/gravitino/blob/main/api/src/main/java/org/apache/gravitino/MetadataObjects.java#L175-L185) splits `fullName` on `.` with no escape mechanism:

```java
List<String> parts = DOT_SPLITTER.splitToList(fullName);
...
return MetadataObjects.of(parts, type); // expects parts.size() == 3 for TABLE
```

When Spark sends ``SELECT * FROM parquet.`s3a://bucket/file.parquet` ``, the table name passed to the server is `s3a://bucket/file.parquet` — which contains dots. The server composes the fullName as `catalog.parquet.s3a://bucket/file.parquet`, splits it, gets four parts, and the length check in `MetadataObjects.of(List, Type)` throws `IllegalArgumentException`. This isn't limited to the authorization filter — it is a property of `MetadataObject`'s name representation itself.

## 3. False positive — acknowledged

You are right that this PR's guard would incorrectly hide a real Gravitino table whose schema is literally named `parquet`, `csv`, etc. For example:

```sql
CREATE SCHEMA gravitino_iceberg_cat.parquet;
CREATE TABLE gravitino_iceberg_cat.parquet.users (id INT);
USE gravitino_iceberg_cat;
SELECT * FROM parquet.users; -- would be silently short-circuited
```

In Gravitino this is a semantically valid naming choice, and it's a reasonable pattern in data lake conventions (`raw_json`, `staging_csv`, or `parquet` as a landing zone). Silent-wrong is worse than loud-fail, so this is a blocker for the current approach as-is.

## 4. Alternative fixes — looking for maintainer preference

Given the above, I see four candidate directions, listed narrowest to broadest. Happy to pivot to whichever one the maintainers prefer.
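To make the §2 failure concrete before weighing the options, here is a standalone sketch. It approximates the server-side behavior with plain `String.split` (Gravitino's actual code uses a Guava `DOT_SPLITTER`), and `DotSplitDemo` is just an illustrative name, not real Gravitino code:

```java
import java.util.Arrays;
import java.util.List;

public class DotSplitDemo {
    // Approximation of the behavior described in §2: the composed
    // fullName is split on '.' with no escape mechanism.
    static List<String> split(String fullName) {
        return Arrays.asList(fullName.split("\\."));
    }

    public static void main(String[] args) {
        // A normal three-level identifier splits cleanly into 3 parts,
        // satisfying the parts.size() == 3 expectation for TABLE.
        System.out.println(split("catalog.schema.users")); // 3 parts

        // A SQL-on-File path contains dots of its own, so the same split
        // over-fragments and the length check can no longer hold.
        String fullName = "catalog.parquet.s3a://bucket/file.parquet";
        List<String> parts = split(fullName);
        System.out.println(parts.size()); // 4
        System.out.println(parts);        // [catalog, parquet, s3a://bucket/file, parquet]
    }
}
```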
### Option A — Narrow the client-side guard to path-like names

Replace the `BUILTIN_DATASOURCE_FORMATS` whitelist with a check on `ident.name()` for path-like characters (`/`, `:`, `\`). A Gravitino-valid table name cannot contain these on any mainstream catalog backend (Iceberg, Hive, and JDBC all reject them at DDL time), while every SQL-on-File path does. This keeps the scope tiny, eliminates the false positive for `schema = parquet/csv/json/...`, and also covers DataSource formats Spark may add in the future (e.g. ``delta.`path` ``) without maintaining a list.

Trade-off: still a client-side symptom fix; a Gravitino user who somehow managed to create a table name containing `/` or `:` (none of the catalog backends allow this) would see a "table not found" instead of an `IllegalArgumentException`. A net improvement either way.

### Option B — Convert IAE to NoSuchTableException in the client

In `BaseCatalog.loadGravitinoTable` / `tableExists`, catch `IllegalArgumentException` (or a specific subclass like `IllegalNamespaceException`) and rethrow it as `NoSuchTableException`. This relies purely on the server's existing behavior and lets Spark's normal `ResolveSQLOnFile` fallback take over.

Trade-off: catching a broad exception type is a Java anti-pattern; it could mask programming errors that happen to throw IAE for unrelated reasons. A message-substring match (`"length of names must be"`) would be tighter but fragile across Gravitino versions.

### Option C — Server-side: throw NoSuchTableException instead of IAE

Fix at the source. In `MetadataObjects.parse` (or at the authorization expression evaluator's call site), when the length doesn't match what a three-level type expects, throw something that clients naturally map to "not found" — e.g. `NoSuchEntityException`, or an `IllegalArgumentException` → `NoSuchTableException` mapping in `ErrorHandlers`.
Rationale: an identifier that can't be expressed as a `(catalog, schema, table)` triple semantically means "that entity does not exist in Gravitino", not "the caller made a programming error". All consumers (Spark, Trino, Flink, CLI) would benefit without a per-connector workaround.

Trade-off: a server change has a wider blast radius, and it needs maintainer discussion on whether changing the exception class counts as a contract change.

### Option D — Redesign the MetadataObject name representation

Support escaping `.` in names, or store names internally as a `List<String>` instead of joining on `.`. This is the only fix that would also let Gravitino users create tables named `user.profile` etc., but it touches the authorization model, persistence layer, REST URL encoding, CLI, docs, and existing data. Out of scope for this PR — mentioned for completeness.

## 5. Recommendation and current state

I'd suggest **Option A** for this PR (minimal change, no false positive, covers future formats), with **Option C** tracked as a follow-up issue if maintainers agree it's the right long-term direction. I'll hold off on force-pushing until I hear which direction you and @diqiu50 prefer. Happy to add integration tests covering all three scenarios (`default=spark_catalog`, `default=gravitino_cat`, and the `schema=parquet` false positive) in the same PR once we converge.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
