gada121982 commented on PR #10826:
URL: https://github.com/apache/gravitino/pull/10826#issuecomment-4301716606

   Thanks for the careful review, @FANNG1. I traced through the Spark source 
and Gravitino source again to confirm your questions, and the picture is 
slightly different from what I wrote in the PR description. Let me clarify.
   
   ## 1. Trigger condition — confirmed
   
   You are correct. The bug only reproduces when the current/default catalog is 
a Gravitino catalog. Specifically:
   
   **Case 1 — `spark_catalog` is the current catalog, plugin registered but no `USE <gravitino_cat>`**: ``SELECT * FROM parquet.`path` `` works normally.
   
   - `ResolveRelations` → `expandIdentifier([parquet, path])` → 
`CatalogAndIdentifier(spark_catalog, Identifier([parquet], path))`.
   - `CatalogV2Util.loadTable` calls `V2SessionCatalog.loadTable` → V1 
`SessionCatalog` throws `NoSuchDatabaseException`, which 
`CatalogV2Util.loadTable` swallows 
([`CatalogV2Util.scala#L331-L342`](https://github.com/apache/spark/blob/v3.5.3/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L331-L342)).
   - `ResolveRelations` returns `u` unresolved → `ResolveSQLOnFile` picks it up 
and builds `HadoopFsRelation`.
   
   **Case 2 — current catalog is a Gravitino catalog** (via `USE` or 
`spark.sql.defaultCatalog`): same query fails.
   
   - `expandIdentifier` resolves to `CatalogAndIdentifier(gravitino_cat, 
Identifier([parquet], path))`.
   - `BaseCatalog.loadTable` → `loadGravitinoTable` → REST call to the 
Gravitino server.
   - Server throws `IllegalArgumentException` (details in §2), which is **not** 
one of the three exceptions `CatalogV2Util.loadTable` catches → propagates out 
→ query fails.
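   
   For reference, a minimal repro sketch of the two cases. The plugin class and the `spark.sql.gravitino.*` keys follow the documented connector setup; the URI, metalake, catalog name, and file path are illustrative placeholders:
   
   ```java
   import org.apache.spark.sql.SparkSession;
   
   public class SqlOnFileRepro {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .master("local[*]")
           .config("spark.plugins",
               "org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin")
           .config("spark.sql.gravitino.uri", "http://localhost:8090")      // placeholder
           .config("spark.sql.gravitino.metalake", "demo_metalake")         // placeholder
           .getOrCreate();
   
       // Case 1: current catalog is spark_catalog -> SQL-on-File works.
       spark.sql("SELECT * FROM parquet.`/tmp/data/file.parquet`").show();
   
       // Case 2: make a Gravitino catalog current -> the identical query now
       // routes through BaseCatalog.loadTable and fails with the server's
       // IllegalArgumentException (see §2).
       spark.sql("USE gravitino_cat");
       spark.sql("SELECT * FROM parquet.`/tmp/data/file.parquet`").show();
     }
   }
   ```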
   
   So the bug is real, but only on the Case 2 path. In practice, most production deployments that use `GravitinoSparkPlugin` set `spark.sql.defaultCatalog` to a Gravitino catalog, so users hit this consistently — but you are right that the PR description overstates the scope by not distinguishing the two cases.
   
   ## 2. Root cause — narrower than the PR description
   
   The PR description attributes the failure to ``MetadataObjects.of(TABLE, 
names)`` in the authorization filter. After re-reading the code, the more 
precise description is:
   
   [`MetadataObjects.parse(fullName, 
type)`](https://github.com/apache/gravitino/blob/main/api/src/main/java/org/apache/gravitino/MetadataObjects.java#L175-L185)
 splits `fullName` on `.` with no escape mechanism:
   
   ```java
   List<String> parts = DOT_SPLITTER.splitToList(fullName);
   ...
   return MetadataObjects.of(parts, type);  // expects parts.size() == 3 for TABLE
   ```
   
   When Spark sends ``SELECT * FROM parquet.`s3a://bucket/file.parquet` ``, the table name passed to the server is `s3a://bucket/file.parquet` — which contains dots. The server composes the fullName as `catalog.parquet.s3a://bucket/file.parquet`, splits it, gets 4 parts, and the length check in `MetadataObjects.of(List, Type)` throws `IllegalArgumentException`.
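   
   A standalone sketch of the failure, assuming `DOT_SPLITTER` is a plain Guava `Splitter.on('.')` (any trimming options it carries don't change the outcome here):
   
   ```java
   import com.google.common.base.Splitter;
   import java.util.List;
   
   public class DotSplitDemo {
     // Assumed to match the DOT_SPLITTER used by MetadataObjects.
     private static final Splitter DOT_SPLITTER = Splitter.on('.');
   
     public static void main(String[] args) {
       // The fullName the server composes for the SQL-on-File query above.
       String fullName = "catalog.parquet.s3a://bucket/file.parquet";
       List<String> parts = DOT_SPLITTER.splitToList(fullName);
       // Prints: [catalog, parquet, s3a://bucket/file, parquet] (4 parts),
       // so the parts.size() == 3 check for TABLE throws IllegalArgumentException.
       System.out.println(parts + " (" + parts.size() + " parts)");
     }
   }
   ```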
   
   This isn't limited to the authorization filter — it is a property of 
`MetadataObject`'s name representation itself.
   
   ## 3. False-positive — acknowledged
   
   You are right that this PR's guard would incorrectly hide a real Gravitino 
table whose schema is literally named `parquet`, `csv`, etc. For example:
   
   ```sql
   CREATE SCHEMA gravitino_iceberg_cat.parquet;
   CREATE TABLE gravitino_iceberg_cat.parquet.users (id INT);
   USE gravitino_iceberg_cat;
   SELECT * FROM parquet.users;  -- would be silently short-circuited
   ```
   
   In Gravitino this is a semantically valid naming choice, and it's a reasonable pattern under common data lake conventions (`raw_json`, `staging_csv`, or `parquet` as a landing zone). A silently wrong result is worse than a loud failure, so this is a blocker for the current approach as-is.
   
   ## 4. Alternative fixes — looking for maintainer preference
   
   Given the above, I see four candidate directions, listed narrowest to 
broadest. Happy to pivot to whichever one the maintainers prefer.
   
   ### Option A — Narrow the client-side guard to path-like names
   
   Replace the `BUILTIN_DATASOURCE_FORMATS` whitelist with a check on `ident.name()` for path-like characters (`/`, `:`, `\`). A Gravitino-valid table name cannot contain these on any mainstream catalog backend (Iceberg, Hive, JDBC all reject them at DDL time), and every SQL-on-File path does. Keeps scope tiny, eliminates the false-positive for `schema = parquet/csv/json/...`, and also covers DataSource formats Spark may add in the future (e.g. ``delta.`path` ``) without maintaining a list.
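   
   A sketch of the narrowed guard, assuming it lives in the connector's `BaseCatalog` with Spark's `TableCatalog` types (`Identifier`, `Table`, `NoSuchTableException`) already imported; `looksLikePath` is a hypothetical helper name, not existing code:
   
   ```java
   // Hypothetical helper for Option A; name and placement are illustrative.
   private static boolean looksLikePath(String name) {
     // Every SQL-on-File identifier (s3a://bucket/x.parquet, /tmp/x.csv, C:\x.csv)
     // contains one of these; Gravitino-valid table names on the mainstream
     // backends never do.
     return name.indexOf('/') >= 0 || name.indexOf(':') >= 0 || name.indexOf('\\') >= 0;
   }
   
   @Override
   public Table loadTable(Identifier ident) throws NoSuchTableException {
     if (looksLikePath(ident.name())) {
       // Report "not found" so Spark's ResolveSQLOnFile fallback takes over.
       throw new NoSuchTableException(ident);
     }
     return loadGravitinoTable(ident);  // existing path, signature assumed
   }
   ```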
   
   Trade-off: still a client-side symptom fix; a Gravitino user who somehow managed to create a table name containing `/` or `:` (none of the catalog backends allow this) would see "table not found" instead of an `IllegalArgumentException`. Net improvement either way.
   
   ### Option B — Convert IAE to NoSuchTableException in the client
   
   In `BaseCatalog.loadGravitinoTable` / `tableExists`, catch 
`IllegalArgumentException` (or a specific subclass like 
`IllegalNamespaceException`) and rethrow as `NoSuchTableException`. Relies 
purely on the server's existing behavior and lets Spark's normal 
`ResolveSQLOnFile` fallback take over.
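   
   Shape of the change, with the surrounding names assumed from the description above rather than copied from the source:
   
   ```java
   // Inside BaseCatalog.loadGravitinoTable; illustrative only.
   try {
     return doLoadGravitinoTable(ident);  // hypothetical stand-in for the REST call
   } catch (IllegalArgumentException iae) {
     // The server raises IAE for identifiers whose table name contains '.'
     // (see §2). Mapping it to "not found" lets Spark's ResolveSQLOnFile
     // fallback run, at the cost of masking unrelated IAEs (trade-off below).
     throw new NoSuchTableException(ident);
   }
   ```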
   
   Trade-off: catching a broad exception type is a Java anti-pattern; it could 
mask programming errors that happen to throw IAE for unrelated reasons. A 
message-substring match (`"length of names must be"`) would be tighter but 
fragile across Gravitino versions.
   
   ### Option C — Server-side: throw NoSuchTableException instead of IAE
   
   Fix at the source. In `MetadataObjects.parse` (or at the authorization expression evaluator's call site), when the part count doesn't match the expected length for a three-level type, throw something that clients naturally map to "not found": e.g. `NoSuchEntityException` on the server, or add an `IllegalArgumentException` → `NoSuchTableException` mapping in the client's `ErrorHandlers`.
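   
   One possible shape in `MetadataObjects.parse`, purely illustrative; whether `NoSuchEntityException` is the right type (and is even visible from the api module) is exactly the contract question noted below:
   
   ```java
   // Hypothetical sketch of Option C; today the size check throws
   // IllegalArgumentException instead.
   List<String> parts = DOT_SPLITTER.splitToList(fullName);
   if (type == MetadataObject.Type.TABLE && parts.size() != 3) {
     // A name that can't form a (catalog, schema, table) triple names nothing
     // in Gravitino; report "not found" rather than a caller programming error.
     throw new NoSuchEntityException("no such table entity: %s", fullName);
   }
   return MetadataObjects.of(parts, type);
   ```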
   
   Rationale: an identifier that can't be expressed as a `(catalog, schema, 
table)` triple semantically means "that entity does not exist in Gravitino", 
not "the caller made a programming error". All consumers (Spark, Trino, Flink, 
CLI) would benefit without a per-connector workaround.
   
   Trade-off: server change, wider blast radius, needs maintainer discussion on 
whether changing the exception class is considered a contract change.
   
   ### Option D — Redesign MetadataObject name representation
   
   Support escaping `.` in names, or store names as `List<String>` internally 
instead of joining on `.`. This would be the only fix that also allows 
Gravitino users to create tables named `user.profile` etc., but it touches the 
authorization model, persistence layer, REST URL encoding, CLI, docs, and 
existing data. Out of scope for this PR — mentioning for completeness.
   
   ## 5. Recommendation and current state
   
   I'd suggest **Option A** for this PR (minimal change, no false-positive, 
covers future formats) with **Option C** tracked as a follow-up issue if 
maintainers agree it's the right long-term direction.
   
   I'll hold off on force-pushing until I hear which direction you and @diqiu50 
prefer. Happy to add integration tests covering all three scenarios 
(`default=spark_catalog`, `default=gravitino_cat`, and the `schema=parquet` 
false-positive) in the same PR once we converge.

