gada121982 commented on PR #10826: URL: https://github.com/apache/gravitino/pull/10826#issuecomment-4301716606
Thanks for the careful review, @FANNG1. I traced through the Spark source and the Gravitino source again to confirm your questions, and the picture is slightly different from what I wrote in the PR description. Let me clarify.

## 1. Trigger condition — confirmed

You are correct. The bug only reproduces when the current/default catalog is a Gravitino catalog. Specifically:

**Case 1 — `spark_catalog` is the current catalog, plugin registered but no `USE <gravitino_cat>`**: ``SELECT * FROM parquet.`path` `` works normally.

- `ResolveRelations` → `expandIdentifier([parquet, path])` → `CatalogAndIdentifier(spark_catalog, Identifier([parquet], path))`.
- `CatalogV2Util.loadTable` calls `V2SessionCatalog.loadTable` → the V1 `SessionCatalog` throws `NoSuchDatabaseException`, which `CatalogV2Util.loadTable` swallows ([`CatalogV2Util.scala#L331-L342`](https://github.com/apache/spark/blob/v3.5.3/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L331-L342)).
- `ResolveRelations` returns `u` unresolved → `ResolveSQLOnFile` picks it up and builds a `HadoopFsRelation`.

**Case 2 — current catalog is a Gravitino catalog** (via `USE` or `spark.sql.defaultCatalog`): the same query fails.

- `expandIdentifier` resolves to `CatalogAndIdentifier(gravitino_cat, Identifier([parquet], path))`.
- `BaseCatalog.loadTable` → `loadGravitinoTable` → REST call to the Gravitino server.
- The server throws `IllegalArgumentException` (details in §2), which is **not** one of the three exceptions `CatalogV2Util.loadTable` catches → it propagates out → the query fails.

So the bug is real, but only on the Case 2 path. In practice every production deployment that uses `GravitinoSparkPlugin` sets `spark.sql.defaultCatalog` to a Gravitino catalog, so users hit this consistently — but you are right that the PR description overstates the scope by not distinguishing the two cases.

## 2. Root cause — narrower than the PR description

The PR description attributes the failure to `MetadataObjects.of(TABLE, names)` in the authorization filter. After re-reading the code, the more precise description is: [`MetadataObjects.parse(fullName, type)`](https://github.com/apache/gravitino/blob/main/api/src/main/java/org/apache/gravitino/MetadataObjects.java#L175-L185) splits `fullName` on `.` with no escape mechanism:

```java
List<String> parts = DOT_SPLITTER.splitToList(fullName);
...
return MetadataObjects.of(parts, type); // expects parts.size() == 3 for TABLE
```

When Spark sends ``SELECT * FROM parquet.`s3a://bucket/file.parquet` ``, the table name passed to the server is `s3a://bucket/file.parquet` — which contains dots. The server composes the fullName as `catalog.parquet.s3a://bucket/file.parquet`, splits it, gets four parts, and the length check in `MetadataObjects.of(List, Type)` throws `IllegalArgumentException`. This isn't limited to the authorization filter — it is a property of `MetadataObject`'s name representation itself.

## 3. False positive — acknowledged

You are right that this PR's guard would incorrectly hide a real Gravitino table whose schema is literally named `parquet`, `csv`, etc. For example:

```sql
CREATE SCHEMA gravitino_iceberg_cat.parquet;
CREATE TABLE gravitino_iceberg_cat.parquet.users (id INT);
USE gravitino_iceberg_cat;
SELECT * FROM parquet.users; -- would be silently short-circuited
```

In Gravitino this is a semantically valid naming choice, and it's a reasonable pattern in data lake conventions (`raw_json`, `staging_csv`, or `parquet` as a landing zone). Silent-wrong is worse than loud-fail, so this is a blocker for the current approach as-is.

## 4. Alternative fixes — looking for maintainer preference

Given the above, I see four candidate directions, listed narrowest to broadest. Happy to pivot to whichever one the maintainers prefer.
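To make the §2 failure concrete before weighing the options, here is a standalone sketch. It approximates the server-side behavior with plain `String.split` (Gravitino's actual code uses a Guava `DOT_SPLITTER`), and `DotSplitDemo` is just an illustrative name, not real Gravitino code:

```java
import java.util.Arrays;
import java.util.List;

public class DotSplitDemo {
    // Approximation of the behavior described in §2: the composed
    // fullName is split on '.' with no escape mechanism.
    static List<String> split(String fullName) {
        return Arrays.asList(fullName.split("\\."));
    }

    public static void main(String[] args) {
        // A normal three-level identifier splits cleanly into 3 parts,
        // satisfying the parts.size() == 3 expectation for TABLE.
        System.out.println(split("catalog.schema.users")); // 3 parts

        // A SQL-on-File path contains dots of its own, so the same split
        // over-fragments and the length check can no longer hold.
        String fullName = "catalog.parquet.s3a://bucket/file.parquet";
        List<String> parts = split(fullName);
        System.out.println(parts.size()); // 4
        System.out.println(parts);        // [catalog, parquet, s3a://bucket/file, parquet]
    }
}
```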
### Option A — Narrow the client-side guard to path-like names

Replace the `BUILTIN_DATASOURCE_FORMATS` whitelist with a check on `ident.name()` for path-like characters (`/`, `:`, `\`). A Gravitino-valid table name cannot contain these on any mainstream catalog backend (Iceberg, Hive, and JDBC all reject them at DDL time), while every SQL-on-File path does. This keeps the scope tiny, eliminates the false positive for `schema = parquet/csv/json/...`, and also covers DataSource formats Spark may add in the future (e.g. ``delta.`path` ``) without maintaining a list.

Trade-off: still a client-side symptom fix; a Gravitino user who somehow managed to create a table name containing `/` or `:` (none of the catalog backends allow this) would see a "table not found" instead of an `IllegalArgumentException`. A net improvement either way.

### Option B — Convert IAE to NoSuchTableException in the client

In `BaseCatalog.loadGravitinoTable` / `tableExists`, catch `IllegalArgumentException` (or a specific subclass like `IllegalNamespaceException`) and rethrow it as `NoSuchTableException`. This relies purely on the server's existing behavior and lets Spark's normal `ResolveSQLOnFile` fallback take over.

Trade-off: catching a broad exception type is a Java anti-pattern; it could mask programming errors that happen to throw IAE for unrelated reasons. A message-substring match (`"length of names must be"`) would be tighter but fragile across Gravitino versions.

### Option C — Server-side: throw NoSuchTableException instead of IAE

Fix at the source. In `MetadataObjects.parse` (or at the authorization expression evaluator's call site), when the length doesn't match what a three-level type expects, throw something that clients naturally map to "not found" — e.g. `NoSuchEntityException`, or an `IllegalArgumentException` → `NoSuchTableException` mapping in `ErrorHandlers`.
Rationale: an identifier that can't be expressed as a `(catalog, schema, table)` triple semantically means "that entity does not exist in Gravitino", not "the caller made a programming error". All consumers (Spark, Trino, Flink, CLI) would benefit without a per-connector workaround.

Trade-off: a server change has a wider blast radius, and it needs maintainer discussion on whether changing the exception class counts as a contract change.

### Option D — Redesign the MetadataObject name representation

Support escaping `.` in names, or store names internally as a `List<String>` instead of joining on `.`. This is the only fix that would also let Gravitino users create tables named `user.profile` etc., but it touches the authorization model, persistence layer, REST URL encoding, CLI, docs, and existing data. Out of scope for this PR — mentioned for completeness.

## 5. Recommendation and current state

I'd suggest **Option A** for this PR (minimal change, no false positive, covers future formats), with **Option C** tracked as a follow-up issue if maintainers agree it's the right long-term direction. I'll hold off on force-pushing until I hear which direction you and @diqiu50 prefer. Happy to add integration tests covering all three scenarios (`default=spark_catalog`, `default=gravitino_cat`, and the `schema=parquet` false positive) in the same PR once we converge.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
