gada121982 opened a new pull request, #10826:
URL: https://github.com/apache/gravitino/pull/10826

   ### What changes were proposed in this pull request?
   
   Short-circuit `BaseCatalog.loadTable` and `BaseCatalog.tableExists` with 
`NoSuchTableException` when the identifier's namespace is a single built-in 
Spark DataSource format name (`parquet`, `csv`, `json`, `orc`, `text`, `avro`, 
`binaryfile`). When the connector declines these lookups, the Spark analyzer 
falls back to its DataSource shortcut resolver and builds a `HadoopFsRelation` 
directly — the same path vanilla Spark takes.
   
   Ordinary Gravitino catalog/table lookups are unaffected: their identifiers arrive with a namespace that references a registered catalog, not a format name, so they continue through the existing `loadGravitinoTable` path and the usual authorization filter.
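   The dispatch described above can be sketched roughly as follows. This is a simplified stand-in, not the connector's actual code: the real `BaseCatalog` works with Spark's `Identifier` and throws Spark's `NoSuchTableException`, and `loadGravitinoTable` below is a placeholder for the existing lookup path.

   ```java
   import java.util.Locale;
   import java.util.Set;

   public class BaseCatalogSketch {
     // The seven built-in DataSource format names listed above.
     static final Set<String> BUILTIN_FORMATS =
         Set.of("parquet", "csv", "json", "orc", "text", "avro", "binaryfile");

     // Mirrors the new isBuiltinDataSourceReference helper: true only for a
     // single-element namespace whose element (case-insensitively) names a
     // built-in format, e.g. ["parquet"] or ["binaryFile"].
     static boolean isBuiltinDataSourceReference(String[] namespace) {
       return namespace != null
           && namespace.length == 1
           && BUILTIN_FORMATS.contains(namespace[0].toLowerCase(Locale.ROOT));
     }

     // Simplified loadTable: decline format-namespace lookups so Spark's
     // analyzer can fall back to its DataSource shortcut resolver.
     static String loadTable(String[] namespace, String name) {
       if (isBuiltinDataSourceReference(namespace)) {
         // Stand-in for Spark's NoSuchTableException.
         throw new RuntimeException("NoSuchTable: " + name);
       }
       return loadGravitinoTable(namespace, name);
     }

     // Placeholder for the existing Gravitino lookup path.
     static String loadGravitinoTable(String[] namespace, String name) {
       return String.join(".", namespace) + "." + name;
     }
   }
   ```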
   
   ### Why are the changes needed?
   
   ``SELECT * FROM parquet.`path` `` is a long-standing Spark built-in syntax. 
Today, when `GravitinoSparkPlugin` is registered, the analyzer sends 
`(namespace=[parquet], name=<path>)` to `BaseCatalog.loadTable`, which forwards 
it to the server. The server-side authorization filter calls 
`MetadataObjects.of(TABLE, names)` which requires `names.length == 3` 
(catalog.schema.table), so every such query fails with 
`IllegalArgumentException: If the type is TABLE, the length of names must be 3`.
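   In miniature, the rejected lookup carries only two name parts. The snippet below is a minimal stand-in for that length rule only (the real validation lives in Gravitino's `MetadataObjects.of`), not the server's actual code:

   ```java
   import java.util.List;

   public class NameLengthSketch {
     // Stand-in for the server-side 3-part-name (catalog.schema.table) check.
     static void requireTableName(List<String> names) {
       if (names.size() != 3) {
         throw new IllegalArgumentException(
             "If the type is TABLE, the length of names must be 3");
       }
     }
   }
   ```

   A format-shortcut lookup arrives as `["parquet", "<path>"]` — length 2 — and trips this check before any authorization decision is made.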
   
   Users expect Spark's built-in shortcut to keep working when the connector is 
installed; today it does not.
   
   Fix: #10825
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. After this PR, the following queries work when `GravitinoSparkPlugin` is enabled:
   
   ```sql
   SELECT * FROM parquet.`s3a://bucket/file.parquet`;
   SELECT * FROM csv.`/data/file.csv`;
   SELECT * FROM json.`hdfs:///events.json`;
   SELECT * FROM orc.`/data/file.orc`;
   SELECT * FROM text.`/data/file.txt`;
   SELECT * FROM avro.`/data/file.avro`;
   SELECT * FROM binaryFile.`/data/folder/`;
   ```
   
   Previously all of these failed at the server-side 3-part-name assertion.
   
   No API or property-key changes. No behavior change for user Gravitino tables.
   
   ### How was this patch tested?
   
   - New `TestBaseCatalog` covering six cases for the new `isBuiltinDataSourceReference` helper: built-in formats recognized, case-insensitive match, regular schema namespaces ignored, multipart namespaces ignored, empty namespaces ignored, unknown formats ignored.
   - `./gradlew :spark-connector:spark-common:test --tests org.apache.gravitino.spark.connector.catalog.TestBaseCatalog` → all tests pass.
   - `./gradlew :spark-connector:spark-common:spotlessCheck` → clean.
   - End-to-end manual verification on Spark 3.5.8 + Gravitino 1.2.0 with GVFS: ``SELECT * FROM parquet.`gvfs://fileset/.../file.parquet` `` now returns rows; `SELECT * FROM <gravitino_iceberg_cat>.db.t` remains unaffected.

