FANNG1 commented on PR #10721: URL: https://github.com/apache/gravitino/pull/10721#issuecomment-4231630332
Thanks for the fix! The overall approach — adding a `toGravitinoDistribution` hook on `BaseCatalog` and overriding it in the Paimon catalog — is clean and minimally invasive. I have two concerns I'd like to discuss before merging. ## 1. (Blocker) Missing `bucket-key` is silently dropped on PK tables `GravitinoPaimonCatalog.getDistribution` returns `Distributions.NONE` whenever `bucket-key` is blank, and `PaimonPropertiesConverter.toGravitinoTableProperties` then strips both `bucket-key` and `bucket`. But per the [Paimon 1.2 config docs](https://paimon.apache.org/docs/1.2/maintenance/configurations/#bucket), `bucket-key` is **optional and defaults to the primary key** for PK tables. So the following perfectly valid statement: ```sql CREATE TABLE t ( id BIGINT, name STRING, PRIMARY KEY (id) NOT ENFORCED ) WITH ('bucket' = '8'); ``` will be silently created with no distribution at all instead of an 8-bucket HASH(id) table. This is worse than the original bug — it's silent semantic corruption of the user's table layout. **Suggested fix (sketch):** let the hook see schema/PK info (the `resolvedTable` is already in scope in `BaseCatalog.createTable`), and in the Paimon override fall back to the primary key columns when `bucket-key` is blank. If there's no PK either and `bucket` is set, fail fast with a clear error rather than silently dropping it. Please also add tests for: PK + only `bucket`, PK + explicit `bucket-key` overriding PK, and non-PK + only `bucket`. ## 2. Load path doesn't reconstruct `bucket-key` / `bucket` from `Distribution` The PR converts properties → `Distribution` on create, but doesn't do the reverse on load. Today this happens to work because `GravitinoPaimonTable.fromPaimonTable` (`GravitinoPaimonTable.java:121`) does `withProperties(table.options())`, leaving the raw `bucket-key`/`bucket` in `properties`, and the Flink connector just passes them through. **But that's an implicit dependency**, not something this PR establishes. Risks of relying on it: - **Asymmetry / fragility**: if anyone later cleans up the server side to make `Distribution` the single source of truth (and stops echoing reserved properties back into `properties`), Flink load will silently lose bucket info. - **Cross-entry consistency**: a table created via REST / Spark / Java client may have `Distribution` set but not have `bucket-key`/`bucket` in `properties`. Flink load would then be inconsistent with Spark load on the same table. - **Alter path**: `BaseCatalog.alterTable` calls `getTable(tablePath)` first and diffs against the new table. If `existingTable.getOptions()` is missing the bucket keys, an unrelated alter could end up dropping the distribution. **Suggested fix (sketch):** add a symmetric reverse step in `GravitinoPaimonCatalog.toFlinkTable` (or a `fromGravitinoDistribution` hook on `BaseCatalog`) that reconstructs `bucket-key` / `bucket` from `table.distribution()`, treating `Distribution` as the authoritative source rather than relying on properties round-tripping through the server. At a minimum, please add a **round-trip integration test**: create a table with `bucket` / `bucket-key` via Flink SQL, then load it back through Flink (`SHOW CREATE TABLE` or `tableEnv.getCatalog().getTable(...)`) and assert both options are present and correct — without this, the implicit dependency above will silently break in the future. --- **Summary:** (1) is a blocker (silent data-layout corruption); (2) is strongly recommended, or at the very least a round-trip test should be added. Happy to discuss — thanks again for picking this up! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
