rraulinio opened a new issue, #1205: URL: https://github.com/apache/iceberg-go/issues/1205
### Apache Iceberg version main (development) ### Please describe the bug 🐞 ### Apache Iceberg version main (development) ### Please describe the bug 🐞 > Follow-up to #1190 / #1191 (requested in [this review comment](https://github.com/apache/iceberg-go/pull/1191#issuecomment-4694968108)). This is a **consistency / latent-footgun** fix, not a live wrong-results bug; #1191 already fixed the one path that miscompared. Filing separately because the change has real blast radius. ## Problem `PartitionSpec.PartitionType(schema)` **compacts** its result: a partition field whose source column is absent from `schema` (e.g. the column was dropped) is skipped rather than retained. This diverges from Java Iceberg, whose `PartitionSpec.partitionType()` **retains every field**, substituting `UnknownType` when the source column is gone: ```java // When the source field has been dropped we cannot determine the type if (sourceType == null) { resultType = Types.UnknownType.get(); } ``` Because manifest partition summaries (`ManifestFile.Partitions()` → `[]FieldSummary`) are **positional against the full partition spec**, the compacted struct cannot be used to index them. #1191 worked around this by adding a parallel `manifestPartitionFields` helper in `table/evaluators.go` that retains every field with an `UnknownType` placeholder, leaving the codebase with **two notions of "partition type"** (compacted vs. full). Any future positional consumer that reaches for `PartitionType()` will silently reintroduce the #1190 class of bug. ## Why This Matters - **Single source of truth.** One method should define a spec's partition struct; the `manifestPartitionFields` duplicate exists only to compensate for `PartitionType()`'s compaction. - **Reference parity.** Java retains dropped-source fields as `UnknownType`; matching it removes a subtle behavioral divergence. - **Footgun prevention.** The compaction is an implicit trap for positional callers, not a documented contract. (#1191 adds a doc note to `PartitionType()` warning against positional use as an interim guard.) ## Expected Behavior `PartitionType(schema)` retains **all** partition spec fields in order, using `UnknownType` for any field whose source column is missing from `schema`, matching Java. `manifestPartitionFields` is then deleted and `newManifestEvaluator` calls `PartitionType()` directly. ## Blast Radius (call sites to audit) `PartitionType()` (and the positional `newPartitionRecord`/`GetPartitionRecord` builder that keys values by `FieldID` over `FieldList`) is consumed by: **Write / encode paths** - in practice always invoked with a schema where every partition source is present, so compaction is a no-op today; each must nonetheless tolerate an `UnknownType` placeholder after the change: - `manifest.go`- partition `fieldStats` building, and `partitionTypeToAvroSchema(...)` (**avro mapping for `UnknownType` must be confirmed**) - `data_file_codec.go` - partition encoding - `partitions.go` - `PartitionToPath` - `table/snapshots.go` - `GetPartitionRecord` for produced data/delete files - `table/partitioned_fanout_writer.go`, `table/pos_delete_partitioned_fanout_writer.go` **Read / scan path** - internally consistent today (schema + record built from the *same* compacted list), but should keep working once placeholders appear: - `table/scanner.go` - `buildPartitionEvaluator` (`partSchema` + `GetPartitionRecord`); the projected partition filter never references a dropped source, so an unreferenced `UnknownType` placeholder must remain harmless to `ExpressionEvaluator` binding. ## Proposed Fix 1. Change `PartitionType()` to emit an `UnknownType` placeholder (instead of `continue`) for fields whose source column is absent. 2. Remove `manifestPartitionFields`; have `newManifestEvaluator` use `PartitionType()`. 3. Audit the consumers above - primarily that avro partition-schema conversion and the data-file codec tolerate/guard `UnknownType`, and that evaluator binding ignores unreferenced placeholder fields. 4. Tests: dropped-source partition field round-trips positionally through manifest evaluation (the #1191 regression test), plus avro/codec encode and scan-evaluator coverage with a dropped source present. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
