stevenzwu commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3250869815
##########
format/spec.md:
##########
@@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with
 Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform.
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata location fields are classified as one of two types:
+
+* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification.
+* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths.
+
+#### Path Resolution
+
+Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location:
+
+* If the path contains a URI scheme, it is absolute and is used without modification.
+* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`.
+
+Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters.
+
+Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths.
+
+Examples of path resolution:
+
+| | Format Version | Table Location | File Path | Resolved Path | Description |
+|-----------------|----------------|----------------------|------------------------------------------|-------------------------------------------|-------------|
+| Relative Path | v4 | s3://bucket/db/table | data/00000-0.parquet | s3://bucket/db/table/data/00000-0.parquet | Path parts are joined on `/` |
+| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/00000-0.parquet | hdfs://wh/db/table/data/00000-0.parquet | Absolute path is used |
+| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/00000-0.parquet | s3://bucket/db/table/data/00000-0.parquet | Fully-qualified path is used |
+| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/00000-0.parquet | hdfs:/wh/db/table/data/00000-0.parquet | Scheme is prepended for consistency |
+
+#### Path Relativization
+
+Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files.
+
+* If an absolute path starts with the table location, the table location prefix should be removed along with the separator character and the remaining relative portion stored.

Review Comment:
The new wording, "removed along with the separator character", is the closest the spec gets to closing the prefix-collision case discussed in apache/iceberg#16174 (https://github.com/apache/iceberg/pull/16174#discussion_r3228742112), but it's only implicit. The text can be read in two ways:

- **Strict**: the rule applies only when the prefix is followed by the separator. If the next character isn't `/`, the prefix-removal-with-separator can't be performed, so the absolute path is stored.
- **Lax**: the rule says "strip the prefix, then strip the separator if present." `relativize("s3://bucket/table", "s3://bucket/table_v2/file")` would still strip the prefix and produce `_v2/file`.

The lax reading leaves that prefix-collision case open. Worth pinning this down explicitly. Suggested wording:

> An absolute path is considered under the table location if and only if it starts with the table location followed by the URI separator `/`. In that case, the table location and the separator are removed, and the remaining portion is stored as a relative path. Otherwise, the absolute path is stored.

Also worth adding one more row to the resolution example table (around line 212) that pins this case, e.g. a sibling file at `s3://bucket/db/table_v2/file.parquet` against `s3://bucket/db/table` showing the absolute path is stored, not relativized.

##########
format/spec.md:
##########
@@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with
 Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform.
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata location fields are classified as one of two types:
+
+* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification.
+* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths.
+
+#### Path Resolution
+
+Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location:
+
+* If the path contains a URI scheme, it is absolute and is used without modification.
+* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`.
+
+Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters.

Review Comment:
Two issues with this paragraph after the latest edit:

1. **`must` → `should` introduces a real ambiguity.** With the new join-with-`/` rule on line 202, a permitted table location of `s3://bucket/table/` plus a relative `data/file` produces `s3://bucket/table//data/file` (double slash). The spec doesn't say whether implementations must reject, normalize, or accept that. Either keep `must`, or add a normalization clause covering trailing-separator table locations.
2. **"without consideration of any additional separator characters"** is unclear. The new wording can be read as "any separators already present are ignored," which contradicts line 202 (the join *adds* a `/`). Suggest rewriting to: "no additional separator characters are introduced beyond the single `/` join character."

##########
format/spec.md:
##########
@@ -1777,6 +1886,24 @@ Note that these requirements apply when writing data to a v2 table. Tables that
 This section covers topics not required by the specification but recommendations for systems implementing the Iceberg specification to help maintain a uniform experience.
+### Path Construction
+
+Path construction is the process by which new file locations are created for output files referenced by metadata. While the specific construction logic is not strictly required by the spec, the following guidance is provided for reference implementations to encourage consistency.
+
+The table properties `write.metadata.path` and `write.data.path` control where metadata and data files are written relative to the table location. When not specified, these default to the values `metadata` and `data` respectively.
+
+For all metadata files:
+
+* If `write.metadata.path` is an absolute path, it is used directly as the base for new metadata files.
+* If `write.metadata.path` is a relative path, the metadata base is the table location joined to the `write.metadata.path` value with a URI separator `/`.
+
+For data files:
+
+* If `write.data.path` is an absolute path, it is used directly as the base for new data files.
+* If `write.data.path` is a relative path, the base is the table location joined to the `write.data.path` value with a URI separator `/`.
+
+When persisting paths into metadata, writers should relativize paths against the table location (see [Path Relativization](#path-relativization)). If a file's absolute path shares a common prefix with the table location, the relative portion should be stored. Otherwise, the absolute path should be stored.

Review Comment:
This recommendation wasn't updated to match the new boundary-aware rule in [Path Relativization](#path-relativization). "Shares a common prefix" is the original ambiguous phrasing, the same one that allows `s3://bucket/table_v2/file` to be "under" `s3://bucket/table` under a byte-prefix interpretation. Suggested rewording to mirror the normative section:

> When persisting paths into metadata, writers should relativize paths against the table location (see [Path Relativization](#path-relativization)). If a file's absolute path starts with the table location followed by the URI separator `/`, the relative portion (after removing the prefix and separator) should be stored. Otherwise, the absolute path should be stored.

##########
format/spec.md:
##########
@@ -134,8 +149,10 @@ Tables do not require rename, except for tables that use atomic rename to implem
 * **Manifest** -- A file that lists data or delete files; a subset of a snapshot.
 * **Data file** -- A file that contains rows of a table.
 * **Delete file** -- A file that encodes rows of a table that are deleted by position or data values.
+* **Absolute path** -- A path string that includes a [URI](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) scheme and can be used directly.
+* **Relative path** -- A path string without a URI scheme that must be [resolved](#path-resolution) against the table location.

Review Comment:
The spec defines **Absolute path** and **Relative path** in the bullets above, but **fully-qualified path** is used here (and on lines 136, 206, 1741, and as a row label in the example table at line 214) without a definition. The term is doing work that the two defined terms can't quite cover.

Line 206 says "Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme **if the scheme was omitted** to be consistent with V4 paths." That implies a fully-qualified path may *not* have a URI scheme, which conflicts with the v4 **Absolute path** definition (which requires one). The "Missing scheme" example row at line 215 reinforces this: a pre-v4 path can lack a URI scheme yet still be treated as a usable, non-relative path.

Two cleanup options:

1. **Define the term**, e.g.:
   > **Fully-qualified path** -- A path used in v3 and earlier metadata that is treated as ready-to-use without resolution against the table location. May contain a URI scheme; when omitted, an implementation-defined default scheme is prepended on read.
2. **Drop the term** and use **Absolute path** everywhere, with a clarifying clause that pre-v4 paths omitting a URI scheme are treated as absolute after the default scheme is prepended.

Option 2 feels cleaner: one fewer term, and the migration rule on lines 1741-1742 reads more naturally as "v3 paths are absolute paths (with the scheme prepended if missing)."

##########
format/spec.md:
##########
@@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with
 Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform.
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata location fields are classified as one of two types:
+
+* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification.
+* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths.
+
+#### Path Resolution
+
+Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location:
+
+* If the path contains a URI scheme, it is absolute and is used without modification.
+* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`.
+
+Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters.
+
+Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths.

Review Comment:
This sentence reads a bit awkwardly. Is this clearer?

"Paths in pre-v4 manifests are fully-qualified. When a pre-v4 path omits a URI scheme, readers must prepend a scheme to produce a v4-consistent absolute path."
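
For reference, the strict and lax readings of the relativization rule, and the double-slash join raised in the comments above, can be sketched in a few lines of Python. This is only an illustration of the spec text under discussion, not Iceberg's implementation: `has_scheme`, `resolve`, `relativize_strict`, and `relativize_lax` are hypothetical names, and the scheme check is a simplified reading of RFC 3986 section 3.1.

```python
import string

SEP = "/"
_SCHEME_CHARS = set(string.ascii_letters + string.digits + "+-.")


def has_scheme(path: str) -> bool:
    """Simplified scheme test: a letter, then letters/digits/'+'/'-'/'.', then ':'."""
    head, colon, _ = path.partition(":")
    return (
        colon == ":"
        and head != ""
        and head[0] in string.ascii_letters
        and set(head) <= _SCHEME_CHARS
    )


def resolve(table_location: str, path: str) -> str:
    """Resolution as quoted in the diff: absolute paths pass through,
    relative paths are appended to the table location with a single '/'."""
    if has_scheme(path):
        return path
    return table_location + SEP + path


def relativize_lax(table_location: str, path: str) -> str:
    """Lax reading: strip the prefix, then strip the separator if present."""
    if not path.startswith(table_location):
        return path
    rest = path[len(table_location):]
    return rest[len(SEP):] if rest.startswith(SEP) else rest


def relativize_strict(table_location: str, path: str) -> str:
    """Strict reading: relativize only when the prefix is followed by '/'."""
    prefix = table_location + SEP
    if path.startswith(prefix):
        return path[len(prefix):]
    return path


if __name__ == "__main__":
    table = "s3://bucket/table"
    sibling = "s3://bucket/table_v2/file"
    # Lax reading: the sibling path is relativized and no longer round-trips.
    print(relativize_lax(table, sibling))                  # _v2/file
    print(resolve(table, relativize_lax(table, sibling)))  # s3://bucket/table/_v2/file
    # Strict reading: the sibling path is stored unchanged.
    print(relativize_strict(table, sibling))               # s3://bucket/table_v2/file
    # A trailing separator on the table location yields a double slash.
    print(resolve("s3://bucket/table/", "data/file"))      # s3://bucket/table//data/file
```

With the strict variant, the sibling path is stored unchanged and therefore resolves back to itself; with the lax variant it resolves to `s3://bucket/table/_v2/file`, a different object. The last call shows why a table location with a trailing separator needs either a `must` or an explicit normalization rule.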
