stevenzwu commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3251040825
########## format/spec.md: ########## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + +#### Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. + +Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. + +Examples of path resolution: + +| | Format Version | Table Location | File Path | Resolved Path | Description | +|-----------------|----------------|----------------------|------------------------------------------|-------------------------------------------|-------------| +| Relative Path | v4 | s3://bucket/db/table | data/00000-0.parquet | s3://bucket/db/table/data/00000-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/00000-0.parquet | hdfs://wh/db/table/data/00000-0.parquet | Absolute path is used | +| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/00000-0.parquet | s3://bucket/db/table/data/00000-0.parquet | Fully-qualified path is used | +| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/00000-0.parquet | hdfs:/wh/db/table/data/00000-0.parquet | Scheme is prepended for consistency | + +#### Path Relativization + +Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files. + +* If an absolute path starts with the table location, the table location prefix should be removed along with the separator character and the remaining relative portion stored. +* If an absolute path does not start with the table location, it is stored as an absolute path. + +#### Table Location Specification + +When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the table location must be provided. How the table location is persisted or determined when not specified in metadata is not a table-level concern; catalogs should and provide a table's location Review Comment: Two issues in this paragraph: 1. **Grammatical error**: "catalogs should and provide a table's location" is missing a verb (e.g. "manage and provide"), and the sentence has no closing period. 2. **Implicit actor**: "the table location must be provided" doesn't say by whom, then the next clause silently assumes the catalog. Suggested rewrite: > When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the catalog must provide the table location. How the catalog persists or determines a table's location is outside the scope of this spec. Changes: - Names the catalog as the actor in the second sentence. - Folds "catalogs should and provide a table's location" into the new third sentence, replacing the broken clause and adding the missing period. - "is not a table-level concern" → "is outside the scope of this spec" — clearer about what the spec is leaving unspecified. ########## format/spec.md: ########## @@ -1647,6 +1732,30 @@ The binary single-value serialization can be used to store the lower and upper b ## Appendix E: Format version changes +### Version 4 + +Relative path support is added in v4. + +Reading v3 metadata for v4: + +* All location fields are fully-qualified paths and interpreted as absolute paths for v4 +* Any location field without a uri scheme prefix must prepend a scheme component consistent with v4 absolute paths + +Writing v4 metadata: + +* Table metadata JSON: + * `location` is now optional and must be absolute when present + * When not present, the table location must be managed externally and provided when loading the metadata +* Location fields in all metadata structures may contain relative paths +* Writers should produce relative paths by default for files that reside under the table location +* Absolute paths must be used for files that do not share a common prefix with the table location Review Comment: Same "shares a common prefix" ambiguity flagged in the inline at line 1905. Line 1750 above ("files that reside under the table location") is also informal terminology for what is now formally defined in [Path Relativization](#path-relativization). Suggested rewording for lines 1750–1751: > * Writers should produce relative paths by default for files whose absolute path starts with the table location followed by the URI separator `/`. > * Absolute paths must be used for all other files. This way the writer rule is exactly the Path Relativization rule applied — no chance of producing a relative form that doesn't round-trip safely under table relocation. ########## format/spec.md: ########## @@ -1777,6 +1886,24 @@ Note that these requirements apply when writing data to a v2 table. Tables that This section covers topics not required by the specification but recommendations for systems implementing the Iceberg specification to help maintain a uniform experience. +### Path Construction + +Path construction is the process by which new file locations are created for output files referenced by metadata. While the specific construction logic is not strictly required by the spec, the following guidance is provided for reference implementations to encourage consistency. + +The table properties `write.metadata.path` and `write.data.path` control where metadata and data files are written relative to the table location. When not specified, these default to the values `metadata` and `data` respectively. Review Comment: This sentence contradicts the bullets directly below. The opening says the properties control where files are written **"relative to the table location,"** but the bullets at lines 1897 and 1902 explicitly handle the case where `write.metadata.path` / `write.data.path` is an **absolute** path that's used directly, bypassing the table location. Suggested rewrite: > The table properties `write.metadata.path` and `write.data.path` control where metadata and data files are written. Each property may be an absolute path or a relative path; relative paths are resolved against the table location. When not specified, these default to the relative values `metadata` and `data`. This removes the contradiction and surfaces the absolute/relative distinction upfront so the bullets that follow read as enumeration rather than exceptions. ########## format/spec.md: ########## @@ -1647,6 +1732,30 @@ The binary single-value serialization can be used to store the lower and upper b ## Appendix E: Format version changes +### Version 4 + +Relative path support is added in v4. Review Comment: I feel we probably need to introduce sub headers for Version 4, as we have so many changes scoped for V4. if each feature has multiple paragraphs, it is kind to track. The alternative is that each feature only has bullet points. E.g. the row lineage in V3 section below has a large bullet list. I am slightly favor the former. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
