mkaravel commented on code in PR #10981: URL: https://github.com/apache/iceberg/pull/10981#discussion_r1879219879
########## format/spec.md: ########## @@ -205,13 +205,18 @@ Supported primitive types are defined in the table below. Primitive types added | | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed | | | **`fixed(L)`** | Fixed-length byte array of length L | | | | **`binary`** | Arbitrary-length byte array | | +| [v3](#version-3) | **`geometry(C)`** | Geometry features from [OGC – Simple feature access](https://portal.ogc.org/files/?artifact_id=25355). Edges interpolation is always linear/planar. See [Appendix G](#appendix-g-geospatial-notes). Parameterized by crs C [4]. If not specified, C is `OGC:CRS84`. | | +| [v3](#version-3) | **`geography(C, A)`** | Geometry features from [OGC – Simple feature access](https://portal.ogc.org/files/?artifact_id=25355). See [Appendix G](#appendix-g-geospatial-notes). Parameterized by crs C[5] and edge-interpolation algoritm A [6]. If not specified, C is `OGC:CRS84`. | | + Notes: 1. Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). 2. Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). 3. Character strings must be stored as UTF-8 encoded byte arrays. - +4. CRS (coordinate reference system) is a mapping of how coordinates refer to locations on earth. See [Appendix G](#appendix-g-geospatial-notes) for specifying custom CRS. If this field is null (no custom crs provided), CRS defaults to `OGC:CRS84`, which means the data must be stored in longitude, latitude based on the WGS84 datum. Fixed and cannot be changed by schema evolution. Review Comment: ```suggestion 4. CRS (coordinate reference system) is a mapping of how coordinates refer to locations on Earth. See [Appendix G](#appendix-g-geospatial-notes) for specifying custom CRS. If this field is null (no custom crs provided), CRS defaults to `OGC:CRS84`, which means the data must be stored in longitude, latitude based on the WGS84 datum. Fixed and cannot be changed by schema evolution. ``` nit ########## format/spec.md: ########## @@ -1480,6 +1497,9 @@ This serialization scheme is for storing single values as individual binary valu | **`struct`** | Not supported | | **`list`** | Not supported | | **`map`** | Not supported | +| **`geometry`** | A single point, encoded as a {x, y, optional z, optional m} concatenation of its 8-byte little-endian IEEE 754 coordinate values, with the optional coordinates encoded as NaN if unset. | +| **`geography`** | A single point, encoded as a {x, y, optional z, optional m} concatenation of its 8-byte little-endian IEEE 754 coordinate values, with the optional coordinates encoded as NaN if unset. | Review Comment: I am a bit confused here. This is, technically, not WKB (referred to above). This is just 4 double values stored in little-endian format. WKB for points has 1 byte for the endianness, 4 bytes for the type and then up to 4 doubles for the coordinates. Why is this different from WKB here? ########## format/spec.md: ########## @@ -603,6 +608,10 @@ Notes: 4. Position delete metadata can use `referenced_data_file` when all deletes tracked by the entry are in a single data file. Setting the referenced file is required for deletion vectors. 5. The `content_offset` and `content_size_in_bytes` fields are used to reference a specific blob for direct access to a deletion vector. For deletion vectors, these values are required and must exactly match the `offset` and `length` stored in the Puffin footer for the deletion vector blob. 6. The following field ids are reserved on `data_file`: 141. +7. `geometry`, this is a point: X, Y, Z, and M take the min value of all component points of all geometries in file. See Appendix D for encoding. +8. `geography`, this is a point: X = westernmost bound of all geometries in file, Y = northernmost bound of all geometries in file, Z is min value for all component points of all geometries in the file, M is min value of all component points of all geometries in the file. See Appendix D for encoding. +9. `geometry`, this is a point: X, Y, Z, and M take the max value of all component points of all geometries in file. See Appendix D for encoding. +10. `geography`, this is a point: X = easternmost bound of all geometries in file, Y = southernmost bound of all geometries in file, Z is max value for all component points of all geometries in the file, M is max value of all component points of all geometries in the file. See Appendix D for encoding. Review Comment: Same here (obviously for the easternmost part). ########## format/spec.md: ########## @@ -603,6 +608,10 @@ Notes: 4. Position delete metadata can use `referenced_data_file` when all deletes tracked by the entry are in a single data file. Setting the referenced file is required for deletion vectors. 5. The `content_offset` and `content_size_in_bytes` fields are used to reference a specific blob for direct access to a deletion vector. For deletion vectors, these values are required and must exactly match the `offset` and `length` stored in the Puffin footer for the deletion vector blob. 6. The following field ids are reserved on `data_file`: 141. +7. `geometry`, this is a point: X, Y, Z, and M take the min value of all component points of all geometries in file. See Appendix D for encoding. +8. `geography`, this is a point: X = westernmost bound of all geometries in file, Y = northernmost bound of all geometries in file, Z is min value for all component points of all geometries in the file, M is min value of all component points of all geometries in the file. See Appendix D for encoding. Review Comment: This definition works in almost all cases, except when the longitudes span the entire space (think boundary of Antarctica). I am not sure this is important, and whether it needs to be specified further. ########## format/spec.md: ########## @@ -584,8 +589,8 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo | _optional_ | _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column | | _optional_ | _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column | | _optional_ | _optional_ | _optional_ | **`111 distinct_counts`** | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts | -| _optional_ | _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] | -| _optional_ | _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] | +| _optional_ | _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2]. See [7] for`geometry` and [8] for `geography`. | +| _optional_ | _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2]. See [9] for `geometry` and [10] for `geography`. | Review Comment: I am not very familiar with the Iceberg spec. Does this accommodate for the geographic case where the "minimum" longitude is larger than the "maximum" longitude (or if you want the lower bound is larger than the upper bound for longitudes)? ########## format/spec.md: ########## @@ -205,13 +205,18 @@ Supported primitive types are defined in the table below. Primitive types added | | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed | | | **`fixed(L)`** | Fixed-length byte array of length L | | | | **`binary`** | Arbitrary-length byte array | | +| [v3](#version-3) | **`geometry(C)`** | Geometry features from [OGC – Simple feature access](https://portal.ogc.org/files/?artifact_id=25355). Edges interpolation is always linear/planar. See [Appendix G](#appendix-g-geospatial-notes). Parameterized by crs C [4]. If not specified, C is `OGC:CRS84`. | | Review Comment: ```suggestion | [v3](#version-3) | **`geometry(C)`** | Geometry features from [OGC – Simple feature access](https://portal.ogc.org/files/?artifact_id=25355). Edges interpolation is always linear. See [Appendix G](#appendix-g-geospatial-notes). Parameterized by crs C [4]. If not specified, C is `OGC:CRS84`. | | ``` In this sentence I see "edges" and "linear/planar". I am not sure what "planar edge" means from the mathematical point of view, one could assume that we are talking about a linear edge embedded in a plane. This does not fit very well for geometries with more than 2 dimensions (I feel it is confusing). I think it is more than enough to say just "linear". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org