mkaravel commented on code in PR #10981:
URL: https://github.com/apache/iceberg/pull/10981#discussion_r1879219879


##########
format/spec.md:
##########
@@ -205,13 +205,18 @@ Supported primitive types are defined in the table below. 
Primitive types added
 |                  | **`uuid`**         | Universally unique identifiers       
                                    | Should use 16-byte fixed                  
       |
 |                  | **`fixed(L)`**     | Fixed-length byte array of length L  
                                    |                                           
       |
 |                  | **`binary`**       | Arbitrary-length byte array          
                                    |                                           
       |
+| [v3](#version-3) | **`geometry(C)`**  | Geometry features from [OGC – Simple 
feature access](https://portal.ogc.org/files/?artifact_id=25355). Edges 
interpolation is always linear/planar. See [Appendix 
G](#appendix-g-geospatial-notes). Parameterized by crs C [4]. If not specified, 
C is `OGC:CRS84`. |                                                        |
+| [v3](#version-3) | **`geography(C, A)`**  | Geometry features from [OGC – 
Simple feature access](https://portal.ogc.org/files/?artifact_id=25355). See 
[Appendix G](#appendix-g-geospatial-notes). Parameterized by crs C[5] and 
edge-interpolation algoritm A [6]. If not specified, C is `OGC:CRS84`. |        
                                                |
+
 
 Notes:
 
 1. Timestamp values _without time zone_ represent a date and time of day 
regardless of zone: the time value is independent of zone adjustments 
(`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`).
 2. Timestamp values _with time zone_ represent a point in time: values are 
stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` 
is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are 
considered identical).
 3. Character strings must be stored as UTF-8 encoded byte arrays.
-
+4. CRS (coordinate reference system) is a mapping of how coordinates refer to 
locations on earth. See [Appendix G](#appendix-g-geospatial-notes) for 
specifying custom CRS. If this field is null (no custom crs provided), CRS 
defaults to `OGC:CRS84`, which means the data must be stored in longitude, 
latitude based on the WGS84 datum. Fixed and cannot be changed by schema 
evolution.

Review Comment:
   ```suggestion
   4. CRS (coordinate reference system) is a mapping of how coordinates refer 
to locations on Earth. See [Appendix G](#appendix-g-geospatial-notes) for 
specifying custom CRS. If this field is null (no custom crs provided), CRS 
defaults to `OGC:CRS84`, which means the data must be stored in longitude, 
latitude based on the WGS84 datum. Fixed and cannot be changed by schema 
evolution.
   ```
   nit



##########
format/spec.md:
##########
@@ -1480,6 +1497,9 @@ This serialization scheme is for storing single values as 
individual binary valu
 | **`struct`**                 | Not supported                                 
                                                               |
 | **`list`**                   | Not supported                                 
                                                               |
 | **`map`**                    | Not supported                                 
                                                               |
+| **`geometry`**               | A single point, encoded as a {x, y, optional 
z, optional m} concatenation of its 8-byte little-endian IEEE 754 coordinate 
values, with the optional coordinates encoded as NaN if unset. |
+| **`geography`**              | A single point, encoded as a {x, y, optional 
z, optional m} concatenation of its 8-byte little-endian IEEE 754 coordinate 
values, with the optional coordinates encoded as NaN if unset. |

Review Comment:
   I am a bit confused here. This is, technically, not WKB (referred to above). 
This is just 4 double values stored in little-endian format.
   WKB for points has 1 byte for the endianness, 4 bytes for the type and then 
up to 4 doubles for the coordinates.
   Why is this different from WKB here?



##########
format/spec.md:
##########
@@ -603,6 +608,10 @@ Notes:
 4. Position delete metadata can use `referenced_data_file` when all deletes 
tracked by the entry are in a single data file. Setting the referenced file is 
required for deletion vectors.
 5. The `content_offset` and `content_size_in_bytes` fields are used to 
reference a specific blob for direct access to a deletion vector. For deletion 
vectors, these values are required and must exactly match the `offset` and 
`length` stored in the Puffin footer for the deletion vector blob.
 6. The following field ids are reserved on `data_file`: 141.
+7. `geometry`, this is a point: X, Y, Z, and M take the min value of all 
component points of all geometries in file. See Appendix D for encoding.
+8. `geography`, this is a point: X = westernmost bound of all geometries in 
file, Y = northernmost bound of all geometries in file, Z is min value for all 
component points of all geometries in the file, M is min value of all component 
points of all geometries in the file. See Appendix D for encoding.
+9. `geometry`, this is a point: X, Y, Z, and M take the max value of all 
component points of all geometries in file. See Appendix D for encoding.
+10. `geography`, this is a point: X = easternmost bound of all geometries in 
file, Y = southernmost bound of all geometries in file, Z is max value for all 
component points of all geometries in the file, M is max value of all component 
points of all geometries in the file. See Appendix D for encoding.

Review Comment:
   Same here (obviously for the easternmost part).



##########
format/spec.md:
##########
@@ -603,6 +608,10 @@ Notes:
 4. Position delete metadata can use `referenced_data_file` when all deletes 
tracked by the entry are in a single data file. Setting the referenced file is 
required for deletion vectors.
 5. The `content_offset` and `content_size_in_bytes` fields are used to 
reference a specific blob for direct access to a deletion vector. For deletion 
vectors, these values are required and must exactly match the `offset` and 
`length` stored in the Puffin footer for the deletion vector blob.
 6. The following field ids are reserved on `data_file`: 141.
+7. `geometry`, this is a point: X, Y, Z, and M take the min value of all 
component points of all geometries in file. See Appendix D for encoding.
+8. `geography`, this is a point: X = westernmost bound of all geometries in 
file, Y = northernmost bound of all geometries in file, Z is min value for all 
component points of all geometries in the file, M is min value of all component 
points of all geometries in the file. See Appendix D for encoding.

Review Comment:
   This definition works in almost all cases, except when the longitudes span 
the entire space (think boundary of Antarctica). I am not sure this is 
important, and whether it needs to be specified further.



##########
format/spec.md:
##########
@@ -584,8 +589,8 @@ The schema of a manifest file is a struct called 
`manifest_entry` with the follo
 | _optional_ | _optional_ | _optional_ | **`110  null_value_counts`**      | 
`map<121: int, 122: long>`                                                  | 
Map from column id to number of null values in the column                       
                                                                                
                                                   |
 | _optional_ | _optional_ | _optional_ | **`137  nan_value_counts`**       | 
`map<138: int, 139: long>`                                                  | 
Map from column id to number of NaN values in the column                        
                                                                                
                                                   |
 | _optional_ | _optional_ | _optional_ | **`111  distinct_counts`**        | 
`map<123: int, 124: long>`                                                  | 
Map from column id to number of distinct values in the column; distinct counts 
must be derived using values in the file by counting or using sketches, but not 
using methods like merging existing distinct counts |
-| _optional_ | _optional_ | _optional_ | **`125  lower_bounds`**           | 
`map<126: int, 127: binary>`                                                | 
Map from column id to lower bound in the column serialized as binary [1]. Each 
value must be less than or equal to all non-null, non-NaN values in the column 
for the file [2]                                     |
-| _optional_ | _optional_ | _optional_ | **`128  upper_bounds`**           | 
`map<129: int, 130: binary>`                                                | 
Map from column id to upper bound in the column serialized as binary [1]. Each 
value must be greater than or equal to all non-null, non-Nan values in the 
column for the file [2]                                  |
+| _optional_ | _optional_ | _optional_ | **`125  lower_bounds`**           | 
`map<126: int, 127: binary>`                                                | 
Map from column id to lower bound in the column serialized as binary [1]. Each 
value must be less than or equal to all non-null, non-NaN values in the column 
for the file [2]. See [7] for`geometry` and [8] for `geography`.  |
+| _optional_ | _optional_ | _optional_ | **`128  upper_bounds`**           | 
`map<129: int, 130: binary>`                                                | 
Map from column id to upper bound in the column serialized as binary [1]. Each 
value must be greater than or equal to all non-null, non-Nan values in the 
column for the file [2]. See [9] for `geometry` and [10] for `geography`. |

Review Comment:
   I am not very familiar with the Iceberg spec. Does this accommodate for the 
geographic case where the "minimum" longitude is larger than the "maximum" 
longitude (or if you want the lower bound is larger than the upper bound for 
longitudes)?



##########
format/spec.md:
##########
@@ -205,13 +205,18 @@ Supported primitive types are defined in the table below. 
Primitive types added
 |                  | **`uuid`**         | Universally unique identifiers       
                                    | Should use 16-byte fixed                  
       |
 |                  | **`fixed(L)`**     | Fixed-length byte array of length L  
                                    |                                           
       |
 |                  | **`binary`**       | Arbitrary-length byte array          
                                    |                                           
       |
+| [v3](#version-3) | **`geometry(C)`**  | Geometry features from [OGC – Simple 
feature access](https://portal.ogc.org/files/?artifact_id=25355). Edges 
interpolation is always linear/planar. See [Appendix 
G](#appendix-g-geospatial-notes). Parameterized by crs C [4]. If not specified, 
C is `OGC:CRS84`. |                                                        |

Review Comment:
   ```suggestion
   | [v3](#version-3) | **`geometry(C)`**  | Geometry features from [OGC – 
Simple feature access](https://portal.ogc.org/files/?artifact_id=25355). Edges 
interpolation is always linear. See [Appendix G](#appendix-g-geospatial-notes). 
Parameterized by crs C [4]. If not specified, C is `OGC:CRS84`. |               
                                         |
   ```
   In this sentence I see "edges" and "linear/planar". I am not sure what 
"planar edge" means from the mathematical point of view, one could assume that 
we are talking about a linear edge embedded in a plane. This does not fit very 
well for geometries with more than 2 dimensions (I feel it is confusing).
   I think it is more than enough to say just "linear".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to