Re: [PR] Spec: Support geo type [iceberg]

via GitHub Fri, 20 Dec 2024 08:31:58 -0800


paleolimbot commented on code in PR #10981:
URL: https://github.com/apache/iceberg/pull/10981#discussion_r1894143585



##########
format/spec.md:
##########
@@ -584,8 +589,8 @@ The schema of a manifest file is a struct called 
`manifest_entry` with the follo
 | _optional_ | _optional_ | _optional_ | **`110  null_value_counts`**      | 
`map<121: int, 122: long>`                                                  | 
Map from column id to number of null values in the column                       
                                                                                
                                                   |
 | _optional_ | _optional_ | _optional_ | **`137  nan_value_counts`**       | 
`map<138: int, 139: long>`                                                  | 
Map from column id to number of NaN values in the column                        
                                                                                
                                                   |
 | _optional_ | _optional_ | _optional_ | **`111  distinct_counts`**        | 
`map<123: int, 124: long>`                                                  | 
Map from column id to number of distinct values in the column; distinct counts 
must be derived using values in the file by counting or using sketches, but not 
using methods like merging existing distinct counts |
-| _optional_ | _optional_ | _optional_ | **`125  lower_bounds`**           | 
`map<126: int, 127: binary>`                                                | 
Map from column id to lower bound in the column serialized as binary [1]. Each 
value must be less than or equal to all non-null, non-NaN values in the column 
for the file [2]                                     |
-| _optional_ | _optional_ | _optional_ | **`128  upper_bounds`**           | 
`map<129: int, 130: binary>`                                                | 
Map from column id to upper bound in the column serialized as binary [1]. Each 
value must be greater than or equal to all non-null, non-Nan values in the 
column for the file [2]                                  |
+| _optional_ | _optional_ | _optional_ | **`125  lower_bounds`**           | 
`map<126: int, 127: binary>`                                                | 
Map from column id to lower bound in the column serialized as binary [1]. Each 
value must be less than or equal to all non-null, non-NaN values in the column 
for the file [2]. See [7] for`geometry` and [8] for `geography`.  |
+| _optional_ | _optional_ | _optional_ | **`128  upper_bounds`**           | 
`map<129: int, 130: binary>`                                                | 
Map from column id to upper bound in the column serialized as binary [1]. Each 
value must be greater than or equal to all non-null, non-Nan values in the 
column for the file [2]. See [9] for `geometry` and [10] for `geography`. |

Review Comment:
   Apologies for not including that option...I did try to type it but it was 
hard (as you noted) to fit it into the existing bullet points. Here's a version:
   
   ```
   7. `geometry`, this is a point: X, Y, Z, and M take the min value of all 
component points of all geometries in file. For the X value only, this is 
permitted to be greater than the maximum X value, in which case potential 
intersection is implied by `x >= xmin` OR `x <= xmax`. See Appendix D for 
encoding.
   8. `geography`, this is a point: X = westernmost bound of all geometries in 
file, Y = southernmost bound of all geometries in file, Z is min value for all 
component points of all geometries in the file, M is min value of all component 
points of all geometries in the file. See Appendix D for encoding.
   9. `geometry`, this is a point: X, Y, Z, and M take the max value of all 
component points of all geometries in file. For the X value only, this is 
permitted to be less than the minimum X value, in which case potential 
intersection is implied by `x >= xmin` OR `x <= xmax`. See Appendix D for 
encoding.
   10. `geography`, this is a point: X = easternmost bound of all geometries in 
file, Y = northernmost bound of all geometries in file, Z is max value for all 
component points of all geometries in the file, M is max value of all component 
points of all geometries in the file. See Appendix D for encoding.
   11. `geography`, the concepts of westernmost and easternmost values are 
explicitly introduced to address cases involving anti-meridian crossing, where 
the `lower_bound` may be greater than `upper_bound`. For `geometry` we use a 
mathematical definition to ensure that implementations do not need to consider 
the CRS when checking two boxes for potential intersection. The canonical 
ranges for the bounding box covering all points in the coordinate system are 
[-180, 180] for the west-east range and [-90, 90] for the south-north range.
   ```
   
   We can't quite collapse the geography definition into the geometry 
definition (or at least, I can't figure out the language to do it concisely) 
because `component points` is not the right concept for geography (bounds could 
be the northern/southern extent of a curved-in-lon-lat-Cartesian-space edge).
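   
   For illustration only, here is a rough sketch of how a reader could test two 
`geometry` bounding boxes for potential intersection under the wraparound rule 
in items 7 and 9, without looking at the CRS. The `XYBox` class and the function 
names are hypothetical (they are not part of the spec or of any Iceberg API), 
and Z/M are ignored to keep it short:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XYBox:
    """Hypothetical holder for the X/Y parts of lower_bound and upper_bound."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def x_contains(box: XYBox, x: float) -> bool:
    """True if x falls in the box's X range, allowing xmin > xmax (wraparound)."""
    if box.xmin <= box.xmax:
        return box.xmin <= x <= box.xmax
    # Wraparound case: the range is everything >= xmin OR <= xmax.
    return x >= box.xmin or x <= box.xmax


def may_intersect(a: XYBox, b: XYBox) -> bool:
    """Conservative check: False only if the boxes cannot intersect."""
    # Two (possibly wrapping) X ranges overlap exactly when the start of one
    # lies inside the other.
    x_overlap = x_contains(a, b.xmin) or x_contains(b, a.xmin)
    # Y never wraps, so the usual interval overlap test applies.
    y_overlap = a.ymin <= b.ymax and b.ymin <= a.ymax
    return x_overlap and y_overlap
```

   With a file box of `XYBox(170, -10, -170, 10)` (xmin > xmax), a query box 
on either side of the wrap point is kept and a query box around X = 0 is 
pruned, using nothing but the comparisons above.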
   
   > filter pushdown on antimeridian crossing objects in Geometry type is 
impossible
   
   I would perhaps phrase it as "highly ineffective" (which is still a good 
reason to include this! 🙂 )



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


