Re: [PR] Spec: Support geo type [iceberg]

via GitHub Thu, 19 Dec 2024 10:07:25 -0800


paleolimbot commented on code in PR #10981:
URL: https://github.com/apache/iceberg/pull/10981#discussion_r1892944301



##########
format/spec.md:
##########
@@ -584,8 +589,8 @@ The schema of a manifest file is a struct called 
`manifest_entry` with the follo
 | _optional_ | _optional_ | _optional_ | **`110  null_value_counts`**      | 
`map<121: int, 122: long>`                                                  | 
Map from column id to number of null values in the column                       
                                                                                
                                                   |
 | _optional_ | _optional_ | _optional_ | **`137  nan_value_counts`**       | 
`map<138: int, 139: long>`                                                  | 
Map from column id to number of NaN values in the column                        
                                                                                
                                                   |
 | _optional_ | _optional_ | _optional_ | **`111  distinct_counts`**        | 
`map<123: int, 124: long>`                                                  | 
Map from column id to number of distinct values in the column; distinct counts 
must be derived using values in the file by counting or using sketches, but not 
using methods like merging existing distinct counts |
-| _optional_ | _optional_ | _optional_ | **`125  lower_bounds`**           | 
`map<126: int, 127: binary>`                                                | 
Map from column id to lower bound in the column serialized as binary [1]. Each 
value must be less than or equal to all non-null, non-NaN values in the column 
for the file [2]                                     |
-| _optional_ | _optional_ | _optional_ | **`128  upper_bounds`**           | 
`map<129: int, 130: binary>`                                                | 
Map from column id to upper bound in the column serialized as binary [1]. Each 
value must be greater than or equal to all non-null, non-Nan values in the 
column for the file [2]                                  |
+| _optional_ | _optional_ | _optional_ | **`125  lower_bounds`**           | 
`map<126: int, 127: binary>`                                                | 
Map from column id to lower bound in the column serialized as binary [1]. Each 
value must be less than or equal to all non-null, non-NaN values in the column 
for the file [2]. See [7] for`geometry` and [8] for `geography`.  |
+| _optional_ | _optional_ | _optional_ | **`128  upper_bounds`**           | 
`map<129: int, 130: binary>`                                                | 
Map from column id to upper bound in the column serialized as binary [1]. Each 
value must be greater than or equal to all non-null, non-Nan values in the 
column for the file [2]. See [9] for `geometry` and [10] for `geography`. |

Review Comment:
   For what it's worth I also think this is a good idea (if worded accurately). 
I think there have been some explanations in past threads but in case the 
motivation is not clear: without the ability to generate a bounding box that 
crosses the antimeridian, writers would be forced to write incredibly large 
bounding boxes for very small features that happen to have edges on either wide 
of -180/180. A concrete way to grasp the type of performance improvement this 
gives is by searching for "Fiji" in Google Maps (which implements something 
like this wraparound bounding box and helpfully shows you Fiji) and 
OpenStreetMap (which doesn't, and unhelpfully shows you the whole world).
   
   I would personally consider this an optional quality-of-life improvement for 
users, who will be able to achieve faster queries because the bounding boxes 
are more appropriate.
   
   > And, yes, I think you should collapse 7+8 and 9+10.
   
   I believe the mathematical definition that would work for both (because a 
non-geography type has no concept of west/east/north/south) would be:
   
   Whereas the case where `xmin <= xmax` requires that `x >= xmin` AND  `x <= 
xmax` to imply potential intersection, the case where `xmin > xmax` implies a 
potential intersection if `x >= xmin` OR `x <= xmax`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Spec: Support geo type [iceberg]

Reply via email to