rdblue commented on code in PR #10831:
URL: https://github.com/apache/iceberg/pull/10831#discussion_r1815771516


##########
format/spec.md:
##########
@@ -1297,54 +1308,56 @@ Example
 
 This serialization scheme is for storing single values as individual binary 
values in the lower and upper bounds maps of manifest files.
 
-| Type                         | Binary serialization                          
                                                               |
-|------------------------------|--------------------------------------------------------------------------------------------------------------|
-| **`unknown`**                | Not supported                                 
                                                               |
-| **`boolean`**                | `0x00` for false, non-zero byte for true      
                                                               |
-| **`int`**                    | Stored as 4-byte little-endian                
                                                               |
-| **`long`**                   | Stored as 8-byte little-endian                
                                                               |
-| **`float`**                  | Stored as 4-byte little-endian                
                                                               |
-| **`double`**                 | Stored as 8-byte little-endian                
                                                               |
-| **`date`**                   | Stores days from the 1970-01-01 in an 4-byte 
little-endian int                                               |
-| **`time`**                   | Stores microseconds from midnight in an 
8-byte little-endian long                                            |
-| **`timestamp`**              | Stores microseconds from 1970-01-01 
00:00:00.000000 in an 8-byte little-endian long                          |
-| **`timestamptz`**            | Stores microseconds from 1970-01-01 
00:00:00.000000 UTC in an 8-byte little-endian long                      |
-| **`timestamp_ns`**           | Stores nanoseconds from 1970-01-01 
00:00:00.000000000 in an 8-byte little-endian long                        |
-| **`timestamptz_ns`**         | Stores nanoseconds from 1970-01-01 
00:00:00.000000000 UTC in an 8-byte little-endian long                    |
-| **`string`**                 | UTF-8 bytes (without length)                  
                                                               |
-| **`uuid`**                   | 16-byte big-endian value, see example in 
Appendix B                                                          |
-| **`fixed(L)`**               | Binary value                                  
                                                               |
-| **`binary`**                 | Binary value (without length)                 
                                                               |
-| **`decimal(P, S)`**          | Stores unscaled value as two’s-complement 
big-endian binary, using the minimum number of bytes for the value |
-| **`struct`**                 | Not supported                                 
                                                               |
-| **`list`**                   | Not supported                                 
                                                               |
-| **`map`**                    | Not supported                                 
                                                               |
+| Type                 | Binary serialization                                  
                                                       |
+|----------------------|--------------------------------------------------------------------------------------------------------------|
+| **`unknown`**        | Not supported                                         
                                                       |
+| **`boolean`**        | `0x00` for false, non-zero byte for true              
                                                       |
+| **`int`**            | Stored as 4-byte little-endian                        
                                                       |
+| **`long`**           | Stored as 8-byte little-endian                        
                                                       |
+| **`float`**          | Stored as 4-byte little-endian                        
                                                       |
+| **`double`**         | Stored as 8-byte little-endian                        
                                                       |
+| **`date`**           | Stores days from the 1970-01-01 in an 4-byte 
little-endian int                                               |
+| **`time`**           | Stores microseconds from midnight in an 8-byte 
little-endian long                                            |
+| **`timestamp`**      | Stores microseconds from 1970-01-01 00:00:00.000000 
in an 8-byte little-endian long                          |
+| **`timestamptz`**    | Stores microseconds from 1970-01-01 00:00:00.000000 
UTC in an 8-byte little-endian long                      |
+| **`timestamp_ns`**   | Stores nanoseconds from 1970-01-01 00:00:00.000000000 
in an 8-byte little-endian long                        |
+| **`timestamptz_ns`** | Stores nanoseconds from 1970-01-01 00:00:00.000000000 
UTC in an 8-byte little-endian long                    |
+| **`string`**         | UTF-8 bytes (without length)                          
                                                       |
+| **`uuid`**           | 16-byte big-endian value, see example in Appendix B   
                                                       |
+| **`fixed(L)`**       | Binary value                                          
                                                       |
+| **`binary`**         | Binary value (without length)                         
                                                       |
+| **`decimal(P, S)`**  | Stores unscaled value as two’s-complement big-endian 
binary, using the minimum number of bytes for the value |
+| **`struct`**         | Not supported                                         
                                                       |
+| **`list`**           | Not supported                                         
                                                       |
+| **`map`**            | Not supported                                         
                                                       |
+| **`variant`**        | Not supported                                         
                                                       |
 
 ### JSON single-value serialization
 
  Single values are serialized as JSON by type according to the following table:
 
-| Type               | JSON representation                       | Example     
                               | Description                                    
                                                                             |
-| ------------------ | ----------------------------------------- | 
------------------------------------------ | -- |
-| **`boolean`**      | **`JSON boolean`**                        | `true`      
                               | |
-| **`int`**          | **`JSON int`**                            | `34`        
                               | |
-| **`long`**         | **`JSON long`**                           | `34`        
                               | |
-| **`float`**        | **`JSON number`**                         | `1.0`       
                               | |
-| **`double`**       | **`JSON number`**                         | `1.0`       
                               | |
-| **`decimal(P,S)`** | **`JSON string`**                         | `"14.20"`, 
`"2E+20"`                       | Stores the string representation of the 
decimal value, specifically, for values with a positive scale, the number of 
digits to the right of the decimal point is used to indicate scale, for values 
with a negative scale, the scientific notation is used and the exponent must 
equal the negated scale |
-| **`date`**         | **`JSON string`**                         | 
`"2017-11-16"`                             | Stores ISO-8601 standard date |
-| **`time`**         | **`JSON string`**                         | 
`"22:31:08.123456"`                        | Stores ISO-8601 standard time with 
microsecond precision |
-| **`timestamp`**    | **`JSON string`**                         | 
`"2017-11-16T22:31:08.123456"`             | Stores ISO-8601 standard timestamp 
with microsecond precision; must not include a zone offset |
-| **`timestamptz`**  | **`JSON string`**                         | 
`"2017-11-16T22:31:08.123456+00:00"`       | Stores ISO-8601 standard timestamp 
with microsecond precision; must include a zone offset and it must be '+00:00' |
-| **`timestamp_ns`** | **`JSON string`**                         | 
`"2017-11-16T22:31:08.123456789"`          | Stores ISO-8601 standard timestamp 
with nanosecond precision; must not include a zone offset |
-| **`timestamptz_ns`** | **`JSON string`**                       | 
`"2017-11-16T22:31:08.123456789+00:00"`    | Stores ISO-8601 standard timestamp 
with nanosecond precision; must include a zone offset and it must be '+00:00' |
-| **`string`**       | **`JSON string`**                         | `"iceberg"` 
                               | |
-| **`uuid`**         | **`JSON string`**                         | 
`"f79c3e09-677c-4bbd-a479-3f349cb785e7"`   | Stores the lowercase uuid string |
-| **`fixed(L)`**     | **`JSON string`**                         | 
`"000102ff"`                               | Stored as a hexadecimal string |
-| **`binary`**       | **`JSON string`**                         | 
`"000102ff"`                               | Stored as a hexadecimal string |
-| **`struct`**       | **`JSON object by field ID`**             | `{"1": 1, 
"2": "bar"}`                     | Stores struct fields using the field ID as 
the JSON field name; field values are stored using this JSON single-value 
format |
-| **`list`**         | **`JSON array of values`**                | `[1, 2, 3]` 
                               | Stores a JSON array of values that are 
serialized using this JSON single-value format |
-| **`map`**          | **`JSON object of key and value arrays`** | `{ "keys": 
["a", "b"], "values": [1, 2] }` | Stores arrays of keys and values; individual 
keys and values are serialized using this JSON single-value format |
+| Type                 | JSON representation                                   
       | Example                                    | Description               
                                                                                
                                                                                
                                                                                
                               |
+|----------------------|--------------------------------------------------------------|--------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **`boolean`**        | **`JSON boolean`**                                    
       | `true`                                     |                           
                                                                                
                                                                                
                                                                                
                               |
+| **`int`**            | **`JSON int`**                                        
       | `34`                                       |                           
                                                                                
                                                                                
                                                                                
                               |
+| **`long`**           | **`JSON long`**                                       
       | `34`                                       |                           
                                                                                
                                                                                
                                                                                
                               |
+| **`float`**          | **`JSON number`**                                     
       | `1.0`                                      |                           
                                                                                
                                                                                
                                                                                
                               |
+| **`double`**         | **`JSON number`**                                     
       | `1.0`                                      |                           
                                                                                
                                                                                
                                                                                
                               |
+| **`decimal(P,S)`**   | **`JSON string`**                                     
       | `"14.20"`, `"2E+20"`                       | Stores the string 
representation of the decimal value, specifically, for values with a positive 
scale, the number of digits to the right of the decimal point is used to 
indicate scale, for values with a negative scale, the scientific notation is 
used and the exponent must equal the negated scale |
+| **`date`**           | **`JSON string`**                                     
       | `"2017-11-16"`                             | Stores ISO-8601 standard 
date                                                                            
                                                                                
                                                                                
                                |
+| **`time`**           | **`JSON string`**                                     
       | `"22:31:08.123456"`                        | Stores ISO-8601 standard 
time with microsecond precision                                                 
                                                                                
                                                                                
                                |
+| **`timestamp`**      | **`JSON string`**                                     
       | `"2017-11-16T22:31:08.123456"`             | Stores ISO-8601 standard 
timestamp with microsecond precision; must not include a zone offset            
                                                                                
                                                                                
                                |
+| **`timestamptz`**    | **`JSON string`**                                     
       | `"2017-11-16T22:31:08.123456+00:00"`       | Stores ISO-8601 standard 
timestamp with microsecond precision; must include a zone offset and it must be 
'+00:00'                                                                        
                                                                                
                                |
+| **`timestamp_ns`**   | **`JSON string`**                                     
       | `"2017-11-16T22:31:08.123456789"`          | Stores ISO-8601 standard 
timestamp with nanosecond precision; must not include a zone offset             
                                                                                
                                                                                
                                |
+| **`timestamptz_ns`** | **`JSON string`**                                     
       | `"2017-11-16T22:31:08.123456789+00:00"`    | Stores ISO-8601 standard 
timestamp with nanosecond precision; must include a zone offset and it must be 
'+00:00'                                                                        
                                                                                
                                 |
+| **`string`**         | **`JSON string`**                                     
       | `"iceberg"`                                |                           
                                                                                
                                                                                
                                                                                
                               |
+| **`uuid`**           | **`JSON string`**                                     
       | `"f79c3e09-677c-4bbd-a479-3f349cb785e7"`   | Stores the lowercase uuid 
string                                                                          
                                                                                
                                                                                
                               |
+| **`fixed(L)`**       | **`JSON string`**                                     
       | `"000102ff"`                               | Stored as a hexadecimal 
string                                                                          
                                                                                
                                                                                
                                 |
+| **`binary`**         | **`JSON string`**                                     
       | `"000102ff"`                               | Stored as a hexadecimal 
string                                                                          
                                                                                
                                                                                
                                 |
+| **`struct`**         | **`JSON object by field ID`**                         
       | `{"1": 1, "2": "bar"}`                     | Stores struct fields 
using the field ID as the JSON field name; field values are stored using this 
JSON single-value format                                                        
                                                                                
                                      |
+| **`list`**           | **`JSON array of values`**                            
       | `[1, 2, 3]`                                | Stores a JSON array of 
values that are serialized using this JSON single-value format                  
                                                                                
                                                                                
                                  |
+| **`map`**            | **`JSON object of key and value arrays`**             
       | `{ "keys": ["a", "b"], "values": [1, 2] }` | Stores arrays of keys and 
values; individual keys and values are serialized using this JSON single-value 
format                                                                          
                                                                                
                                |
+| **`variant`**        | **`Same JSON representation in this table for stored 
type`** | `null`, `true`, `{"1": 1, "2": "bar"}`      | The JSON representation 
matches the format shown in this table for the type stored in the Variant.      
                                                                                
                                                                                
                                 |

Review Comment:
   I think this is insufficient because it loses type information. For example, 
if there is a Variant that contains a timestamp, the timestamp's type is lost. 
This also appears to use the struct representation for an object, but a Variant 
has no field IDs.
   
   I think that we probably do want to use a JSON representation, but we will 
need to make sure that it aligns with the JSON conversion defined in the 
Variant spec 
(https://github.com/apache/parquet-format/pull/461/files#diff-80a56bf0d841087ab5038020e4b78119d6ceb44684719ba4d3a6e22effb36eb9R459)
 and that we have defined requirements for recovering the types.
   
   Note that the JSON conversion in the Variant spec differs from this section:
   * Binary values are hexadecimal strings here (:facepalm:) and base64 in 
Variant
   * Decimal values are strings here and numbers in Variant
   
   We may want to not allow Variant in JSON because of the type loss. Or we may 
want to specify a subset that can be recovered, like booleans, integers, 
floats, strings, and arrays. There are two places where this section is used: 
in the REST protocol for sending lower/upper bounds and in default values. We 
don't really need default values for variants (but would have to disallow them) 
and lower/upper bounds would generally work with a subset.
   
   The other option is to encode `variant` as a base64 binary value. That is a 
bit ugly in JSON, but it is not lossy.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to