Re: [I] Discussion: Rethink `PrimitiveLiteral`. [iceberg-rust]

via GitHub Tue, 09 Jan 2024 08:05:12 -0800


Fokko commented on issue #159:
URL: https://github.com/apache/iceberg-rust/issues/159#issuecomment-1883333878


   > Sometimes it's self contained, for simple types such as int, date, string.
   > Sometimes it's not self contained, for example decimal.
   
   What do you mean by self-contained? For example, why is a date 
self-contained, and a decimal not? 
   
   Let me elaborate on how we do things in PyIceberg, because I think the 
approach we have there is quite similar, but I don't think we need to have two 
distinct types of literals. I just want to make sure that we make the right 
decision here, because this will be hard to undo later on. Please bear with me 
here, since Rust is not OOP, I might not be able to comprehend the whole issue 
here 😛 But I'm sure that this might help as some food for thought and help us 
to find the right solution.
   
   Within PyIceberg we have `Literal` 
[ABC](https://docs.python.org/3/library/abc.html) that has all the primitive 
(str, int, bytes, float, double, decimal, etc), but also the logical ones 
(date, uuid, etc). When you first do the expression `a < 3.25` we have the 
[function `literal(val: 
Any)`](https://github.com/apache/iceberg-python/blob/67c07f2ea745ac45ff7930ecd6fc5b98b88b5328/pyiceberg/expressions/literals.py#L122-L149)
 that's called by the parser. This will result in `LessThan(Ref('a'), 
FloatLiteral(3.25))` in this case because that's the representation in Python 
(`type(3.25) => <class 'float'>`). But the point is that we don't care at this 
moment because we don't know yet the type of the schema. Once we start planning 
the query, we'll look up the type of `Ref('a')` in the schema, and we find out 
that it is a Decimal. What we do then, is 
`FloatLiteral(3.25).to(DecimalType())` and we try to parse it to a decimal. If 
it fails, we'll raise an exception.
   
   I think this goes back to an earlier discussion:
   
   ```
   pub enum Datum {
      bool(bool),
       int(i32),
       ...
       datetime(DateTime<Utc>),
       decimal(Decimal),
   }
   ```
   
   Internally, the DateTime in Iceberg is milliseconds since epoch. And I think 
that also should be the internal representation for simplicity and performance 
reasons. What I would suggest is to have a similar structure where you can just 
accept a `DateTime<Utc>`, `i32`, `i64` that later can be mapped to the actual 
type. This way you give the user the freedom to use the type that they like, 
but also the flexibility to use the same structure when you parse SQL (where 
you can just assume that everything is a string, and you'll parse it to the 
correct type when binding).
   
   > And we also should deal with the case that if Decimal(Decimal) is i256.
   
   With the approach mentioned above, I don't think this is a problem, since we 
know the size of the Decimal when binding to the type (we just take the size 
that's appropriate for the precision of the type). Looking ahead, also 
nanosecond timestamps are being added with just the physical type i64, but 
there is a fair chance that i128 might be added later.
   
   Another similar situation that tackles this approach is schema evolution. 
Let's say that at some point you promote a float to double field. In that case, 
you want to bind to the _file-schema_, this way you'll convert the literal to a 
float when you have older datafiles where the field is still a float, and 
you'll convert it to a double in case the datafiles have been written with the 
new schema. Fun thing, if you then evaluate `float.max + 1` against a float 
field, you can just return `AlwaysFalse()` without having to open the files at 
all :)
   
   I hope this helps, and let me know if there are any other questions, of if 
you like me to elaborate on something in particular.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] Discussion: Rethink `PrimitiveLiteral`. [iceberg-rust]

Reply via email to