Fokko commented on issue #159: URL: https://github.com/apache/iceberg-rust/issues/159#issuecomment-1883333878
> Sometimes it's self contained, for simple types such as int, date, string. > Sometimes it's not self contained, for example decimal. What do you mean by self-contained? For example, why is a date self-contained, and a decimal not? Let me elaborate on how we do things in PyIceberg, because I think the approach we have there is quite similar, but I don't think we need to have two distinct types of literals. I just want to make sure that we make the right decision here, because this will be hard to undo later on. Please bear with me here, since Rust is not OOP, I might not be able to comprehend the whole issue here 😛 But I'm sure that this might help as some food for thought and help us to find the right solution. Within PyIceberg we have `Literal` [ABC](https://docs.python.org/3/library/abc.html) that has all the primitive (str, int, bytes, float, double, decimal, etc), but also the logical ones (date, uuid, etc). When you first do the expression `a < 3.25` we have the [function `literal(val: Any)`](https://github.com/apache/iceberg-python/blob/67c07f2ea745ac45ff7930ecd6fc5b98b88b5328/pyiceberg/expressions/literals.py#L122-L149) that's called by the parser. This will result in `LessThan(Ref('a'), FloatLiteral(3.25))` in this case because that's the representation in Python (`type(3.25) => <class 'float'>`). But the point is that we don't care at this moment because we don't know yet the type of the schema. Once we start planning the query, we'll look up the type of `Ref('a')` in the schema, and we find out that it is a Decimal. What we do then, is `FloatLiteral(3.25).to(DecimalType())` and we try to parse it to a decimal. If it fails, we'll raise an exception. I think this goes back to an earlier discussion: ``` pub enum Datum { bool(bool), int(i32), ... datetime(DateTime<Utc>), decimal(Decimal), } ``` Internally, the DateTime in Iceberg is milliseconds since epoch. And I think that also should be the internal representation for simplicity and performance reasons. What I would suggest is to have a similar structure where you can just accept a `DateTime<Utc>`, `i32`, `i64` that later can be mapped to the actual type. This way you give the user the freedom to use the type that they like, but also the flexibility to use the same structure when you parse SQL (where you can just assume that everything is a string, and you'll parse it to the correct type when binding). > And we also should deal with the case that if Decimal(Decimal) is i256. With the approach mentioned above, I don't think this is a problem, since we know the size of the Decimal when binding to the type (we just take the size that's appropriate for the precision of the type). Looking ahead, also nanosecond timestamps are being added with just the physical type i64, but there is a fair chance that i128 might be added later. Another similar situation that tackles this approach is schema evolution. Let's say that at some point you promote a float to double field. In that case, you want to bind to the _file-schema_, this way you'll convert the literal to a float when you have older datafiles where the field is still a float, and you'll convert it to a double in case the datafiles have been written with the new schema. Fun thing, if you then evaluate `float.max + 1` against a float field, you can just return `AlwaysFalse()` without having to open the files at all :) I hope this helps, and let me know if there are any other questions, of if you like me to elaborate on something in particular. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org