fengjiajie commented on PR #8808:
URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1826731977

   > You can only guarantee this is safe for your data, for any other user this could be unsafe. That's the underlying issue with this PR, we are essentially allowing a cast binary as string.
   >
   > On Nov 24, 2023, at 4:47 AM, fengjiajie wrote:
   >
   > > > I'm also a little nervous about this change, how are we guaranteed that the binary is parsable as UTF8 bytes? Seems like we should just be fixing the type annotations rather than changing our readers to read files that have been written incorrectly?
   > >
   > > @RussellSpitzer Hi, can you please tell if this issue can be moved forward? We have a lot of hive tables that contain such parquet files and we are trying to convert these hive tables into iceberg tables, this process of parquet files cannot be rewritten (because of the large number of history files). We can guarantee that it could be parsed in UTF-8 because the data was originally defined as a string in hive. If it wasn't a string before, there's no reason defining it as a string when defining the iceberg table would make it fail to parse.
   
   @RussellSpitzer Thanks for the reply, but I still don't get it.
   
   * I don't quite understand why this is 'unsafe' for other users.
   * This conversion only happens if the user defines the Iceberg column as a string. Defining a column as string means the user wants to read its values as strings. The Iceberg specification requires strings to be UTF-8 encoded, and the library decodes them as UTF-8 accordingly. If the user only wants the column to be used as binary, they should define the Iceberg column as binary instead of string, and then no conversion takes place.
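
   To make the question concrete, here is a minimal sketch in plain Java of what "decode the binary as UTF-8" can mean. It is only an illustration, not the reader code in this PR: the class `Utf8DecodeSketch` and both method names are hypothetical. A strict decoder rejects bytes that are not well-formed UTF-8, while `new String(bytes, UTF_8)` silently substitutes U+FFFD for them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical illustration only; not part of Iceberg or of this PR.
final class Utf8DecodeSketch {

  // Strict decode: returns empty if the bytes are not well-formed UTF-8.
  static Optional<String> decodeStrict(byte[] bytes) {
    try {
      return Optional.of(
          StandardCharsets.UTF_8
              .newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(bytes))
              .toString());
    } catch (CharacterCodingException e) {
      return Optional.empty();
    }
  }

  // Lenient decode: malformed sequences are replaced with U+FFFD,
  // so the call never fails but the original bytes cannot be recovered.
  static String decodeLenient(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Bytes that came from a real string decode the same way in both modes.
    byte[] fromString = "hive string".getBytes(StandardCharsets.UTF_8);
    // 0xC3 opens a two-byte sequence, but 0x28 is not a continuation byte.
    byte[] arbitrary = {(byte) 0xC3, (byte) 0x28};

    System.out.println(decodeStrict(fromString).isPresent());
    System.out.println(decodeStrict(arbitrary).isPresent());
    System.out.println(decodeLenient(arbitrary).indexOf('\uFFFD') >= 0);
  }
}
```

   For columns that were written from Hive strings, both paths give the same result. They only differ when the bytes were never a UTF-8 string to begin with.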
   
   In short, the conversion only applies when the user has declared the column as a string and wants to use it as one. If you see a scenario where this is inappropriate, could you give an example?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

