Apache9 commented on PR #8808:
URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1828987729

   > > Anyway, the conversion is based on the fact that the user defines the 
column as string and wants to use it as a string. If you think there is an 
inappropriate scenario, could you give an example?
   > 
   > User sets the column as a string but the data is not UTF-8 encoded. Or 
worse, some files do have UTF-8 encoded binary and others do not.
   
   I think the logic here is straightforward for our users? If your data is 
not a string, then you should not annotate it as a string, right? If users define 
it as a string, it is the users' duty to make sure that the binary is a string.
   
   And about the 'unsafe' problem.
   
   Without this PR, if users annotate binary data as a string, we just 
return the byte array directly and the upper layer crashes because of a type 
mismatch. With this PR, if the data is UTF-8 encoded, we are happy, as 
our code now passes. If the data is not UTF-8 encoded, we still crash, which 
is the same result as before this PR.
   
   So in general, the PR here does not add new crashing scenarios right?
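   
   To make the "not UTF-8 encoded" case concrete, here is a minimal sketch of 
a strict decode. It is not code from this PR; `StrictUtf8Sketch` and 
`decodeStrict` are hypothetical names. One caveat: whether bad bytes actually 
fail depends on how the decode is done, since `new String(bytes, UTF_8)` replaces 
malformed input with U+FFFD instead of throwing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the PR or of Iceberg.
final class StrictUtf8Sketch {
  private StrictUtf8Sketch() {}

  // Decodes bytes as UTF-8 and fails on malformed input, rather than
  // silently substituting U+FFFD the way new String(bytes, UTF_8) does.
  static String decodeStrict(byte[] bytes) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
      return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    } catch (CharacterCodingException e) {
      // Unchecked wrapper chosen for the sketch; a real reader may differ.
      throw new IllegalArgumentException("binary value is not valid UTF-8", e);
    }
  }

  public static void main(String[] args) {
    byte[] valid = "iceberg".getBytes(StandardCharsets.UTF_8);
    System.out.println(decodeStrict(valid));

    // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
    byte[] invalid = {(byte) 0xC3, (byte) 0x28};
    try {
      decodeStrict(invalid);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```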
   
   Or at least, maybe we could introduce a fallback option here? If binary 
data is annotated as a string but no encoding information is provided, we should use 
the fallback encoding to decode it?
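   
   Roughly, the fallback idea could look like the sketch below. All names 
here (`FallbackDecodeSketch`, `decode`, the `declared` and `fallback` parameters) 
are hypothetical and only illustrate the shape of the option; where the fallback 
would be configured is left open.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of the PR or of Iceberg.
final class FallbackDecodeSketch {
  private FallbackDecodeSketch() {}

  // Uses the declared encoding when there is one, otherwise the fallback.
  // Bytes that are invalid in the chosen charset are rejected, not replaced.
  static String decode(byte[] bytes, Optional<Charset> declared, Charset fallback) {
    Charset charset = declared.orElse(fallback);
    try {
      return charset.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes))
          .toString();
    } catch (CharacterCodingException e) {
      throw new IllegalArgumentException("binary value is not valid " + charset.name(), e);
    }
  }

  public static void main(String[] args) {
    // No encoding information: the fallback (assumed ISO-8859-1 here) is used.
    byte[] latin1 = {(byte) 0xE9};
    System.out.println(decode(latin1, Optional.empty(), StandardCharsets.ISO_8859_1));

    // Encoding information present: the fallback is ignored.
    byte[] utf8 = "abc".getBytes(StandardCharsets.UTF_8);
    System.out.println(decode(utf8, Optional.of(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1));
  }
}
```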
   
   WDYT? @RussellSpitzer 
   
   Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

