Apache9 commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1828987729
> > Anyway, the conversion is based on the fact that the user defines the column as string and wants to use it as a string. If you think there is an inappropriate scenario, could you give an example?
>
> User sets the column as a string but the data is not UTF-8 encoded. Or worse, some files do have UTF-8 encoded binary and others do not.

I think the logic here is straightforward for our users? If your data is not a string, then you should not annotate it as a string, right? If users define it as a string, it is the users' duty to make sure that the binary is a string.

And about the 'unsafe' problem: without this PR, if users annotate binary data as a string, the reader just returns the byte array directly and the upper layer crashes because of a type mismatch. With this PR, if the data is UTF-8 encoded, we are happy because our code passes now. If the data is not UTF-8 encoded, we crash, which is the same result as before this PR. So in general, this PR does not add new crashing scenarios, right?

Or at least, maybe we could introduce a fallback option here? If binary data is annotated as a string but no encoding information is provided, we would use the fallback encoding to decode it. WDYT? @RussellSpitzer

Thanks.
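
To make the proposed behavior concrete, here is a minimal sketch in Python rather than the PR's actual code. The function name `decode_string_column` and the `fallback_encoding` parameter are hypothetical and only illustrate the idea: decode strictly as UTF-8, and consult a fallback encoding only if one has been configured.

```python
from typing import Optional


def decode_string_column(raw: bytes, fallback_encoding: Optional[str] = None) -> str:
    """Decode a binary value whose column the user has declared as a string.

    Hypothetical helper for illustration only; it is not part of any
    Iceberg or Parquet API.
    """
    try:
        # Strict decoding: invalid UTF-8 raises UnicodeDecodeError
        # instead of silently producing replacement characters.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if fallback_encoding is None:
            # No fallback configured: fail, as the read would have
            # failed (with a type mismatch) without the conversion.
            raise
        # Assumed fallback option: try the user-supplied encoding.
        return raw.decode(fallback_encoding)
```

With no fallback, a value such as `b"\xe9"` still raises an error; with `fallback_encoding="latin-1"` it decodes to `"é"`. Note that a fallback of this kind cannot detect files that mix encodings, which is the scenario raised in the quoted comment.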
