javrasya commented on PR #9464: URL: https://github.com/apache/iceberg/pull/9464#issuecomment-1890390386
Good catches @pvary, thank you. What if we take full inspiration from writeUTF and write our own writer that supports longer JSON? By the way, writeUTF limits the size to 65K because the first 2 bytes of the serialized value hold the length of the UTF string as an unsigned short, which can be at most 65,535.

I have introduced my own version of writeUTF and called it writeLongUTF/readLongUTF. It writes the length in the leading bytes as an int, which is 4 bytes, instead of an unsigned short. Do you mind taking a look at [those changes here](https://github.com/apache/iceberg/compare/main...javrasya:iceberg:issue-9410-implement-custom-utf-serde) and letting me know what you think? I didn't want to update this PR directly without talking to you about it first. If you think it is a good idea, I can merge it into this branch first and we can continue the discussion here.

However, it is not compatible with V2, since V2 uses the initial 2 bytes to indicate the length. Introducing V3 is a good idea, as you suggested, but I am not really sure how we would distinguish a split serialized earlier with V2 from one serialized with V3 🤔 Do you know how this was done from V1 to V2? Can you help me there?
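To make the length-prefix idea concrete, here is a minimal sketch of it. This is not the code on the linked branch: the class name `LongUtfSketch` and the helpers `serialize`, `deserialize`, `writeUtfRejects` and `repeat` are hypothetical, and for brevity it encodes with standard UTF-8 rather than the modified UTF-8 that `writeUTF` uses. It only illustrates swapping the 2-byte unsigned short length for a 4-byte int.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LongUtfSketch {

  // Writes a 4-byte int length followed by the encoded bytes.
  // Assumption: standard UTF-8 here, not the modified UTF-8 of writeUTF.
  static void writeLongUTF(DataOutputStream out, String value) throws IOException {
    byte[] encoded = value.getBytes(StandardCharsets.UTF_8);
    out.writeInt(encoded.length);
    out.write(encoded);
  }

  // Reads the 4-byte length, then exactly that many bytes.
  static String readLongUTF(DataInputStream in) throws IOException {
    int length = in.readInt();
    if (length < 0) {
      throw new IOException("Negative string length: " + length);
    }
    byte[] encoded = new byte[length];
    in.readFully(encoded);
    return new String(encoded, StandardCharsets.UTF_8);
  }

  // Hypothetical convenience wrapper around writeLongUTF.
  static byte[] serialize(String value) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buffer);
      writeLongUTF(out, value);
      out.flush();
      return buffer.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Hypothetical convenience wrapper around readLongUTF.
  static String deserialize(byte[] bytes) {
    try {
      return readLongUTF(new DataInputStream(new ByteArrayInputStream(bytes)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Returns true when the JDK's writeUTF refuses the value because its
  // encoded form does not fit in the 2-byte unsigned length (max 65535).
  static boolean writeUtfRejects(String value) {
    try {
      new DataOutputStream(new ByteArrayOutputStream()).writeUTF(value);
      return false;
    } catch (UTFDataFormatException e) {
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Builds a string of `count` copies of `c`.
  static String repeat(char c, int count) {
    char[] chars = new char[count];
    Arrays.fill(chars, c);
    return new String(chars);
  }

  public static void main(String[] args) {
    String longJson = repeat('x', 70_000);
    System.out.println(writeUtfRejects(longJson)); // prints "true"
    System.out.println(deserialize(serialize(longJson)).equals(longJson)); // prints "true"
  }
}
```

Because the first four bytes are now an int rather than a two-byte length, bytes produced this way cannot be read by a reader that expects the writeUTF layout, which is the V2 incompatibility mentioned above.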