Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2024-09-22 Thread via GitHub
github-actions[bot] commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-2367034156 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2024-05-07 Thread via GitHub
chenwyi2 commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-2097906401 > > > * It seems that ORC is not experiencing this issue because it creates value reader based on the iceberg column types. > > > * Avro reads the fields entirely based on the file typ

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2024-01-08 Thread via GitHub
openinx commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1880545421 @advancedxy Thanks for the context from spark. There was a mail discussing this issue, and we agreed that providing an option to promote binary to string is the correct approach. I think

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2024-01-04 Thread via GitHub
advancedxy commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1876803507 Spark adds a configuration to control whether to treat binary as string in parquet reader, see: https://github.com/apache/spark/blob/c8137960a0ba725d1633795a057c68f2bbef414b/sql/cataly

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-11-28 Thread via GitHub
Apache9 commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1829524111 > It still feels like we are adding a special workaround to Iceberg for something which shouldn't be happening in the first place. Can you explain the use case again? Why can't the upstre

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-11-27 Thread via GitHub
RussellSpitzer commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1829023255 It still feels like we are adding a special workaround to Iceberg for something which shouldn't be happening in the first place. Can you explain the use case again? Why can't the

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-11-27 Thread via GitHub
Apache9 commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1828987729 > > Anyway, the conversion is based on the fact that the user defines the column as string and wants to use it as a string. If you think there is an inappropriate scenario, could you give

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-11-27 Thread via GitHub
RussellSpitzer commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1828351888 > > Anyway, the conversion is based on the fact that the user defines the column as string and wants to use it as a string. If you think there is an inappropriate scenario, cou

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-11-26 Thread via GitHub
fengjiajie commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1826731977 > You can only guarantee this is safe for your data, for any other user this could be unsafe. That’s the underlying issue with this PR, we are essentially allowing a cast binary as str

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-11-25 Thread via GitHub
RussellSpitzer commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1826303884 You can only guarantee this is safe for your data, for any other user this could be unsafe. That’s the underlying issue with this PR, we are essentially allowing a cast binary as s

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-11-24 Thread via GitHub
fengjiajie commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1825487523 > I'm also a little nervous about this change, how are we guaranteed that the binary is parsable as UTF8 bytes? Seems like we should just be fixing the type annotations rather than cha

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-19 Thread via GitHub
fengjiajie commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1772038159 > My thoughts were more like, if we have a column defined as "double" we may allow "float" in the file definition but we wouldn't allow Binary. So how is this different? @Russel

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-19 Thread via GitHub
RussellSpitzer commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1771952316 My thoughts were more like, if we have a column defined as "double" we may allow "float" in the file definition but we wouldn't allow Binary. So how is this different? -- This i

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-19 Thread via GitHub
fengjiajie commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1771924888 > how are we guaranteed that the binary is parsable as UTF8 bytes? @RussellSpitzer Thank you for participating in the review. If a column is not encoded in UTF-8, it should not

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-19 Thread via GitHub
RussellSpitzer commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1771569346 I'm also a little nervous about this change, how are we guaranteed that the binary is parsable as UTF8 bytes? Seems like we should just be fixing the type annotations rather than c

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-19 Thread via GitHub
pvary commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1770228648 @RussellSpitzer: Could you please take a look at the Spark change? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use t

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-18 Thread via GitHub
fengjiajie commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1769826927 > > * It seems that ORC is not experiencing this issue because it creates value reader based on the iceberg column types. > > * Avro reads the fields entirely based on the file type,

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-18 Thread via GitHub
pvary commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1768291556 > * It seems that ORC is not experiencing this issue because it creates value reader based on the iceberg column types. > * Avro reads the fields entirely based on the file type, which se

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub
nastra commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1767781293 > Additionally, Iceberg has a UUID type, which seems to be supported in Spark but not in Flink: https://github.com/apache/iceberg/pull/7496 So Spark itself doesn't support UUIDs as a

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub
fengjiajie commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1767764825 > @fengjiajie: Checked the codepath for the Spark readers and I have 2 questions: > > * What about ORC and Avro files? Don't we have the same issue there? > * Would it worth t

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub
pvary commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1767664345 @fengjiajie: Checked the codepath for the Spark readers and I have 2 questions: - What about ORC and Avro files? Don't we have the same issue there? - Would it worth to add the same fi

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub
fengjiajie commented on code in PR #8808: URL: https://github.com/apache/iceberg/pull/8808#discussion_r1361899155 ## flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/data/TestFlinkParquetReader.java: ## @@ -81,26 +75,87 @@ public void testTwoLevelList() throws IOExceptio

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub
fengjiajie commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1766144239 > This is a small change, so it might not be too hard to keep the different Flink version changes in sync, but usually we introduce the changes on the latest Flink, and then create a d

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub
fengjiajie commented on code in PR #8808: URL: https://github.com/apache/iceberg/pull/8808#discussion_r1361854341 ## flink/v1.15/flink/src/main/java/org/apache/iceberg/flink/data/FlinkParquetReaders.java: ## @@ -262,7 +262,11 @@ public ParquetValueReader primitive( switch

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub
pvary commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1765954472 This is a small change, so it might not be too hard to keep the different Flink version changes in sync, but usually we introduce the changes on the latest Flink, and then create a differen

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub
pvary commented on code in PR #8808: URL: https://github.com/apache/iceberg/pull/8808#discussion_r1361742942 ## flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/data/TestFlinkParquetReader.java: ## @@ -81,26 +75,87 @@ public void testTwoLevelList() throws IOException {

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub
pvary commented on code in PR #8808: URL: https://github.com/apache/iceberg/pull/8808#discussion_r1361742600 ## flink/v1.15/flink/src/main/java/org/apache/iceberg/flink/data/FlinkParquetReaders.java: ## @@ -262,7 +262,11 @@ public ParquetValueReader primitive( switch (pri

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-16 Thread via GitHub
fengjiajie commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1765499071 > > Some systems like older versions of Impala do not annotate String type as UTF-8 columns in Parquet files. When importing these Parquet files into Iceberg, reading these Binary colu

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-16 Thread via GitHub
nastra commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1764681891 > Some systems like older versions of Impala do not annotate String type as UTF-8 columns in Parquet files. When importing these Parquet files into Iceberg, reading these Binary columns wi

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-13 Thread via GitHub
fengjiajie commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1762465495 @nastra Thank you for taking the time to review my code -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-13 Thread via GitHub
nastra commented on code in PR #8808: URL: https://github.com/apache/iceberg/pull/8808#discussion_r1358331014 ## flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/data/TestFlinkParquetReader.java: ## @@ -90,17 +92,72 @@ public void testTwoLevelList() throws IOException {

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-13 Thread via GitHub
nastra commented on code in PR #8808: URL: https://github.com/apache/iceberg/pull/8808#discussion_r1358327260 ## flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/data/TestFlinkParquetReader.java: ## @@ -90,17 +92,72 @@ public void testTwoLevelList() throws IOException {

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-13 Thread via GitHub
nastra commented on code in PR #8808: URL: https://github.com/apache/iceberg/pull/8808#discussion_r1358326874 ## flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/data/TestFlinkParquetReader.java: ## @@ -90,17 +92,72 @@ public void testTwoLevelList() throws IOException {

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-13 Thread via GitHub
nastra commented on code in PR #8808: URL: https://github.com/apache/iceberg/pull/8808#discussion_r1358102450 ## flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/data/TestFlinkParquetReader.java: ## @@ -98,6 +100,62 @@ public void testTwoLevelList() throws IOException {

Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-13 Thread via GitHub
nastra commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1761210176 @fengjiajie thanks for working on this. Could you please add a test that reproduces the issue? -- This is an automated message from the Apache Git Service. To respond to the message, ple