RussellSpitzer commented on code in PR #11247:
URL: https://github.com/apache/iceberg/pull/11247#discussion_r1809238713


##########
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/data/parquet/vectorized/TestParquetDictionaryEncodedVectorizedReads.java:
##########
@@ -93,4 +125,64 @@ public void testMixedDictionaryNonDictionaryReads() throws IOException {
         true,
         BATCH_SIZE);
   }
+
+  @Test
+  public void testBinaryNotAllPagesDictionaryEncoded() throws IOException {
+    Schema schema = new Schema(Types.NestedField.required(1, "bytes", Types.BinaryType.get()));
+    File parquetFile = File.createTempFile("junit", null, temp.toFile());
+    assertThat(parquetFile.delete()).as("Delete should succeed").isTrue();
+
+    Iterable<GenericData.Record> records = RandomData.generateFallbackData(schema, 500, 0L, 100);
+    try (FileAppender<GenericData.Record> writer =
+        Parquet.write(Files.localOutput(parquetFile))
+            .schema(schema)
+            .set(PARQUET_DICT_SIZE_BYTES, "4096")
+            .set(PARQUET_PAGE_ROW_LIMIT, "100")
+            .build()) {
+      writer.addAll(records);
+    }
+
+    // After the above, parquetFile contains one column chunk of binary data in five pages,
+    // the first two RLE dictionary encoded, and the remaining three plain encoded.
+    assertRecordsMatch(schema, 500, records, parquetFile, true, BATCH_SIZE);
+  }
+
+  /**
+   * decimal_dict_and_plain_encoding.parquet contains one column chunk of decimal(38, 0) data in
+   * two pages, one RLE dictionary encoded and one plain encoded, each with 200 rows.
+   */
+  @Test
+  public void testDecimalNotAllPagesDictionaryEncoded() throws Exception {
+    Schema schema = new Schema(Types.NestedField.required(1, "id", Types.DecimalType.of(38, 0)));
+    Path path =
+        Paths.get(
+            getClass()
+                .getClassLoader()
+                .getResource("decimal_dict_and_plain_encoding.parquet")
+                .toURI());

Review Comment:
   OK, so reading through those threads: the issue is that only a Parquet V2 writer can produce the file you want, but we don't actually support reading V2 Parquet files in Iceberg's vectorized readers. I know Spark now supports this in its readers, so we should probably fix up our read code as well.
   
   I think it's fine to keep this checked-in file as a temporary test fixture, but long term we should fix our V2 writers/readers so that this can all be self-contained.
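   
   For reference, here is a rough sketch of how such a file might be regenerated with parquet-mr's V2 writer once our readers handle it. This is illustrative only: the class name, value distribution, and size limits are my assumptions, and the thresholds would likely need tuning to reproduce the exact two-page layout of the checked-in file.
   
   ```java
   // Hypothetical generator (not part of this PR): write decimal(38, 0) data with
   // parquet-mr's V2 writer so that dictionary-encoded pages fall back to a
   // non-dictionary encoding partway through the column chunk.
   import java.math.BigInteger;
   import org.apache.hadoop.fs.Path;
   import org.apache.parquet.column.ParquetProperties.WriterVersion;
   import org.apache.parquet.example.data.Group;
   import org.apache.parquet.example.data.simple.SimpleGroupFactory;
   import org.apache.parquet.hadoop.ParquetWriter;
   import org.apache.parquet.hadoop.example.ExampleParquetWriter;
   import org.apache.parquet.io.api.Binary;
   import org.apache.parquet.schema.LogicalTypeAnnotation;
   import org.apache.parquet.schema.MessageType;
   import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
   import org.apache.parquet.schema.Types;
   
   public class GenerateDecimalDictAndPlainFile {
     public static void main(String[] args) throws Exception {
       // decimal(38, 0) stored as a 16-byte FIXED_LEN_BYTE_ARRAY
       MessageType schema =
           Types.buildMessage()
               .required(PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY)
               .length(16)
               .as(LogicalTypeAnnotation.decimalType(0, 38))
               .named("id")
               .named("test");
   
       SimpleGroupFactory groups = new SimpleGroupFactory(schema);
       try (ParquetWriter<Group> writer =
           ExampleParquetWriter.builder(new Path("decimal_dict_and_plain_encoding.parquet"))
               .withType(schema)
               // the V1 writer never dictionary-encodes FIXED_LEN_BYTE_ARRAY, so V2 is required
               .withWriterVersion(WriterVersion.PARQUET_2_0)
               .withDictionaryPageSize(512) // small, so the dictionary overflows quickly
               .withPageRowCountLimit(200) // two pages of 200 rows each
               .build()) {
         for (int i = 0; i < 400; i++) {
           // first 200 rows repeat 10 values (dictionary stays small); the next 200
           // are distinct, overflowing the dictionary and forcing the fallback encoding
           byte[] unscaled = BigInteger.valueOf(i < 200 ? i % 10 : i).toByteArray();
           byte[] padded = new byte[16]; // left-pad the unscaled value to 16 bytes
           System.arraycopy(unscaled, 0, padded, 16 - unscaled.length, unscaled.length);
           writer.write(groups.newGroup().append("id", Binary.fromConstantByteArray(padded)));
         }
       }
     }
   }
   ```
   
   Once the vectorized readers handle the V2 encodings, something like this could run at test setup time and the binary test resource could go away.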


