wypoon commented on code in PR #11247: URL: https://github.com/apache/iceberg/pull/11247#discussion_r1792370324
##########
spark/v3.5/spark/src/test/resources/decimal_dict_and_plain_encoding.parquet:
##########

Review Comment:
@nastra the code in `iceberg-parquet` still has to use the `parquet-java` code underneath to do the actual writing, and with the v1 writer, dictionary encoding does not appear to be supported for fixed_len_byte_array. When I use

```
Schema schema =
    new Schema(Types.NestedField.required(1, "dec_38_0", Types.DecimalType.of(38, 0)));
File parquetFile = File.createTempFile("junit", null, temp.toFile());
assertThat(parquetFile.delete()).as("Delete should succeed").isTrue();
Iterable<GenericData.Record> records = RandomData.generate(schema, 500, 0L, 0.0f);
try (FileAppender<GenericData.Record> writer =
    Parquet.write(Files.localOutput(parquetFile))
        .schema(schema)
        .set(PARQUET_DICT_SIZE_BYTES, "2048")
        .set(PARQUET_PAGE_ROW_LIMIT, "100")
        .build()) {
  writer.addAll(records);
}
```

that writes a Parquet file with

```
Column: dec_38_0
--------------------------------------------------------------------------------
   page   type  enc  count   avg size   size       rows     nulls   min / max
   0-0    data  G _  100     16.00 B    1.563 kB
   0-1    data  G _  100     16.00 B    1.563 kB
   0-2    data  G _  100     16.00 B    1.563 kB
   0-3    data  G _  100     16.00 B    1.563 kB
   0-4    data  G _  100     16.00 B    1.563 kB
```

while if I use v2:

```
try (FileAppender<GenericData.Record> writer =
    Parquet.write(Files.localOutput(parquetFile))
        .schema(schema)
        .set(PARQUET_DICT_SIZE_BYTES, "2048")
        .set(PARQUET_PAGE_ROW_LIMIT, "100")
        .writerVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
        .build()) {
  writer.addAll(records);
}
```

that writes a Parquet file with

```
Column: dec_38_0
--------------------------------------------------------------------------------
   page   type  enc  count   avg size   size       rows     nulls   min / max
   0-D    dict  G _  96      16.00 B    1.500 kB
   0-1    data  _ R  100     0.93 B     93 B       100      0
   0-2    data  _ D  100     15.56 B    1.520 kB   100      0
   0-3    data  _ D  100     14.76 B    1.441 kB   100      0
   0-4    data  _ D  100     15.47 B    1.511 kB   100      0
   0-5    data  _ D  100     15.06 B    1.471 kB   100      0
```

As you can see, I am using the `iceberg-parquet` APIs and generating the same data in both cases, with the same dictionary size and page row limit. In the v1 case, plain encoding is used for all the pages, while in the v2 case, one page is written with dictionary encoding (unfortunately, the other pages are written with DELTA_BYTE_ARRAY encoding).
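(Aside: the page dumps above look like the output of parquet-cli's `pages` command. If anyone wants to double-check the encodings without dumping pages, here is a minimal sketch using the standard `parquet-hadoop` footer API; the class name and the use of `args[0]` for the file path are just for illustration:)

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

// Hypothetical helper, not part of the PR: prints the encodings recorded in the
// footer for every column chunk of a Parquet file.
public class PrintEncodings {
  public static void main(String[] args) throws IOException {
    Path path = new Path(args[0]); // path to the file written above
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          // For the v1 file above this should show PLAIN (no dictionary encoding);
          // for the v2 file it should include RLE_DICTIONARY and DELTA_BYTE_ARRAY.
          System.out.println(column.getPath() + ": " + column.getEncodings());
        }
      }
    }
  }
}
```

Note this only reports the set of encodings per column chunk rather than per page, so the page dump remains the more detailed view, but it is enough to confirm whether a dictionary encoding was ever used for the fixed_len_byte_array column.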