wypoon commented on code in PR #11247: URL: https://github.com/apache/iceberg/pull/11247#discussion_r1792370324
##########
spark/v3.5/spark/src/test/resources/decimal_dict_and_plain_encoding.parquet:
##########

Review Comment:
@nastra the code in `iceberg-parquet` still has to use the `parquet-java` code underneath to do the actual writing, and with the v1 writer, dictionary encoding does not appear to be supported for fixed_len_byte_array. When I use

```
Schema schema =
    new Schema(Types.NestedField.required(1, "dec_38_0", Types.DecimalType.of(38, 0)));
File parquetFile = File.createTempFile("junit", null, temp.toFile());
assertThat(parquetFile.delete()).as("Delete should succeed").isTrue();
Iterable<GenericData.Record> records = RandomData.generate(schema, 500, 0L, 0.0f);
try (FileAppender<GenericData.Record> writer =
    Parquet.write(Files.localOutput(parquetFile))
        .schema(schema)
        .set(PARQUET_DICT_SIZE_BYTES, "2048")
        .set(PARQUET_PAGE_ROW_LIMIT, "100")
        .build()) {
  writer.addAll(records);
}
```

that writes a Parquet file with

```
Column: dec_38_0
--------------------------------------------------------------------------------
   page   type  enc  count   avg size   size       rows     nulls   min / max
   0-0    data  G _  100     16.00 B    1.563 kB
   0-1    data  G _  100     16.00 B    1.563 kB
   0-2    data  G _  100     16.00 B    1.563 kB
   0-3    data  G _  100     16.00 B    1.563 kB
   0-4    data  G _  100     16.00 B    1.563 kB
```

while if I use v2:

```
try (FileAppender<GenericData.Record> writer =
    Parquet.write(Files.localOutput(parquetFile))
        .schema(schema)
        .set(PARQUET_DICT_SIZE_BYTES, "2048")
        .set(PARQUET_PAGE_ROW_LIMIT, "100")
        .writerVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
        .build()) {
  writer.addAll(records);
}
```

that writes a Parquet file with

```
Column: dec_38_0
--------------------------------------------------------------------------------
   page   type  enc  count   avg size   size       rows     nulls   min / max
   0-D    dict  G _  96      16.00 B    1.500 kB
   0-1    data  _ R  100     0.93 B     93 B       100      0
   0-2    data  _ D  100     15.56 B    1.520 kB   100      0
   0-3    data  _ D  100     14.76 B    1.441 kB   100      0
   0-4    data  _ D  100     15.47 B    1.511 kB   100      0
   0-5    data  _ D  100     15.06 B    1.471 kB   100      0
```

As you can see, I am using the `iceberg-parquet` APIs and generating the same data in both cases, with the same dictionary size and page row limit. In the v1 case, plain encoding is used for all the pages, while in the v2 case, one page is written with dictionary encoding (unfortunately, the other pages are written with DELTA_BYTE_ARRAY encoding).
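(Aside: the page dumps above look like the output of parquet-cli's `pages` command. If anyone wants to double-check the encodings without dumping pages, here is a minimal sketch using the standard `parquet-hadoop` footer API; the class name and the use of `args[0]` for the file path are just for illustration:)

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

// Hypothetical helper, not part of the PR: prints the encodings recorded in the
// footer for every column chunk of a Parquet file.
public class PrintEncodings {
  public static void main(String[] args) throws IOException {
    Path path = new Path(args[0]); // path to the file written above
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          // For the v1 file above this should show PLAIN (no dictionary encoding);
          // for the v2 file it should include RLE_DICTIONARY and DELTA_BYTE_ARRAY.
          System.out.println(column.getPath() + ": " + column.getEncodings());
        }
      }
    }
  }
}
```

Note this only reports the set of encodings per column chunk rather than per page, so the page dump remains the more detailed view, but it is enough to confirm whether a dictionary encoding was ever used for the fixed_len_byte_array column.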