Re: [PR] Add table property to disable/enable parquet column statistics #12770 [iceberg]

via GitHub Tue, 15 Apr 2025 15:52:56 -0700


dramaticlly commented on code in PR #12771:
URL: https://github.com/apache/iceberg/pull/12771#discussion_r2045632526



##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -174,6 +174,13 @@ private TableProperties() {}
   public static final String PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX =
       "write.parquet.bloom-filter-enabled.column.";
 
+  public static final String PARQUET_COLUMN_STATS_ENABLED_PREFIX =
+      "write.parquet.stats-enabled.column.";
+
+  public static final String DEFAULT_PARQUET_COLUMN_STATS_ENABLED =
+      PARQUET_COLUMN_STATS_ENABLED_PREFIX + "default";
+  public static final boolean DEFAULT_PARQUET_COLUMN_STATS_ENABLED_DEFAULT = true;

Review Comment:
   How about these names instead?
   - PARQUET_COLUMN_STATS_ENABLED
   - PARQUET_COLUMN_STATS_ENABLED_DEFAULT



##########
parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java:
##########
@@ -219,6 +221,50 @@ public void testTwoLevelList() throws IOException {
     assertThat(recordRead.get("topbytes")).isEqualTo(expectedBinary);
   }
 
+  @Test
+  public void testColumnStatisticsEnabled() throws Exception {
+    Schema schema =
+        new Schema(
+            optional(1, "int_field", IntegerType.get()),
+            optional(2, "string_field", Types.StringType.get()));
+
+    File file = createTempFile(temp);
+
+    List<GenericData.Record> records = Lists.newArrayListWithCapacity(5);
+    org.apache.avro.Schema avroSchema = AvroSchemaUtil.convert(schema.asStruct());
+    for (int i = 1; i <= 5; i++) {
+      GenericData.Record record = new GenericData.Record(avroSchema);
+      record.put("int_field", i);
+      record.put("string_field", "test");
+      records.add(record);
+    }
+
+    write(
+        file,
+        schema,
+        ImmutableMap.<String, String>builder()
+            .put(PARQUET_COLUMN_STATS_ENABLED_PREFIX + "int_field", "true")

Review Comment:
   Can we also test the default behavior?
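
To make the cases such a test would need to cover concrete, here is a minimal plain-Java sketch of the lookup. It is not the PR's implementation: `StatsFlagResolver` and `isStatsEnabled` are hypothetical names, and the fallback order (per-column key, then the `default` key, then `true`) is an assumption read off the property names and default value in this diff.

```java
import java.util.Map;

// Hypothetical helper, not part of Iceberg: resolves whether Parquet column
// statistics are enabled for one column from a table-property map.
class StatsFlagResolver {
  // Property names and default value copied from the diff under review.
  static final String PREFIX = "write.parquet.stats-enabled.column.";
  static final String DEFAULT_KEY = PREFIX + "default";
  static final boolean DEFAULT_VALUE = true;

  // Assumed lookup order: per-column key, then the table-wide default key,
  // then the hard-coded default.
  static boolean isStatsEnabled(Map<String, String> props, String column) {
    String perColumn = props.get(PREFIX + column);
    if (perColumn != null) {
      return Boolean.parseBoolean(perColumn);
    }
    String tableDefault = props.get(DEFAULT_KEY);
    if (tableDefault != null) {
      return Boolean.parseBoolean(tableDefault);
    }
    return DEFAULT_VALUE;
  }

  public static void main(String[] args) {
    // No properties set: falls through to the hard-coded default.
    System.out.println(isStatsEnabled(Map.of(), "int_field"));
    // Table-wide default disabled, one column re-enabled.
    Map<String, String> props =
        Map.of(DEFAULT_KEY, "false", PREFIX + "int_field", "true");
    System.out.println(isStatsEnabled(props, "int_field"));
    System.out.println(isStatsEnabled(props, "string_field"));
  }
}
```

Under these assumptions the interesting test cases are: no property set, only the `default` key set, and a per-column key that overrides the `default` key.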



##########
docs/docs/configuration.md:
##########
@@ -52,6 +52,8 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.bloom-filter-enabled.column.col1       | (not set)                   | Hint to parquet to write a bloom filter for the column: 'col1' |
 | write.parquet.bloom-filter-max-bytes                 | 1048576 (1 MB)              | The maximum number of bytes for a bloom filter bitset |
 | write.parquet.bloom-filter-fpp.column.col1           | 0.01                        | The false positive probability for a bloom filter applied to 'col1' (must > 0.0 and < 1.0) |
+| write.parquet.stats-enabled.column.default           | true                        | Default flag to enable parquet column statistics for all columns in the table |
+| write.parquet.stats-enabled.column.col1              | (not set)                   | Flag to enable parquet column statistics for column 'col1' to allow per-column tuning |

Review Comment:
   What are the allowed values for `write.parquet.stats-enabled.column.col1`? Is it just true or false?
   
   I am thinking of something like:
   ```Controls whether to collect parquet column statistics for column 'col1'```
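
The reason the allowed values are worth spelling out is that Java's own boolean parsing is lenient. The sketch below is only an illustration, not a statement of how Iceberg reads this property: `Boolean.parseBoolean` maps every string other than a case-insensitive "true" to `false`, while the hypothetical `parseStrict` rejects anything that is not "true" or "false".

```java
// Hypothetical example class, not part of Iceberg.
class StatsFlagParsing {
  // Strict variant: accepts only "true" or "false" (case-insensitive) and
  // throws on anything else instead of silently returning false.
  static boolean parseStrict(String value) {
    if ("true".equalsIgnoreCase(value)) {
      return true;
    }
    if ("false".equalsIgnoreCase(value)) {
      return false;
    }
    throw new IllegalArgumentException("Expected true or false, got: " + value);
  }

  public static void main(String[] args) {
    // Lenient JDK parsing: a typo such as "ture" quietly disables the flag.
    System.out.println(Boolean.parseBoolean("ture"));
    // Strict parsing surfaces the same typo as an error.
    try {
      parseStrict("ture");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```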



##########
parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java:
##########
@@ -306,33 +308,29 @@ private WriteBuilder createContextFunc(
       return this;
     }
 
+    // Utility method to get the column path
+    private String getParquetColumnPath(Map<Integer, String> fieldIdToParquetPath, String colPath) {
+      Types.NestedField fieldId = schema.findField(colPath);
+      if (fieldId == null) {
+        return null;
+      }
+
+      return fieldIdToParquetPath.get(fieldId.fieldId());
+    }
+
     private void setBloomFilterConfig(
         Context context,
-        MessageType parquetSchema,
+        Map<Integer, String> fieldIdToParquetPath,
         BiConsumer<String, Boolean> withBloomFilterEnabled,
         BiConsumer<String, Double> withBloomFilterFPP) {
 
-      Map<Integer, String> fieldIdToParquetPath =
-          parquetSchema.getColumns().stream()
-              .filter(col -> col.getPrimitiveType().getId() != null)
-              .collect(
-                  Collectors.toMap(
-                      col -> col.getPrimitiveType().getId().intValue(),
-                      col -> String.join(".", col.getPath())));
-
       context
           .columnBloomFilterEnabled()
           .forEach(
               (colPath, isEnabled) -> {
-                Types.NestedField fieldId = schema.findField(colPath);
-                if (fieldId == null) {
-                  LOG.warn("Skipping bloom filter config for missing field: {}", colPath);

Review Comment:
   I think we dropped this warning in the refactor. I want to make sure that is expected.
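
If the warning is meant to stay, one option is to emit it from the shared lookup helper. This is a self-contained sketch under assumptions, not the PR's code: `ColumnPathLookup`, the `nameToFieldId` map (standing in for `schema.findField`), the `warnings` list (standing in for `LOG.warn`), and the `configName` parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the refactored lookup; not Iceberg code.
class ColumnPathLookup {
  // Stand-in for schema.findField(colPath): column name -> field ID.
  private final Map<String, Integer> nameToFieldId;
  // Same shape as the fieldIdToParquetPath map in the diff.
  private final Map<Integer, String> fieldIdToParquetPath;
  // Stand-in for LOG.warn so the example has no logging dependency.
  final List<String> warnings = new ArrayList<>();

  ColumnPathLookup(
      Map<String, Integer> nameToFieldId, Map<Integer, String> fieldIdToParquetPath) {
    this.nameToFieldId = nameToFieldId;
    this.fieldIdToParquetPath = fieldIdToParquetPath;
  }

  // Returns the Parquet path for colPath, or null (after recording a warning)
  // when the column is not in the schema. configName lets bloom filter and
  // stats callers share one message format.
  String getParquetColumnPath(String colPath, String configName) {
    Integer fieldId = nameToFieldId.get(colPath);
    if (fieldId == null) {
      warnings.add("Skipping " + configName + " config for missing field: " + colPath);
      return null;
    }
    return fieldIdToParquetPath.get(fieldId);
  }

  public static void main(String[] args) {
    ColumnPathLookup lookup =
        new ColumnPathLookup(Map.of("int_field", 1), Map.of(1, "int_field"));
    System.out.println(lookup.getParquetColumnPath("int_field", "bloom filter"));
    System.out.println(lookup.getParquetColumnPath("missing", "bloom filter"));
    System.out.println(lookup.warnings);
  }
}
```

Logging inside the helper keeps the message in one place for both the bloom filter and the column stats paths. The alternative is to have each caller log when it receives `null`.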



