dramaticlly commented on code in PR #12771:
URL: https://github.com/apache/iceberg/pull/12771#discussion_r2045632526

##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -174,6 +174,13 @@ private TableProperties() {}
   public static final String PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX =
       "write.parquet.bloom-filter-enabled.column.";

+  public static final String PARQUET_COLUMN_STATS_ENABLED_PREFIX =
+      "write.parquet.stats-enabled.column.";
+
+  public static final String DEFAULT_PARQUET_COLUMN_STATS_ENABLED =
+      PARQUET_COLUMN_STATS_ENABLED_PREFIX + "default";
+  public static final boolean DEFAULT_PARQUET_COLUMN_STATS_ENABLED_DEFAULT = true;

Review Comment:
   how about
   - PARQUET_COLUMN_STATS_ENABLED
   - PARQUET_COLUMN_STATS_ENABLED_DEFAULT

##########
parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java:
##########
@@ -219,6 +221,50 @@ public void testTwoLevelList() throws IOException {
     assertThat(recordRead.get("topbytes")).isEqualTo(expectedBinary);
   }

+  @Test
+  public void testColumnStatisticsEnabled() throws Exception {
+    Schema schema =
+        new Schema(
+            optional(1, "int_field", IntegerType.get()),
+            optional(2, "string_field", Types.StringType.get()));
+
+    File file = createTempFile(temp);
+
+    List<GenericData.Record> records = Lists.newArrayListWithCapacity(5);
+    org.apache.avro.Schema avroSchema = AvroSchemaUtil.convert(schema.asStruct());
+    for (int i = 1; i <= 5; i++) {
+      GenericData.Record record = new GenericData.Record(avroSchema);
+      record.put("int_field", i);
+      record.put("string_field", "test");
+      records.add(record);
+    }
+
+    write(
+        file,
+        schema,
+        ImmutableMap.<String, String>builder()
+            .put(PARQUET_COLUMN_STATS_ENABLED_PREFIX + "int_field", "true")

Review Comment:
   can we also test the default behavior?

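For the default-behavior question, here is a minimal sketch of how a lookup could fall back from a per-column key to the table-wide default. It is an illustration only, not the PR's implementation: the class `ColumnStatsConfigSketch` and the method `statsEnabled` are hypothetical, table properties are modelled as a plain `Map<String, String>`, and parsing with `Boolean.parseBoolean` is an assumption. Only the two property keys and the `true` default come from the diff.

```java
import java.util.Map;

// Hypothetical sketch, not the PR's code. Only the property keys and the
// default value of true are taken from the diff under review.
final class ColumnStatsConfigSketch {
  static final String PARQUET_COLUMN_STATS_ENABLED_PREFIX =
      "write.parquet.stats-enabled.column.";
  static final String DEFAULT_KEY = PARQUET_COLUMN_STATS_ENABLED_PREFIX + "default";
  static final boolean DEFAULT_VALUE = true;

  private ColumnStatsConfigSketch() {}

  // Resolution order: per-column key, then the table-wide default key,
  // then the hard-coded default.
  static boolean statsEnabled(Map<String, String> props, String column) {
    String perColumn = props.get(PARQUET_COLUMN_STATS_ENABLED_PREFIX + column);
    if (perColumn != null) {
      // Assumption: any value other than "true" (case-insensitive) disables stats.
      return Boolean.parseBoolean(perColumn);
    }

    String tableDefault = props.get(DEFAULT_KEY);
    return tableDefault != null ? Boolean.parseBoolean(tableDefault) : DEFAULT_VALUE;
  }
}
```

Under these assumptions, a default-behavior test would cover at least three cases: no properties set, only the default key set to `false`, and a per-column `true` that overrides a `false` default.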
##########
docs/docs/configuration.md:
##########
@@ -52,6 +52,8 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.bloom-filter-enabled.column.col1 | (not set) | Hint to parquet to write a bloom filter for the column: 'col1' |
 | write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
 | write.parquet.bloom-filter-fpp.column.col1 | 0.01 | The false positive probability for a bloom filter applied to 'col1' (must > 0.0 and < 1.0) |
+| write.parquet.stats-enabled.column.default | true | Default flag to enable parquet column statistics for all columns in the table |
+| write.parquet.stats-enabled.column.col1 | (not set) | Flag to enable parquet column statistics for column 'col1' to allow per-column tuning |

Review Comment:
   What are the allowed values for `write.parquet.stats-enabled.column.col1`? Is it just true or false?

   I am thinking of something like
   ```Controls whether to collect parquet column statistics for column 'col1' ```

##########
parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java:
##########
@@ -306,33 +308,29 @@ private WriteBuilder createContextFunc(
       return this;
     }

+    // Utility method to get the column path
+    private String getParquetColumnPath(Map<Integer, String> fieldIdToParquetPath, String colPath) {
+      Types.NestedField fieldId = schema.findField(colPath);
+      if (fieldId == null) {
+        return null;
+      }
+
+      return fieldIdToParquetPath.get(fieldId.fieldId());
+    }
+
     private void setBloomFilterConfig(
         Context context,
-        MessageType parquetSchema,
+        Map<Integer, String> fieldIdToParquetPath,
         BiConsumer<String, Boolean> withBloomFilterEnabled,
         BiConsumer<String, Double> withBloomFilterFPP) {
-      Map<Integer, String> fieldIdToParquetPath =
-          parquetSchema.getColumns().stream()
-              .filter(col -> col.getPrimitiveType().getId() != null)
-              .collect(
-                  Collectors.toMap(
-                      col -> col.getPrimitiveType().getId().intValue(),
-                      col -> String.join(".", col.getPath())));
-
       context
           .columnBloomFilterEnabled()
           .forEach(
               (colPath, isEnabled) -> {
-                Types.NestedField fieldId = schema.findField(colPath);
-                if (fieldId == null) {
-                  LOG.warn("Skipping bloom filter config for missing field: {}", colPath);

Review Comment:
   I think we dropped this warning in the refactor; I want to confirm this is expected.
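
On the dropped warning, here is a minimal sketch of one way to keep it after the refactor: the shared helper stays silent and the caller logs when it gets `null` back. This is an assumption about a possible shape, not the PR's code. The class `BloomFilterPathSketch` and both method names are hypothetical, `schema.findField` is replaced by a plain name-to-field-id map, and `java.util.logging` stands in for the project's logger so the block is self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch, not the PR's code. Maps stand in for the Iceberg schema
// and for the field-id to Parquet-path mapping shown in the diff.
final class BloomFilterPathSketch {
  private static final Logger LOG = Logger.getLogger(BloomFilterPathSketch.class.getName());

  private BloomFilterPathSketch() {}

  // Mirrors the shape of getParquetColumnPath in the diff: returns null when the
  // column name is unknown or when the field id has no Parquet path.
  static String parquetColumnPath(
      Map<String, Integer> fieldIdsByName,
      Map<Integer, String> fieldIdToParquetPath,
      String colPath) {
    Integer fieldId = fieldIdsByName.get(colPath);
    if (fieldId == null) {
      return null;
    }

    return fieldIdToParquetPath.get(fieldId);
  }

  // The caller keeps the warning. Because the helper returns null for both
  // failure modes, the message cannot tell them apart.
  static Map<String, Boolean> resolveEnabled(
      Map<String, Integer> fieldIdsByName,
      Map<Integer, String> fieldIdToParquetPath,
      Map<String, Boolean> columnBloomFilterEnabled) {
    Map<String, Boolean> resolved = new LinkedHashMap<>();
    columnBloomFilterEnabled.forEach(
        (colPath, isEnabled) -> {
          String parquetPath = parquetColumnPath(fieldIdsByName, fieldIdToParquetPath, colPath);
          if (parquetPath == null) {
            LOG.log(Level.WARNING, "Skipping bloom filter config for missing field: {0}", colPath);
            return;
          }

          resolved.put(parquetPath, isEnabled);
        });
    return resolved;
  }
}
```

The trade-off in this sketch is that a single `null` return hides whether the field or the Parquet column was missing; if that distinction matters, the helper would need a richer return type.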