GabrielM98 opened a new issue, #15347:
URL: https://github.com/apache/iceberg/issues/15347
### Apache Iceberg version
None
### Query engine
None
### Please describe the bug 🐞
#12771 added a write table property
(`write.parquet.stats-enabled.column.<COLUMN_NAME>`) that allows statistics to be
disabled on a per-column basis. However, it appears to work for only a single
column.
When I added two properties to the table to disable stats for the
`wire_format_message` and `json_format_message` columns, stats were still
written to the Parquet file for the latter column. Here is the output from the
Parquet CLI and DuckDB that I used to confirm this:
```
Row group 0:  count: 1000  117.33 B records  start: 4  total(compressed): 114.576 kB  total(uncompressed): 675.036 kB
--------------------------------------------------------------------------------
                     type    encodings  count  avg size  nulls  min / max
...
wire_format_message  BINARY  Z _        1000   36.82 B
json_format_message  BINARY  Z _        1000   36.02 B   0      "{"eventMetadata":{"uuid":..." / "{"eventMetadata":{"uuid":..."
...

➜ duckdb
DuckDB v1.4.4 (Andium) 6ddac802ff
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D ATTACH 'warehouse' AS iceberg_catalog (
TYPE iceberg,
ENDPOINT 'http://localhost:8181',
AUTHORIZATION_TYPE 'none'
);
D SELECT * FROM
iceberg_table_properties(iceberg_catalog.events.entity_events);
┌─────────────────────────────────────────────────────────┬─────────┐
│ key │ value │
│ varchar │ varchar │
├─────────────────────────────────────────────────────────┼─────────┤
│ write.parquet.compression-codec │ zstd │
│ commit.retry.total-timeout-ms │ 120000 │
│ commit.retry.min-wait-ms │ 3000 │
│ write.parquet.stats-enabled.column.wire_format_message │ false │
│ write.distribution-mode │ hash │
│ commit.retry.num-retries │ 5 │
│ write.parquet.stats-enabled.column.json_format_message │ false │
│ commit.retry.max-wait-ms │ 60000 │
│ owner │ root │
└─────────────────────────────────────────────────────────┴─────────┘
```
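To make the expectation explicit, here is a minimal Python sketch that reads the property map shown in the DuckDB output and lists the columns whose stats should be absent from the written file. It is not Iceberg's implementation: the helper name `columns_with_stats_disabled` and the case-insensitive `"false"` comparison are assumptions made for illustration. Only the property prefix comes from the issue text.

```python
# Property prefix taken from the issue text; everything else is illustrative.
STATS_ENABLED_PREFIX = "write.parquet.stats-enabled.column."


def columns_with_stats_disabled(properties: dict[str, str]) -> list[str]:
    """Return, sorted, the columns whose per-column stats property is "false".

    Hypothetical helper: it models the behaviour the reporter expects,
    not how Iceberg actually resolves these properties.
    """
    disabled = []
    for key, value in properties.items():
        if key.startswith(STATS_ENABLED_PREFIX) and value.strip().lower() == "false":
            # Everything after the prefix is treated as the column name.
            disabled.append(key[len(STATS_ENABLED_PREFIX):])
    return sorted(disabled)


# A subset of the table properties copied from the DuckDB output above.
table_properties = {
    "write.parquet.compression-codec": "zstd",
    "write.parquet.stats-enabled.column.wire_format_message": "false",
    "write.distribution-mode": "hash",
    "write.parquet.stats-enabled.column.json_format_message": "false",
}

expected_without_stats = columns_with_stats_disabled(table_properties)
print(expected_without_stats)
```

Under these assumptions both `json_format_message` and `wire_format_message` are listed, whereas the Parquet CLI output shows min/max values still present for `json_format_message`.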
I was able to reproduce this bug in the tests by changing the string
[here](https://github.com/apache/iceberg/blob/a3c538f647a113cf61dbfd7855d9c71da8c722ba/parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java#L250)
to `"false"` and flipping the boolean in the assertion
[here](https://github.com/apache/iceberg/blob/a3c538f647a113cf61dbfd7855d9c71da8c722ba/parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java#L265)
to `false`. The test then failed because stats were still being written for
`int_field`.
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [x] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]