GabrielM98 opened a new issue, #15347:
URL: https://github.com/apache/iceberg/issues/15347

   ### Apache Iceberg version
   
   None
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   #12771 added a write table property 
(`write.parquet.stats-enabled.column.<COLUMN_NAME>`) to allow statistics to be 
disabled on a per-column basis. However, it appears to take effect for only a 
single column.
   
   When I added two properties to the table to disable stats for the 
`wire_format_message` and `json_format_message` columns, stats were still 
written to the Parquet file for the latter column. Here's some output from the 
Parquet CLI and DuckDB that I used to confirm this:
   
   ```
   Row group 0:  count: 1000  117.33 B records  start: 4  total(compressed): 114.576 kB total(uncompressed):675.036 kB
   --------------------------------------------------------------------------------
                                                                         type      encodings count     avg size   nulls   min / max
   ...
   wire_format_message                                                   BINARY    Z   _     1000      36.82 B
   json_format_message                                                   BINARY    Z   _     1000      36.02 B    0       "{"eventMetadata":{"uuid":..." / "{"eventMetadata":{"uuid":..."
   ...
   
   ➜  duckdb
   DuckDB v1.4.4 (Andium) 6ddac802ff
   Enter ".help" for usage hints.
   Connected to a transient in-memory database.
   Use ".open FILENAME" to reopen on a persistent database.
   D ATTACH 'warehouse' AS iceberg_catalog (
         TYPE iceberg,
         ENDPOINT 'http://localhost:8181',
         AUTHORIZATION_TYPE 'none'
       );
   D SELECT * FROM iceberg_table_properties(iceberg_catalog.events.entity_events);
   ┌─────────────────────────────────────────────────────────┬─────────┐
   │                           key                           │  value  │
   │                         varchar                         │ varchar │
   ├─────────────────────────────────────────────────────────┼─────────┤
   │ write.parquet.compression-codec                         │ zstd    │
   │ commit.retry.total-timeout-ms                           │ 120000  │
   │ commit.retry.min-wait-ms                                │ 3000    │
   │ write.parquet.stats-enabled.column.wire_format_message  │ false   │
   │ write.distribution-mode                                 │ hash    │
   │ commit.retry.num-retries                                │ 5       │
   │ write.parquet.stats-enabled.column.json_format_message  │ false   │
   │ commit.retry.max-wait-ms                                │ 60000   │
   │ owner                                                   │ root    │
   └─────────────────────────────────────────────────────────┴─────────┘
   ```
   
   I was able to reproduce this bug in the tests by changing the string 
[here](https://github.com/apache/iceberg/blob/a3c538f647a113cf61dbfd7855d9c71da8c722ba/parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java#L250)
 to `"false"` and flipping the boolean in the assertion 
[here](https://github.com/apache/iceberg/blob/a3c538f647a113cf61dbfd7855d9c71da8c722ba/parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java#L265)
 to `false`. The test then failed because stats were still being written for 
the `int_field` column.
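
   To make the expected behaviour concrete, here is a minimal, self-contained 
sketch of the semantics I would expect: every column whose property is set to 
`false` should end up with stats disabled, independently of the others. This is 
not Iceberg's actual implementation. The class name `ColumnStatsProps` and the 
method `columnsWithStatsDisabled` are hypothetical and exist only for 
illustration; only the property prefix is taken from the table properties above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, NOT Iceberg code: illustrates the expected semantics only.
public class ColumnStatsProps {
  // Prefix copied from the property name quoted in this issue.
  static final String PREFIX = "write.parquet.stats-enabled.column.";

  // Returns the (sorted) names of all columns whose stats property is "false".
  static Set<String> columnsWithStatsDisabled(Map<String, String> props) {
    Set<String> disabled = new TreeSet<>();
    for (Map.Entry<String, String> entry : props.entrySet()) {
      String key = entry.getKey();
      // Each matching property is evaluated on its own, so any number of
      // columns can be disabled at once.
      if (key.startsWith(PREFIX) && "false".equalsIgnoreCase(entry.getValue())) {
        disabled.add(key.substring(PREFIX.length()));
      }
    }
    return disabled;
  }

  public static void main(String[] args) {
    // A subset of the table properties shown in the DuckDB output above.
    Map<String, String> props = new LinkedHashMap<>();
    props.put("write.parquet.compression-codec", "zstd");
    props.put(PREFIX + "wire_format_message", "false");
    props.put("write.distribution-mode", "hash");
    props.put(PREFIX + "json_format_message", "false");

    // prints [json_format_message, wire_format_message]
    System.out.println(columnsWithStatsDisabled(props));
  }
}
```

   With the table properties listed above, I would expect both 
`wire_format_message` and `json_format_message` to be treated as disabled, 
whereas the written file only reflects the first one.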
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [x] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

