Fokko opened a new issue, #8598:
URL: https://github.com/apache/iceberg/issues/8598
### Apache Iceberg version
1.3.1 (latest release)
### Query engine
None
### Please describe the bug 🐞
With Iceberg there is some ambiguity around how null-count metrics are
collected for complex types. Let's focus on `list` first, which illustrates
the problem very well:
```
table {
1: some_list: optional list<2: int>
}
```
The list itself (field 1) does not track any statistics, which can be confusing.

Spark writes each record to a different file (default parallelism of 200), so
each file's metrics describe a single record. The correct behavior would be:
```sql
CREATE TABLE s.l1 SELECT array(1,2,3) AS some_list -- Expect: {1: 0, 2: 0}
UNION ALL SELECT array(1,null,3) AS some_list -- Expect: {1: 0, 2: 1}
UNION ALL SELECT null AS some_list -- Expect: {1: 1, 2: 0}
```
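To make the expected values concrete, here is a minimal Python sketch that derives the per-file null counts for the three records above, keyed by field ID. It is only an illustration of the semantics proposed in this issue, not Iceberg's actual metrics code; the function name `expected_null_counts` and the constants are made up for this example.

```python
from typing import Dict, List, Optional

LIST_FIELD_ID = 1     # some_list (the list itself)
ELEMENT_FIELD_ID = 2  # the int element inside the list


def expected_null_counts(some_list: Optional[List[Optional[int]]]) -> Dict[int, int]:
    """Null counts per field ID for a file holding exactly one record."""
    if some_list is None:
        # The list itself is null, so there are no elements to count.
        return {LIST_FIELD_ID: 1, ELEMENT_FIELD_ID: 0}
    # The list is present: count only the null elements inside it.
    null_elements = sum(1 for element in some_list if element is None)
    return {LIST_FIELD_ID: 0, ELEMENT_FIELD_ID: null_elements}


# The three records from the CREATE TABLE statement, one per file.
records = [[1, 2, 3], [1, None, 3], None]
counts = [expected_null_counts(record) for record in records]

for record, count in zip(records, counts):
    print(record, "->", count)
```

The printed maps line up with the `Expect` comments in the SQL: a null list increments the count for field 1, while a null element increments the count for field 2.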
Also, if you query for the null lists:
```sql
SELECT * FROM s.l1 WHERE some_list IS NULL
```
the filter is not used to skip any files, and all the data files are fetched:
```
2023-09-20T08:24:04.059 [206 Partial Content] s3.GetObject
warehouse.minio:9000/s/l1/metadata/06caa8ee-da4f-42f4-b6f3-51dd26a4dfde-m0.avro
172.24.0.5 617µs ⇣ 591.333µs ↑ 141 B ↓ 5.7 KiB
2023-09-20T08:24:04.173 [200 OK] s3.HeadObject
warehouse.minio:9000/s/l1/data/00000-4-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet
172.24.0.5 332µs ⇣ 0s ↑ 126 B ↓ 0 B
2023-09-20T08:24:04.178 [206 Partial Content] s3.GetObject
warehouse.minio:9000/s/l1/data/00000-4-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet
172.24.0.5 310µs ⇣ 295.75µs ↑ 141 B ↓ 8 B
2023-09-20T08:24:04.182 [206 Partial Content] s3.GetObject
warehouse.minio:9000/s/l1/data/00000-4-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet
172.24.0.5 469µs ⇣ 448.542µs ↑ 141 B ↓ 453 B
2023-09-20T08:24:04.189 [206 Partial Content] s3.GetObject
warehouse.minio:9000/s/l1/data/00000-4-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet
172.24.0.5 307µs ⇣ 292.667µs ↑ 141 B ↓ 546 B
2023-09-20T08:24:04.193 [200 OK] s3.HeadObject
warehouse.minio:9000/s/l1/data/00001-5-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet
172.24.0.5 227µs ⇣ 0s ↑ 126 B ↓ 0 B
2023-09-20T08:24:04.196 [206 Partial Content] s3.GetObject
warehouse.minio:9000/s/l1/data/00001-5-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet
172.24.0.5 205µs ⇣ 195.666µs ↑ 141 B ↓ 8 B
2023-09-20T08:24:04.198 [206 Partial Content] s3.GetObject
warehouse.minio:9000/s/l1/data/00001-5-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet
172.24.0.5 326µs ⇣ 311.334µs ↑ 141 B ↓ 452 B
2023-09-20T08:24:04.202 [206 Partial Content] s3.GetObject
warehouse.minio:9000/s/l1/data/00001-5-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet
172.24.0.5 332µs ⇣ 319.041µs ↑ 141 B ↓ 542 B
2023-09-20T08:24:04.208 [200 OK] s3.HeadObject
warehouse.minio:9000/s/l1/data/00002-6-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet
172.24.0.5 227µs ⇣ 0s ↑ 126 B ↓ 0 B
2023-09-20T08:24:04.210 [206 Partial Content] s3.GetObject
warehouse.minio:9000/s/l1/data/00002-6-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet
172.24.0.5 279µs ⇣ 267.333µs ↑ 141 B ↓ 8 B
2023-09-20T08:24:04.213 [206 Partial Content] s3.GetObject
warehouse.minio:9000/s/l1/data/00002-6-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet
172.24.0.5 467µs ⇣ 449.584µs ↑ 141 B ↓ 428 B
2023-09-20T08:24:04.217 [206 Partial Content] s3.GetObject
warehouse.minio:9000/s/l1/data/00002-6-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet
172.24.0.5 289µs ⇣ 274.75µs ↑ 141 B ↓ 504 B
```
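The effect on file pruning can be sketched as follows. This is a simplified, hypothetical model of an `IS NULL` check against per-file null counts, not Iceberg's evaluator; the file names and the helper `might_contain_null` are assumptions made for illustration. The point is that a missing count for field 1 forces every file to be read, whereas the expected counts would leave only one file.

```python
from typing import Dict, List

LIST_FIELD_ID = 1  # some_list


def might_contain_null(null_value_counts: Dict[int, int], field_id: int) -> bool:
    """Return False only when the stats prove the field has no nulls."""
    count = null_value_counts.get(field_id)
    if count is None:
        # No statistics for this field: the file cannot be skipped.
        return True
    return count > 0


def files_to_scan(files: Dict[str, Dict[int, int]], field_id: int) -> List[str]:
    """Files that may hold rows matching `field IS NULL`."""
    return [name for name, stats in files.items()
            if might_contain_null(stats, field_id)]


# Hypothetical file names; counts follow the expected behavior above.
with_list_stats = {
    "file-a.parquet": {1: 0, 2: 0},
    "file-b.parquet": {1: 0, 2: 1},
    "file-c.parquet": {1: 1, 2: 0},
}
# Same files, but without any count tracked for the list itself.
without_list_stats = {
    name: {2: stats[2]} for name, stats in with_list_stats.items()
}

pruned = files_to_scan(with_list_stats, LIST_FIELD_ID)
unpruned = files_to_scan(without_list_stats, LIST_FIELD_ID)
print(pruned)
print(unpruned)
```

With the list-level count present, only the file holding the null list needs to be opened; without it, all three files are read, which matches the requests in the log above.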