Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/24088 )

Change subject: IMPALA-14589: [Part 1] Add support for Iceberg V3 default values
......................................................................


Patch Set 5:

(11 comments)

Thanks for working on this!

http://gerrit.cloudera.org:8080/#/c/24088/5/be/src/exec/file-metadata-utils.cc
File be/src/exec/file-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/24088/5/be/src/exec/file-metadata-utils.cc@52
PS5, Line 52:     AddIcebergColumns(mem_pool, &template_tuple, 
slot_descs_written);
            :
            :     PopulateIcebergDefaults(template_tuple, mem_pool);
Shouldn't the order be PopulateIcebergDefaults() then AddIcebergColumns(), to 
make partition values override default values?

I think it only works in the tests because the Parquet data files also contain 
the partition columns.


http://gerrit.cloudera.org:8080/#/c/24088/5/be/src/exec/file-metadata-utils.cc@63
PS5, Line 63:   TextConverter text_converter(/* escape_char */ '\\',
            :       scan_node_->hdfs_table()->null_partition_key_value(),
            :       /* check_null */ true, /* strict_mode */ true);
Same as L163, we could have a utility method that creates the text converter 
with the correct parameters.


http://gerrit.cloudera.org:8080/#/c/24088/5/be/src/exec/file-metadata-utils.cc@74
PS5, Line 74: if (col_idx >= scan_node_->hdfs_table()->col_descs().size()) 
continue;
Is is possible? If not, we could have a DCHECK instead.


http://gerrit.cloudera.org:8080/#/c/24088/5/be/src/exec/file-metadata-utils.cc@86
PS5, Line 86: }
Handle the case when WriteSlot() fails. See L197.


http://gerrit.cloudera.org:8080/#/c/24088/5/be/src/exec/file-metadata-utils.cc@156
PS5, Line 156: scan_node_->hdfs_table()->col_descs()
We should DCHECK that path.front() is < col_descs().size().


http://gerrit.cloudera.org:8080/#/c/24088/5/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/24088/5/be/src/exec/hdfs-scan-node-base.cc@417
PS5, Line 417: HasVirtualColumnInTemplateTuple()) return nullptr;
Shouldn't we create a template tuple whene there are columns with default 
values? But we might only want to create the template tuple if the 
corresponding column is missing from the data file.


http://gerrit.cloudera.org:8080/#/c/24088/5/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/24088/5/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@1062
PS5, Line 1062: checkForNestedDefaultValues
Can we have tests for this? It's OK to only have it with a single file format 
(Parquet), as this check is file-format agnostic.


http://gerrit.cloudera.org:8080/#/c/24088/5/testdata/workloads/functional-query/queries/QueryTest/iceberg-v3-default-values.test
File 
testdata/workloads/functional-query/queries/QueryTest/iceberg-v3-default-values.test:

http://gerrit.cloudera.org:8080/#/c/24088/5/testdata/workloads/functional-query/queries/QueryTest/iceberg-v3-default-values.test@244
PS5, Line 244: # For now, INSERT operations will use NULL for missing columns.
Can we raise an error instead when there's a write-default? As the current 
behavior is not in line with the expected Iceberg V3 behavior.

We should also have tests where we explicitly write NULLs.


http://gerrit.cloudera.org:8080/#/c/24088/5/testdata/workloads/functional-query/queries/QueryTest/iceberg-v3-default-values.test@293
PS5, Line 293: # Test that non-partitioned column 'month' would use default if 
missing
             : # month has initial-default=12
I don't see that in the results. But I think the NULLs are correct here since 
they are explicitly written in the data files.


http://gerrit.cloudera.org:8080/#/c/24088/5/testdata/workloads/functional-query/queries/QueryTest/iceberg-v3-negative.test
File 
testdata/workloads/functional-query/queries/QueryTest/iceberg-v3-negative.test:

http://gerrit.cloudera.org:8080/#/c/24088/5/testdata/workloads/functional-query/queries/QueryTest/iceberg-v3-negative.test@47
PS5, Line 47: SELECT * FROM iceberg_v3_default_value;
            : ---- RESULTS
            : 1,-1
            : 2,2
            : ---- TYPES
            : INT,INT
I think we should just remove these tests here. We already cover them in other 
.test files. Here we should only collect the "negative tests", i.e. things that 
are not supported or raise errors.


http://gerrit.cloudera.org:8080/#/c/24088/5/tests/common/file_utils.py
File tests/common/file_utils.py:

http://gerrit.cloudera.org:8080/#/c/24088/5/tests/common/file_utils.py@43
PS5, Line 43: file_format == "orc" or file_format == "parquet" or file_format 
== "avro"
nit: file_format in ["orc", "parquet", "avro"]



--
To view, visit http://gerrit.cloudera.org:8080/24088
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f1be994a336b30b17b17819091417d777a39be9
Gerrit-Change-Number: 24088
Gerrit-PatchSet: 5
Gerrit-Owner: Arnab Karmakar <[email protected]>
Gerrit-Reviewer: Arnab Karmakar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 16 Apr 2026 12:24:11 +0000
Gerrit-HasComments: Yes

Reply via email to