Re: [PR] [fix](regression) stabilize parquet_meta tvf assertions across s3 buckets [doris]

via GitHub Sun, 01 Mar 2026 18:17:24 -0800


Copilot commented on code in PR #60938:
URL: https://github.com/apache/doris/pull/60938#discussion_r2870216576



##########
regression-test/suites/external_table_p0/tvf/test_parquet_meta_tvf.groovy:
##########
@@ -33,37 +33,51 @@ suite("test_parquet_meta_tvf", 
"p0,external,external_docker,tvf") {
     // parquet_metadata (S3)
     // Note: Prefer asserting on stable metadata columns; avoid relying on 
host-specific/local-only paths.
     order_qt_parquet_metadata_s3 """
-        select * from parquet_meta(
+        select
+            row_group_id, row_group_num_rows, row_group_num_columns, 
row_group_bytes, column_id,
+            file_offset, num_values, path_in_schema, type, stats_min, 
stats_max, stats_null_count,
+            stats_distinct_count, stats_min_value
+        from parquet_meta(
             "uri" = "${basePath}/meta.parquet",
             "s3.access_key" = "${ak}",
             "s3.secret_key" = "${sk}",
             "endpoint" = "${endpoint}",
             "region" = "${region}",
             "mode" = "parquet_metadata"
-        );
+        )
+        order by row_group_id, column_id;
     """
 
     // default mode: parquet_metadata
     order_qt_parquet_metadata_default_mode """
-        select * from parquet_meta(
+        select
+            row_group_id, row_group_num_rows, row_group_num_columns, 
row_group_bytes, column_id,
+            file_offset, num_values, path_in_schema, type, stats_min, 
stats_max, stats_null_count,
+            stats_distinct_count, stats_min_value

Review Comment:
   The stable column projection list is duplicated across the S3 and 
default-mode parquet_metadata queries. To reduce maintenance overhead (and 
avoid future drift), consider extracting this shared column list into a single 
Groovy variable/string and interpolating it into both query strings.



##########
regression-test/suites/external_table_p0/tvf/test_parquet_meta_tvf.groovy:
##########
@@ -33,37 +33,51 @@ suite("test_parquet_meta_tvf", 
"p0,external,external_docker,tvf") {
     // parquet_metadata (S3)
     // Note: Prefer asserting on stable metadata columns; avoid relying on 
host-specific/local-only paths.
     order_qt_parquet_metadata_s3 """
-        select * from parquet_meta(
+        select
+            row_group_id, row_group_num_rows, row_group_num_columns, 
row_group_bytes, column_id,
+            file_offset, num_values, path_in_schema, type, stats_min, 
stats_max, stats_null_count,
+            stats_distinct_count, stats_min_value

Review Comment:
   The stable column projection list is duplicated across the S3 and 
default-mode parquet_metadata queries. To reduce maintenance overhead (and 
avoid future drift), consider extracting this shared column list into a single 
Groovy variable/string and interpolating it into both query strings.



##########
regression-test/suites/external_table_p0/tvf/test_parquet_meta_tvf.groovy:
##########
@@ -104,7 +119,7 @@ suite("test_parquet_meta_tvf", 
"p0,external,external_docker,tvf") {
 
     // file metadata (S3 glob)
     order_qt_parquet_file_metadata_s3_glob """
-        select file_name from parquet_meta(
+        select count(*) from parquet_meta(
             "uri" = "${basePath}/*meta.parquet",

Review Comment:
   Switching the glob assertion from `select file_name` to `count(*)` makes the 
test significantly weaker (it can pass even if the *wrong* set of files is 
returned, as long as the total count matches). A more robust, bucket-stable 
assertion is to project a stable derived value from `file_name` (e.g., the 
basename via a regex) and order by it, so the test still validates which files 
are present without depending on the bucket/URI prefix.



##########
regression-test/suites/external_table_p0/tvf/test_parquet_meta_tvf.groovy:
##########
@@ -33,37 +33,51 @@ suite("test_parquet_meta_tvf", 
"p0,external,external_docker,tvf") {
     // parquet_metadata (S3)
     // Note: Prefer asserting on stable metadata columns; avoid relying on 
host-specific/local-only paths.
     order_qt_parquet_metadata_s3 """
-        select * from parquet_meta(
+        select
+            row_group_id, row_group_num_rows, row_group_num_columns, 
row_group_bytes, column_id,
+            file_offset, num_values, path_in_schema, type, stats_min, 
stats_max, stats_null_count,
+            stats_distinct_count, stats_min_value

Review Comment:
   The projection includes `stats_min_value` but not the corresponding 
`stats_max_value`. If the intent is to keep the assertion symmetric and 
interpretable (especially for non-numeric types), consider selecting both 
`stats_min_value` and `stats_max_value` (or neither) so the test coverage of 
stats doesn’t become lopsided.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [fix](regression) stabilize parquet_meta tvf assertions across s3 buckets [doris]

Reply via email to