kevinjqliu commented on code in PR #1354:
URL: https://github.com/apache/iceberg-python/pull/1354#discussion_r1855235419
##########
tests/io/test_pyarrow_stats.py:
##########
@@ -681,6 +685,39 @@ def test_stats_types(table_schema_nested: Schema) -> None:
     ]
+
+
+def test_read_missing_statistics() -> None:
+    # write statistics for only for "strings" column
+    metadata, table_metadata = construct_test_table(write_statistics=["strings"])
+
+    # expect only "strings" column to have statistics in metadata
+    assert metadata.row_group(0).column(0).is_stats_set is True
+    assert metadata.row_group(0).column(0).statistics is not None
+
+    # expect all other columns to have no statistics
+    for r in range(metadata.num_row_groups):
+        for pos in range(1, metadata.num_columns):
+            assert metadata.row_group(r).column(pos).is_stats_set is False
+            assert metadata.row_group(r).column(pos).statistics is None
+
+    schema = get_current_schema(table_metadata)
+    statistics = data_file_statistics_from_parquet_metadata(
+        parquet_metadata=metadata,
+        stats_columns=compute_statistics_plan(schema, table_metadata.properties),
+        parquet_column_mapping=parquet_path_to_id_mapping(schema),
+    )
+
+    datafile = DataFile(**statistics.to_serialized_dict())
+
+    # expect only "strings" column values to be reflected in the
+    # upper_bound, lower_bound and null_value_counts props of datafile
+    assert len(datafile.lower_bounds) == 1
+    assert datafile.lower_bounds[1].decode() == "aaaaaaaaaaaaaaaa"
+    assert len(datafile.upper_bounds) == 1
+    assert datafile.upper_bounds[1].decode() == "zzzzzzzzzzzzzzz{"
+    assert len(datafile.null_value_counts) == 1
+    assert datafile.null_value_counts[1] == 1

Review Comment:
   ah i see, for readability, wdyt about something like
   ```
   string_col_idx = 1
   datafile.lower_bounds[string_col_idx]
   ```
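   Concretely, the tail of the test might read like the following with that suggestion applied (a sketch only: the constant name is invented here, and field ID 1 is assumed to map to the "strings" column in this test's schema):
   ```python
   # Hypothetical rewrite of the assertions with the magic index named.
   # STRINGS_FIELD_ID is an illustrative name; field ID 1 is assumed to be
   # the "strings" column produced by construct_test_table.
   STRINGS_FIELD_ID = 1

   assert len(datafile.lower_bounds) == 1
   assert datafile.lower_bounds[STRINGS_FIELD_ID].decode() == "aaaaaaaaaaaaaaaa"
   assert len(datafile.upper_bounds) == 1
   assert datafile.upper_bounds[STRINGS_FIELD_ID].decode() == "zzzzzzzzzzzzzzz{"
   assert len(datafile.null_value_counts) == 1
   assert datafile.null_value_counts[STRINGS_FIELD_ID] == 1
   ```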
##########
tests/io/test_pyarrow_stats.py:
##########

Review Comment:
   ah https://github.com/apache/iceberg-python/blob/939d325693e5a484bed0377310d49f04eb1fef31/pyiceberg/io/pyarrow.py#L2305-L2306 it's calculated in `to_serialized_dict`.
   `column_aggregates` is converted to `lower_bounds` and `upper_bounds`

##########
tests/io/test_pyarrow_stats.py:
##########

Review Comment:
   we can check the `statistics` variable, but I'm now wondering if we forgot to propagate `column_aggregates` to DataFile. Otherwise, why even [update `column_aggregates`](https://github.com/apache/iceberg-python/blob/939d325693e5a484bed0377310d49f04eb1fef31/pyiceberg/io/pyarrow.py#L2387-L2388)?
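   That direct check on the intermediate `statistics` object could look roughly like this (a sketch, assuming `DataFileStatistics.column_aggregates` is a dict keyed by field ID whose aggregators expose `min_as_bytes()`/`max_as_bytes()`, as the linked lines suggest):
   ```python
   # Hypothetical assertions against the statistics object before it is
   # serialized into a DataFile. Assumes column_aggregates is keyed by
   # field ID and that only the "strings" column (ID 1) was aggregated.
   assert set(statistics.column_aggregates.keys()) == {1}
   strings_agg = statistics.column_aggregates[1]
   assert strings_agg.min_as_bytes().decode() == "aaaaaaaaaaaaaaaa"
   assert strings_agg.max_as_bytes().decode() == "zzzzzzzzzzzzzzz{"
   ```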
##########
tests/io/test_pyarrow_stats.py:
##########

Review Comment:
   and those are checked in the test already! thanks

##########
tests/io/test_pyarrow_stats.py:
##########

Review Comment:
   weird, [`DataFileStatistics` sets the `column_aggregates`](https://github.com/apache/iceberg-python/blob/939d325693e5a484bed0377310d49f04eb1fef31/pyiceberg/io/pyarrow.py#L2409), which does not exist in [DataFile](https://github.com/apache/iceberg-python/blob/63169004facb053d3ac3725388411ee7a05c4156/pyiceberg/manifest.py#L311-L330)
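   To tie the thread together: per the earlier comment, the mismatch is resolved during serialization. A minimal illustration (not the actual pyiceberg implementation, and only a few fields shown) of how a `to_serialized_dict`-style method can fold `column_aggregates` into the bounds fields that `DataFile` does have:
   ```python
   from typing import Any, Dict

   def to_serialized_dict_sketch(stats: Any) -> Dict[str, Any]:
       # column_aggregates never reaches DataFile directly; only its
       # serialized per-field min/max values do, under the
       # lower_bounds/upper_bounds keys consumed by DataFile(**...).
       return {
           "record_count": stats.record_count,
           "null_value_counts": stats.null_value_counts,
           "lower_bounds": {fid: agg.min_as_bytes() for fid, agg in stats.column_aggregates.items()},
           "upper_bounds": {fid: agg.max_as_bytes() for fid, agg in stats.column_aggregates.items()},
       }
   ```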