JigaoLuo opened a new issue, #47955: URL: https://github.com/apache/arrow/issues/47955
### Describe the bug, including details regarding any error messages, version, and platform.

I encountered a bug while trying to retrieve Parquet metadata for a column chunk with logical type `Decimal128(15, 2)`.

- The Parquet file was generated using arrow-rs, and I can successfully access its metadata via `arrow-rs`, `DataFusion`, or this tool: https://parquet-viewer.xiangpeng.systems/
- **However, I run into an error when attempting to read the metadata using PyArrow.**

I’ll attach the Parquet file (under 50MB) along with a minimal Python script to reproduce the issue. If the bug isn’t reproducible on your end, I’m happy to help investigate further.

```python
#!/usr/bin/env python3
# $ python parquet_metadata_reader.py supplier.parquet

import sys
import pyarrow.parquet as pq

def print_parquet_metadata(parquet_file):
    pq_metadata = pq.read_metadata(parquet_file)
    schema = pq_metadata.schema.to_arrow_schema()

    for col_idx in range(len(schema)):
        field = schema.field(col_idx)
        col_name = field.name
        column_meta = pq_metadata.schema.column(col_idx)
        print(f"Column {col_idx}: {col_name}")
        print(f"  Type: {column_meta.physical_type}")
        row_group = pq_metadata.row_group(0)  # Stats of the first row group
        rg_column = row_group.column(col_idx)
        print("  Stats:", rg_column.statistics)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python parquet_metadata_reader.py <parquet_file>")
        sys.exit(1)
    try:
        print_parquet_metadata(sys.argv[1])
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)
```

The error message:

```
Column 0: c_custkey
  Type: INT64
  Stats: <pyarrow._parquet.Statistics object at 0x7f11d4accd60>
  has_min_max: True
  min: 1
  max: 14999999
  null_count: 0
  distinct_count: None
  num_values: 3000188
  physical_type: INT64
  logical_type: None
  converted_type (legacy): NONE
Column 1: c_nationkey
  Type: INT32
  Stats: <pyarrow._parquet.Statistics object at 0x7f11d4accd10>
  has_min_max: True
  min: 0
  max: 24
  null_count: 0
  distinct_count: None
  num_values: 3000188
  physical_type: INT32
  logical_type: None
  converted_type (legacy): NONE
Column 2: c_acctbal
  Type: INT64
  Stats: Error: Cannot extract statistics for type
```

Thanks!

## Version

I installed `pyarrow` via conda:

```bash
$ conda list | grep pyarrow
pyarrow                   21.0.0          py313h78bf25f_1         conda-forge
pyarrow-core              21.0.0          py313he109ebe_1_cpu     conda-forge
```

## Platform

I run on bare metal with an `AMD EPYC 7742 64-Core Processor` CPU, on Ubuntu with NVIDIA's `5.15.0-1042-nvidia` kernel:

```bash
$ uname -a
Linux dgx 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
```

## Related issue (?)

I could only find a similar issue, though not exactly the same one: https://github.com/microsoft/semantic-link-labs/issues/909

### Component(s)

Parquet, Python

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
