raulcd opened a new issue, #48138

URL: https://github.com/apache/arrow/issues/48138
### Describe the bug, including details regarding any error messages, version, and platform.

The `RowGroupMetadata.total_byte_size` property is supposed to report the total uncompressed size of the row group: https://github.com/apache/arrow/blob/7c3d4867e40dd0100542247a61cb83520369b2d4/python/pyarrow/_parquet.pyx#L975-L978

But on a [specific file from parquet-testing](https://github.com/apache/parquet-testing/blob/master/data/alltypes_plain.snappy.parquet) I've found that it returns the compressed size instead:

```python
>>> import pyarrow.parquet as pq
>>> reader = pq.ParquetFile('../datanomy-extra/raulcd_local/data/02-alltypes_plain.snappy.parquet')
>>> rg = reader.metadata.row_group(0)
>>> rg.total_byte_size
570
>>> sum(rg.column(j).total_compressed_size for j in range(rg.num_columns))
570
>>> sum(rg.column(j).total_uncompressed_size for j in range(rg.num_columns))
532
```

For a different Parquet test file used on datanomy, the value is computed correctly; that file is generated here:

```python
@pytest.fixture
def simple_parquet(tmp_path: Path) -> Path:
    """Create a simple test Parquet file with basic types.

    Returns:
        Path to the created Parquet file
    """
    table = pa.table(
        {
            "id": [1, 2, 3, 4, 5],
            "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
            "age": [25, 30, 35, 40, 45],
            "score": [85.5, 90.0, 78.5, 92.0, 88.5],
        }
    )
    file_path = tmp_path / "simple.parquet"
    pq.write_table(table, file_path)
    return file_path
```

### Component(s)

Parquet, Python
