raulcd opened a new issue, #48138:
URL: https://github.com/apache/arrow/issues/48138

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   The `RowGroupMetadata.total_byte_size` property is supposed to return the total uncompressed size of the column data in the row group:
   
https://github.com/apache/arrow/blob/7c3d4867e40dd0100542247a61cb83520369b2d4/python/pyarrow/_parquet.pyx#L975-L978
   
   But on a [specific file from parquet-testing](https://github.com/apache/parquet-testing/blob/master/data/alltypes_plain.snappy.parquet) I've found that it returns the compressed size instead, see:
   ```python
   >>> import pyarrow.parquet as pq
   >>> reader = pq.ParquetFile('../datanomy-extra/raulcd_local/data/02-alltypes_plain.snappy.parquet')
   >>> rg = reader.metadata.row_group(0)
   >>> rg.total_byte_size
   570
   >>> sum(rg.column(j).total_compressed_size for j in range(rg.num_columns))
   570
   >>> sum(rg.column(j).total_uncompressed_size for j in range(rg.num_columns))
   532
   ```
   For a different Parquet test file used in datanomy, the value is computed correctly; that file is generated here:
   ```python
   @pytest.fixture
   def simple_parquet(tmp_path: Path) -> Path:
       """Create a simple test Parquet file with basic types.
   
       Returns:
           Path to the created Parquet file
       """
       table = pa.table(
           {
               "id": [1, 2, 3, 4, 5],
               "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
               "age": [25, 30, 35, 40, 45],
               "score": [85.5, 90.0, 78.5, 92.0, 88.5],
           }
       )
       file_path = tmp_path / "simple.parquet"
       pq.write_table(table, file_path)
       return file_path
   ```
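   A minimal self-contained sketch of the same check (not from the issue; it inlines the fixture's table into a temporary directory instead of using pytest, and uses pyarrow's default writer settings) compares `total_byte_size` against the per-column sums:
   ```python
   import tempfile
   from pathlib import Path

   import pyarrow as pa
   import pyarrow.parquet as pq

   table = pa.table(
       {
           "id": [1, 2, 3, 4, 5],
           "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
           "age": [25, 30, 35, 40, 45],
           "score": [85.5, 90.0, 78.5, 92.0, 88.5],
       }
   )
   with tempfile.TemporaryDirectory() as d:
       path = Path(d) / "simple.parquet"
       pq.write_table(table, path)
       rg = pq.ParquetFile(path).metadata.row_group(0)
       uncompressed = sum(
           rg.column(j).total_uncompressed_size for j in range(rg.num_columns)
       )
       compressed = sum(
           rg.column(j).total_compressed_size for j in range(rg.num_columns)
       )
       # For this file, total_byte_size matches the uncompressed sum as documented.
       print(rg.total_byte_size, uncompressed, compressed)
   ```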
   
   ### Component(s)
   
   Parquet, Python

