bhavya-sl opened a new issue, #46340:
URL: https://github.com/apache/arrow/issues/46340

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   **Describe the bug**
   When creating a `FileSystemDataset` using `pyarrow.dataset.parquet_dataset` with a `_metadata` file, memory consumption is approximately double the expected amount because duplicate copies of the metadata are loaded and retained in memory.
   
   **Environment**
   System: Amazon Linux 2023
   Python: 3.9.16
   PyArrow: 20.0.0
   Dataset Characteristics:
   - Metadata file size: ~200MB
   - Contains hundreds of file fragments
   - Row groups sized between 1e6 and 1e8 rows
   
   **Run Details:**
   
   - Python file:
   ```
   import pyarrow.dataset as ds
   myds = ds.parquet_dataset("/home/ec2-user/tmp/dataset/key1=value1/key2=value/_metadata")
   ```
   - Command:
   `valgrind --tool=massif --pages-as-heap=yes --time-unit=ms python3 mem_analysis.py`
   
   - Output
   ```
   ==49947== Massif, a heap profiler
   ==49947== Copyright (C) 2003-2017, and GNU GPL'd, by Nicholas Nethercote
   ==49947== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
   ==49947== Command: python3 /home/ec2-user/tmp/pricepath_memanalysis/mem_analysis.py
   ==49947== 
   ==49947== brk segment overflow in thread #1: can't grow to 0x4845000
   ==49947== (see section Limitations in user manual)
   ==49947== NOTE: further instances of this message will not be shown
   
   /home/ec2-user/.local/lib/python3.9/site-packages/numpy/_core/getlimits.py:545: UserWarning: Signature b'\x00\xd0\xcc\xcc\xcc\xcc\xcc\xcc\xfb\xbf\x00\x00\x00\x00\x00\x00' for <class 'numpy.longdouble'> does not match any known type: falling back to type probe function.
   This warnings indicates broken support for the dtype!
     machar = _get_machar(dtype)
   ==49947== 
   ```
   - Generate ms_print output:
   `ms_print massif.out.49947 > ms_print.49947`
   
   **Expected behavior**
   Declaring the dataset with `pyarrow.dataset.dataset` and calling `count_rows` on it also caches each fragment's metadata, so the memory usage of `pyarrow.dataset.parquet_dataset` should be similar.
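   
   For reference, here is a minimal sketch of the comparison I have in mind, assuming a dataset root directory derived from the metadata path above and using `psutil` RSS as a lightweight stand-in for the massif measurement (both the paths and the use of `psutil` are my assumptions, not part of the original run):
   
   ```
   import gc
   
   import psutil
   import pyarrow.dataset as ds
   
   # Hypothetical paths; adjust to the layout described above.
   DATASET_DIR = "/home/ec2-user/tmp/dataset"
   METADATA_PATH = DATASET_DIR + "/key1=value1/key2=value/_metadata"
   
   
   def rss_mb():
       """Resident set size of the current process, in MB."""
       return psutil.Process().memory_info().rss / 1e6
   
   
   # Path 1: discover fragments from the directory, then (per the expectation
   # above) cache each fragment's metadata by scanning once via count_rows().
   base = rss_mb()
   discovered = ds.dataset(DATASET_DIR, format="parquet")
   discovered.count_rows()
   print(f"ds.dataset + count_rows: +{rss_mb() - base:.1f} MB")
   
   del discovered
   gc.collect()
   
   # Path 2: build the dataset directly from the _metadata file.
   base = rss_mb()
   from_metadata = ds.parquet_dataset(METADATA_PATH)
   print(f"ds.parquet_dataset:      +{rss_mb() - base:.1f} MB")
   ```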
   
   **Actual Behavior**
   The Valgrind massif output shows duplicate memory allocations for the metadata, totaling approximately double the expected memory consumption. The `ms_print` output is attached below as a screenshot and as a file.
   
![Image](https://github.com/user-attachments/assets/2960cf2a-bd81-4251-b8fe-efd19ea38bd1)
   
[ms_print.49947.txt](https://github.com/user-attachments/files/20077031/ms_print.49947.txt)
   
   **My Analysis**
   - `ParquetDatasetFactory::Make()` loads the metadata file and creates a factory object.
   - The `factory.finish()` call in Python invokes `ParquetDatasetFactory::CollectParquetFragments()`, which creates per-fragment subsets of the metadata via `FileMetaData::FileMetaDataImpl::Subset()`.
   - The original factory object's metadata remains allocated after `finish()`, which causes the duplicate memory.
   - My thinking is that the factory's `filesystem_` and `format_` members (held as shared pointers) are passed into the `FileSystemDataset` constructor during the `Finish()` call. This shared ownership could potentially prevent the factory object, including its metadata, from being deallocated in a timely manner; a rough way to check the lifetime from Python is sketched below.
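   
   A minimal sketch of that lifetime check, again assuming `psutil` for RSS; the expectation in the comment is the hypothesis above, not a measured result:
   
   ```
   import gc
   
   import psutil
   import pyarrow.dataset as ds
   
   METADATA_PATH = "/home/ec2-user/tmp/dataset/key1=value1/key2=value/_metadata"
   
   
   def rss_mb():
       return psutil.Process().memory_info().rss / 1e6
   
   
   before = rss_mb()
   myds = ds.parquet_dataset(METADATA_PATH)
   after_build = rss_mb()
   
   # The Python-level factory created inside parquet_dataset() is already out
   # of scope here. If nothing else kept the C++ factory (and its full copy of
   # the metadata) alive, RSS should settle near the size of the per-fragment
   # subsets; if the duplicate is still referenced, it stays at roughly twice
   # the metadata size. gc.collect() only rules out reference cycles.
   gc.collect()
   after_gc = rss_mb()
   
   print(f"after build: +{after_build - before:.1f} MB")
   print(f"after gc:    +{after_gc - before:.1f} MB")
   ```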
   
   ### Component(s)
   
   Python, Parquet, C++

