bhavya-sl opened a new issue, #46340: URL: https://github.com/apache/arrow/issues/46340
### Describe the bug, including details regarding any error messages, version, and platform.

**Describe the bug**

When creating a `FileSystemDataset` using `pyarrow.dataset.parquet_dataset` with a metadata file, memory consumption is approximately double the expected amount because duplicate metadata is loaded and retained in memory.

**Environment**

- System: Amazon Linux 2023
- Python: 3.9.16
- PyArrow: 20.0.0

Dataset characteristics:

- Metadata file size: ~200MB
- Contains hundreds of file fragments
- Row groups of between 1e6 and 1e8 rows

**Run details**

Python file:

```
import pyarrow.dataset as ds

myds = ds.parquet_dataset("/home/ec2-user/tmp/dataset/key1=value1/key2=value/_metadata")
```

Command:

```
valgrind --tool=massif --pages-as-heap=yes --time-unit=ms python3 mem_analysis.py
```

Output:

```
==49947== Massif, a heap profiler
==49947== Copyright (C) 2003-2017, and GNU GPL'd, by Nicholas Nethercote
==49947== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==49947== Command: python3 /home/ec2-user/tmp/pricepath_memanalysis/mem_analysis.py
==49947==
==49947== brk segment overflow in thread #1: can't grow to 0x4845000
==49947== (see section Limitations in user manual)
==49947== NOTE: further instances of this message will not be shown
/home/ec2-user/.local/lib/python3.9/site-packages/numpy/_core/getlimits.py:545: UserWarning: Signature b'\x00\xd0\xcc\xcc\xcc\xcc\xcc\xcc\xfb\xbf\x00\x00\x00\x00\x00\x00' for <class 'numpy.longdouble'> does not match any known type: falling back to type probe function. This warnings indicates broken support for the dtype!
  machar = _get_machar(dtype)
==49947==
```

Generate the ms_print output:

```
ms_print massif.out.49947 > ms_print.49947
```

**Expected behavior**

Declaring the dataset with `pyarrow.dataset.dataset` and calling `count_rows` on it caches the fragments' metadata, so the memory usage of `pyarrow.dataset.parquet_dataset` should be similar (see the RSS-comparison sketch at the end of this report).

**Actual behavior**

The Valgrind massif output shows duplicate memory allocations for the metadata, totaling approximately double the expected memory consumption. The output of the `ms_print` command is attached as a screenshot and as a file.



[ms_print.49947.txt](https://github.com/user-attachments/files/20077031/ms_print.49947.txt)

**My analysis**

- `ParquetDatasetFactory::Make()` loads the metadata file and creates a factory object.
- The `factory.finish()` call in Python invokes `ParquetDatasetFactory::CollectParquetFragments()`, which creates per-fragment subsets of the metadata via `FileMetaData::FileMetaDataImpl::Subset()`.
- The original factory object's metadata remains allocated after `finish()`, which causes the duplicate memory usage.
- My hypothesis is that the factory's `filesystem_` and `format_` members (held as shared pointers) are passed into the `FileSystemDataset` constructor during the `Finish()` call. This shared ownership could prevent timely deallocation of the factory object, including its metadata.

### Component(s)

Python, Parquet, C++
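**Appendix: RSS comparison sketch**

A rough way to compare peak memory of the two code paths, assuming the dataset layout described above (the script name `rss_compare.py`, the `discovery` label, and the `DATASET_ROOT` value are placeholders, not part of the original run); run each variant in its own process so the peaks don't mix:

```
import resource
import sys

import pyarrow.dataset as ds

# Placeholder paths matching the layout described above.
METADATA_PATH = "/home/ec2-user/tmp/dataset/key1=value1/key2=value/_metadata"
DATASET_ROOT = "/home/ec2-user/tmp/dataset"

variant = sys.argv[1] if len(sys.argv) > 1 else "metadata"

if variant == "metadata":
    # Code path under investigation: builds the FileSystemDataset from the
    # _metadata file via ParquetDatasetFactory.
    dataset = ds.parquet_dataset(METADATA_PATH)
else:
    # Baseline: discovery-based dataset; count_rows() parses and caches the
    # per-fragment metadata once.
    dataset = ds.dataset(DATASET_ROOT, format="parquet")
    dataset.count_rows()

# On Linux, ru_maxrss is reported in kilobytes.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"{variant}: peak RSS {peak_kb / 1024:.1f} MiB")
```

Run as `python3 rss_compare.py metadata` and `python3 rss_compare.py discovery`; given the massif output above, I'd expect the `metadata` variant to show roughly one extra copy of the ~200MB metadata.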