[I] pyarrow 18 high memory consumption [arrow]

via GitHub Mon, 13 Jan 2025 01:53:31 -0800


kubat-square-sense opened a new issue, #45236:
URL: https://github.com/apache/arrow/issues/45236


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   We noticed high memory consumption while reading Parquet files with pyarrow 18.1.
   Loading a 600 KB Parquet file into a 22 MB table consumes over 1 GB of memory. On 3 different machines (WSL, Linux, macOS), profiling with memray showed peak memory of 1 GB, 1.1 GB and 1.8 GB.
   
   Running the same code with pyarrow 17 consumes less than 200 MB.
   
   
   It's quite simple to reproduce. I've attached a dummy Parquet file, which consumes slightly less but still over 1 GB.
   
   ```python
   import pyarrow.parquet as pq
   
   # Read the attached file into an in-memory Arrow table.
   data = pq.read_table('test.parquet')
   
   # Size of the resulting table in MiB.
   print(data.nbytes / 1024**2)
   ```
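
The peak figures above came from memray. As a rough cross-check that needs no profiler, the process's peak resident set size can be read from the standard library before and after the read. This is only a sketch: `peak_rss_mib` and `measure_peak` are hypothetical helper names, the `resource` module is Unix-only, and peak RSS includes the interpreter and imported libraries, so it will not match memray's numbers exactly.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024**2 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure_peak(func):
    """Call func() and return (result, peak RSS before, peak RSS after) in MiB."""
    before = peak_rss_mib()
    result = func()
    after = peak_rss_mib()
    return result, before, after
```

For the reproduction above, this could be called as `measure_peak(lambda: pq.read_table('test.parquet'))` under pyarrow 17 and again under pyarrow 18.1, comparing the "after" values. Each run should be a fresh process, since peak RSS never decreases within a process.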
   
   
   
   
   [test.zip](https://github.com/user-attachments/files/18395000/test.zip)
   
   
   ### Component(s)
   
   Python


