[I] [Python][Parquet] Memory leak when using ParquetWriter gets our K8s pods killed [arrow]

via GitHub Fri, 28 Mar 2025 20:29:05 -0700


Ark-kun opened a new issue, #45971:
URL: https://github.com/apache/arrow/issues/45971


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   We fetch data batches from BigQuery and write them to Parquet. The Parquet writer eats up all available memory and crashes the pod.
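
One way to observe the growth while the batches are being written is to sample the process's peak resident set size. This is a minimal sketch using only the standard library (Unix only); the helper names `peak_rss_mib` and `track_peak_rss` and the `every` parameter are hypothetical, not part of pyarrow or google-cloud-bigquery.

```python
import resource
import sys
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def track_peak_rss(
    items: Iterable[T],
    samples: List[Tuple[int, float]],
    every: int = 100,
) -> Iterator[T]:
    """Yield items unchanged, recording (count, peak RSS) every `every` items.

    The sample is taken after the consumer has finished with the item,
    i.e. after the batch has been handed to the writer.
    """
    for count, item in enumerate(items, start=1):
        yield item
        if count % every == 0:
            samples.append((count, peak_rss_mib()))
```

Wrapping the batch iterable, for example `for batch in track_peak_rss(result.to_arrow_iterable(...), samples)`, leaves the write loop unchanged. Peak RSS never decreases, so a steadily rising series only shows that memory keeps growing; it does not by itself distinguish a leak from buffering.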
   
   It's not jemalloc or a similar allocator issue, since we do not see the problem when we periodically create new ParquetWriter instances.
   
   This code leaks:
   
   ```py
   from google.cloud import bigquery
   # NOTE: _pandas_helpers is a private module of google-cloud-bigquery.
   from google.cloud.bigquery import _pandas_helpers
   from pyarrow import parquet
   
   client = bigquery.Client(project=...)
   job = client.get_job(job_id=...)
   
   result = job.result()
   # Convert the BigQuery result schema to an Arrow schema for the writer.
   arrow_schema = _pandas_helpers.bq_to_arrow_schema(result.schema)
   bqstorage_client = client._ensure_bqstorage_client()
   # A single long-lived writer receives every batch.
   with parquet.ParquetWriter(where="result.parquet", schema=arrow_schema) as writer:
       # Stream batches one at a time (queue size 1, single stream).
       for batch in result.to_arrow_iterable(
           bqstorage_client=bqstorage_client,
           max_queue_size=1,
           max_stream_count=1,
       ):
           writer.write_batch(batch)
   ```
   
   
![Image](https://github.com/user-attachments/assets/8df18d91-ec61-43d3-93c5-328a9e2f5826)
   
   Initially we thought that the bug was in BigQuery, but we were wrong: https://github.com/googleapis/python-bigquery/issues/2151
   
   
   Proof:
   
   Changing from
   ```py
   with parquet.ParquetWriter(where="result.parquet", schema=arrow_schema) as writer:
       for batch in result.to_arrow_iterable(...):
           writer.write_batch(batch)
   ```
   
   to
   ```py
   for batch in result.to_arrow_iterable(...):
       with parquet.ParquetWriter(where="result.parquet", schema=arrow_schema) as writer:
           writer.write_batch(batch)
   ```
   fixes the memory leak.
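
Note that the second variant reopens `result.parquet` for every batch, so each writer overwrites the previous file; it shows that the growth is tied to a long-lived writer, but it is not a usable workaround. Below is a minimal sketch of the periodic-rotation idea mentioned earlier, assuming it is acceptable to split the output across several files. The helper name `write_rotating`, the `open_writer` callback, and the `batches_per_file` default are hypothetical and not part of pyarrow; the helper only relies on the writer exposing `write_batch()` and `close()`, which `ParquetWriter` does.

```python
from typing import Any, Callable, Iterable


def write_rotating(
    batches: Iterable[Any],
    open_writer: Callable[[int], Any],
    batches_per_file: int = 100,
) -> int:
    """Write batches, switching to a fresh writer every `batches_per_file` batches.

    `open_writer(i)` must return an object with `write_batch()` and `close()`
    for the i-th output file. Returns the number of writers opened.
    """
    writer = None
    files_opened = 0
    written_in_current = 0
    try:
        for batch in batches:
            if writer is None:
                # Open writers lazily so an empty input creates no files.
                writer = open_writer(files_opened)
                files_opened += 1
                written_in_current = 0
            writer.write_batch(batch)
            written_in_current += 1
            if written_in_current >= batches_per_file:
                # Closing the writer finalizes the current file.
                writer.close()
                writer = None
    finally:
        # Close a partially filled last file, also on error.
        if writer is not None:
            writer.close()
    return files_opened


# Hypothetical usage with pyarrow (the file naming scheme is an assumption):
#
# write_rotating(
#     result.to_arrow_iterable(...),
#     lambda i: parquet.ParquetWriter(
#         where=f"result-{i:05d}.parquet", schema=arrow_schema
#     ),
# )
```

Each output file is a complete Parquet file, so readers that accept a directory or a list of files can treat the set as one dataset.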
   
   Versions: "pyarrow==19.0.1", "pyarrow==16.1.0"
   
   ### Component(s)
   
   Parquet


