vkaff opened a new issue, #43387:
URL: https://github.com/apache/arrow/issues/43387
### Describe the bug, including details regarding any error messages, version, and platform.
Let's say we use the following code to create some test Parquet files:
```python
import pandas as pd
from itertools import repeat

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

schema: pa.Schema = pa.schema(
    [
        ("id", pa.int64()),
        ("tag", pa.string()),
    ]
)

fast_write_options: dict = dict(
    compression="lz4",
    version="2.6",
    write_statistics=True,
)

for i, suffix in enumerate(
    {"", "_10", "_20", "_30", "_40", "_50", "_60", "_70", "_80", "_90", "_100"}
):
    outputs = []
    for tag in {"first", "second", "third", "fourth", "fifth", "sixth",
                "seventh", "eighth", "nineth", "tenth"}:
        outputs.append(
            pd.DataFrame(
                {
                    "id": list(range((i + 1) * int(1e6))),
                    "tag": list(repeat(tag, (i + 1) * int(1e6))),
                }
            )
        )
    df = pd.concat(outputs, ignore_index=True)

    writer = pq.ParquetWriter(
        f"./test{suffix}.parquet",
        schema=schema,
        filesystem=pa.fs.LocalFileSystem(),
        **fast_write_options,
    )
    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    writer.write_table(table)
    writer.close()
```
Then, I partition them based on the `tag` field:
```python
dataset_files = []
for f in {"", "_10", "_20", "_30", "_40", "_50", "_60", "_70", "_80", "_90", "_100"}:
    dataset_files.append(f"./test{f}.parquet")

input_datasets = ds.dataset(
    dataset_files,
    schema=schema,
    format="parquet",
    filesystem=pa.fs.LocalFileSystem(),
)
scanner = input_datasets.scanner(
    batch_readahead=4,
    fragment_readahead=1,
    use_threads=True,
)
ds.write_dataset(
    scanner,
    base_dir="base_part_dir",
    basename_template="part-{i}.parquet",
    format="parquet",
    filesystem=pa.fs.LocalFileSystem(),
    partitioning=ds.partitioning(pa.schema([("tag", pa.string())]), flavor="hive"),
    max_rows_per_group=1_024,
    min_rows_per_group=1_024,
    max_rows_per_file=20 * 10_480,
    existing_data_behavior="overwrite_or_ignore",
    file_options=ds.ParquetFileFormat().make_write_options(**fast_write_options),
    use_threads=False,
)
```
Then, I try to read the `row_groups` of each fragment:
```python
import time

import pyarrow as pa
import pyarrow.dataset as ds

schema: pa.Schema = pa.schema(
    [
        ("id", pa.int64()),
        ("tag", pa.string()),
    ]
)


def main():
    intermediate_dataset = ds.dataset(
        "base_part_dir",
        schema=schema,
        format="parquet",
        partitioning=ds.partitioning(pa.schema([("tag", pa.string())]), flavor="hive"),
    )
    keys = []
    for f in intermediate_dataset.get_fragments():
        keys.append(ds.get_partition_keys(f.partition_expression)["tag"])
        f.row_groups


if __name__ == "__main__":
    time.sleep(10)
    print("started")
    while True:
        main()
        print("Sleeping...")
        time.sleep(10)
```
On startup the process uses approximately 128 MB of memory; after `main()` has executed once, usage climbs to about 900 MB and stays there across successive iterations.
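For reference, a minimal sketch of how the resident memory can be tracked between iterations (using `psutil` is my assumption here, not something used in the original setup; any RSS-reporting tool works):

```python
import os

import psutil  # assumed available; purely for measurement


def rss_mb() -> float:
    """Return the resident set size of the current process in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1e6


# Example usage inside the loop above:
#     print(f"RSS after main(): {rss_mb():.0f} MB")
```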
When **max_rows_per_group / min_rows_per_group** are changed to 65_536, memory usage does not increase at all, or only very slightly.
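For clarity, that comparison only changes the row-group sizing in the `ds.write_dataset` call above; a sketch of the modified arguments (everything else unchanged):

```python
ds.write_dataset(
    scanner,
    base_dir="base_part_dir",
    basename_template="part-{i}.parquet",
    format="parquet",
    filesystem=pa.fs.LocalFileSystem(),
    partitioning=ds.partitioning(pa.schema([("tag", pa.string())]), flavor="hive"),
    # larger row groups -> far fewer row groups (and row-group metadata) per file
    max_rows_per_group=65_536,
    min_rows_per_group=65_536,
    max_rows_per_file=20 * 10_480,
    existing_data_behavior="overwrite_or_ignore",
    file_options=ds.ParquetFileFormat().make_write_options(**fast_write_options),
    use_threads=False,
)
```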
My question is: why is this happening? First, I would expect that reading metadata, i.e. `f.row_groups`, would use only minimal memory. Second, I would expect the memory to be released once `main()` has finished.
Is this a bug, is it expected behavior, or am I using the library incorrectly? Is there something I can do to mitigate it?
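As a possible diagnostic (a sketch based on pyarrow's memory-pool API, not part of the reproduction above), one could check whether the memory is still held by Arrow's pool or only retained by the allocator, and ask the pool to hand unused memory back:

```python
import pyarrow as pa

pool = pa.default_memory_pool()

# How much the Arrow pool itself still holds after main() returns.
print("pool backend:", pool.backend_name)
print("pool bytes_allocated:", pool.bytes_allocated())
print("total allocated bytes:", pa.total_allocated_bytes())

# Ask the pool to return unused memory to the OS; depending on the
# allocator backend (jemalloc/mimalloc/system) this may or may not lower RSS.
pool.release_unused()
```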
### Component(s)
Python