theogaraj opened a new issue, #45018: URL: https://github.com/apache/arrow/issues/45018
### Describe the usage question you have. Please include as many useful details as possible.

NOTE: I posted this to StackOverflow a few days ago, reposting here for more focused attention from the Arrow team. Apologies if this breaches protocol.

My code and observations are based on `pyarrow` version `18.0.0`.

**Question**: is there some internally imposed upper limit on the `min_rows_per_group` parameter of `pyarrow.dataset.write_dataset`?

I have a number of parquet files that I am trying to coalesce into larger files with larger row-groups.

#### With write_dataset (doesn't work as expected)

To do so, I am iterating over the files and using `pyarrow.parquet.ParquetFile.iter_batches` to yield record batches, which form the input to `pyarrow.dataset.write_dataset`. I would like to write row-groups of ~9 million rows.

```python
import pyarrow.parquet as pq
import pyarrow.dataset as ds
from upath import UPath  # provided by universal-pathlib

def batch_generator():
    # Stream record batches of READ_BATCH_SIZE rows from each input parquet file.
    source = UPath(INPATH)
    for f in source.glob('*.parquet'):
        with pq.ParquetFile(f) as pf:
            yield from pf.iter_batches(batch_size=READ_BATCH_SIZE)

batches = batch_generator()

ds.write_dataset(
    batches,
    OUTPATH,
    format='parquet',
    schema=SCHEMA,
    min_rows_per_group=WRITE_BATCH_SIZE,
    max_rows_per_group=WRITE_BATCH_SIZE,
    existing_data_behavior='overwrite_or_ignore'
)
```

**Notes:**
- `UPath` is from `universal-pathlib` and is essentially the equivalent of `pathlib.Path`, but works for both local and cloud storage.
- `WRITE_BATCH_SIZE` is set to ~9.1M.
- `READ_BATCH_SIZE` is a smaller value, ~12k, to lower the memory footprint while reading.

When I check my output (using `pyarrow.parquet.read_metadata`; see the short metadata-check sketch at the end of this post), I see that I have a single file (as desired), but the row-groups are no larger than ~2.1M rows.

#### Alternative approach (works as expected)

I then tried this alternative approach to coalescing the row-groups (adapted from [this gist](https://gist.github.com/NickCrews/7a47ef4083160011e8e533531d73428c)):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from upath import UPath  # provided by universal-pathlib

def batch_generator():
    # Same as before: stream record batches from each input parquet file.
    source = UPath(INPATH)
    for f in source.glob('*.parquet'):
        with pq.ParquetFile(f) as pf:
            yield from pf.iter_batches(batch_size=READ_BATCH_SIZE)

def write_parquet(outpath: UPath, tables, schema):
    # Write each combined table as (at most) one row-group of WRITE_BATCH_SIZE rows.
    with pq.ParquetWriter(outpath, schema) as writer:
        for table in tables:
            writer.write_table(table, row_group_size=WRITE_BATCH_SIZE)

def aggregate_tables():
    # Accumulate batches until roughly WRITE_BATCH_SIZE rows, then yield one combined table.
    tables = []
    current_size = 0
    for batch in batch_generator():
        table = pa.Table.from_batches([batch])
        this_size = table.num_rows
        if current_size + this_size > WRITE_BATCH_SIZE:
            yield pa.concat_tables(tables)
            tables = []
            current_size = 0
        tables.append(table)
        current_size += this_size
    if tables:
        yield pa.concat_tables(tables)

write_parquet(OUTPATH, aggregate_tables(), SCHEMA)
```

`batch_generator` remains the same as before. The key difference here is that I am manually combining the record batches yielded by `batch_generator`, using `pa.concat_tables()`, before writing each row-group.

My output file has one row-group of ~9M rows and a second row-group of ~3M rows, which is what I expect and want.

So my question is: why does `pyarrow.dataset.write_dataset` not write row-groups up to the requested value of `min_rows_per_group`, but instead seem to cap out at ~2.1M rows?

### Component(s)

Parquet, Python
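For reference, this is roughly the metadata check referred to above: a minimal sketch, assuming `OUTPATH` is the same placeholder used in the snippets and points at the single written parquet file.

```python
import pyarrow.parquet as pq

# Sketch: inspect the row-group sizes of the output file.
# OUTPATH is the placeholder path used in the snippets above.
meta = pq.read_metadata(OUTPATH)
print(f"{meta.num_row_groups} row-groups")
for i in range(meta.num_row_groups):
    print(f"row-group {i}: {meta.row_group(i).num_rows} rows")
```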