theogaraj opened a new issue, #45018:
URL: https://github.com/apache/arrow/issues/45018

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   NOTE: I posted this to StackOverflow a few days ago and am reposting here for more focused attention from the Arrow team. Apologies if this breaches protocol.
   
   My code and observations are based on `pyarrow` version `18.0.0`.
   
   **Question**: is there some internally imposed upper limit on the `min_rows_per_group` parameter of `pyarrow.dataset.write_dataset`?
   
   I have a number of parquet files that I am trying to coalesce into larger files with larger row-groups.
   
   #### With write_dataset (doesn't work as expected)
   
   To do so, I iterate over the files and use `pyarrow.parquet.ParquetFile.iter_batches` to yield record batches, which form the input to `pyarrow.dataset.write_dataset`. I would like to write row-groups of ~9 million rows.
   
   ```python
   from upath import UPath  # universal-pathlib
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds
   
   def batch_generator():
       # Stream record batches from every parquet file under INPATH.
       source = UPath(INPATH)
       for f in source.glob('*.parquet'):
           with pq.ParquetFile(f) as pf:
               yield from pf.iter_batches(batch_size=READ_BATCH_SIZE)
   
   batches = batch_generator()
   ds.write_dataset(
       batches,
       OUTPATH,
       format='parquet',
       schema=SCHEMA,
       min_rows_per_group=WRITE_BATCH_SIZE,
       max_rows_per_group=WRITE_BATCH_SIZE,
       existing_data_behavior='overwrite_or_ignore',
   )
   ```
   
   **Notes:**
   
   - `UPath` is from `universal-pathlib`; it is essentially the equivalent of `pathlib.Path`, but works for both local and cloud storage.
   - `WRITE_BATCH_SIZE` is set to ~9.1M rows.
   - `READ_BATCH_SIZE` is a smaller value (~12k) to keep the memory footprint low while reading.
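
   For concreteness, the size constants in the snippets are defined along these lines (INPATH, OUTPATH, and SCHEMA are omitted; the exact values are approximate):
   
   ```python
   # Approximate values; INPATH, OUTPATH, and SCHEMA are elided
   # (input/output locations plus the shared pyarrow schema).
   READ_BATCH_SIZE = 12_000       # small read batches to keep memory low
   WRITE_BATCH_SIZE = 9_100_000   # target row-group size, ~9.1M rows
   ```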
   
   
   When I check my output (using `pyarrow.parquet.read_metadata`), I see that I have a single file (as desired), but the row-groups are no larger than ~2.1M rows.
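
   For reference, the check looks roughly like this (`OUTFILE` is a hypothetical placeholder for the single file written under `OUTPATH`):
   
   ```python
   import pyarrow.parquet as pq
   
   # OUTFILE is a placeholder for the actual output file path
   meta = pq.read_metadata(OUTFILE)
   for i in range(meta.num_row_groups):
       print(f"row group {i}: {meta.row_group(i).num_rows} rows")
   ```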
   
   #### Alternative approach (works as expected)
   I then tried this alternative approach to coalescing the row-groups (adapted from [this gist](https://gist.github.com/NickCrews/7a47ef4083160011e8e533531d73428c)):
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   from upath import UPath  # universal-pathlib
   
   def batch_generator():
       source = UPath(INPATH)
       for f in source.glob('*.parquet'):
           with pq.ParquetFile(f) as pf:
               yield from pf.iter_batches(batch_size=READ_BATCH_SIZE)
   
   def write_parquet(outpath: UPath, tables):
       with pq.ParquetWriter(outpath, SCHEMA) as writer:
           for table in tables:
               writer.write_table(table, row_group_size=WRITE_BATCH_SIZE)
   
   def aggregate_tables():
       # Accumulate tables until the next batch would push the total past
       # WRITE_BATCH_SIZE, then yield them as one concatenated table.
       tables = []
       current_size = 0
       for batch in batch_generator():
           table = pa.Table.from_batches([batch])
           this_size = table.num_rows
           if current_size + this_size > WRITE_BATCH_SIZE:
               yield pa.concat_tables(tables)
               tables = []
               current_size = 0
           tables.append(table)
           current_size += this_size
       if tables:
           yield pa.concat_tables(tables)
   
   write_parquet(OUTPATH, aggregate_tables())
   ```
   
   `batch_generator` remains the same as before. The key difference is that I manually combine the record batches issued by `batch_generator` into larger tables using `pa.concat_tables()` before writing.
   
   My output file has one row-group of ~9M rows and a second row-group of ~3M rows, which is what I expect and want.
   
   So my question is: why does `pyarrow.dataset.write_dataset` not write row-groups up to the requested `min_rows_per_group`, but instead appears to cap them at ~2.1M rows?
   
   ### Component(s)
   
   Parquet, Python

