Satyr09 opened a new issue, #22650:
URL: https://github.com/apache/datafusion/issues/22650

   ### Is your feature request related to a problem or challenge?
   
   DataFusion's Parquet writer only exposes a row-count limit for row group 
sizing, via `ParquetOptions.max_row_group_size` 
(`datafusion.execution.parquet.max_row_group_size`, default 1M rows). There is 
no way to bound a row group by bytes.
   
   Row count is a poor proxy for a row group's size in bytes, because 
bytes-per-row varies widely with schema width. The same 
`max_row_group_size` = 1M yields a small row group for a narrow schema and a 
multi-hundred-MB row group for a wide one.
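
To make the arithmetic concrete, here is a minimal sketch in plain Rust. The helper names and the per-row averages (16 and 400 bytes) are illustrative assumptions, not measurements or DataFusion APIs, and the estimate ignores encoding and compression.

```rust
/// Estimated uncompressed size of one row group, in bytes, when the writer
/// closes a row group after `max_rows` rows.
/// `avg_row_bytes` is a hypothetical per-row average, not a DataFusion setting.
fn estimated_row_group_bytes(max_rows: u64, avg_row_bytes: u64) -> u64 {
    max_rows.saturating_mul(avg_row_bytes)
}

/// Number of whole rows that fit under a byte budget, i.e. the row count a
/// byte-based limit would effectively impose for this schema.
/// Returns `None` when `avg_row_bytes` is zero.
fn rows_within_budget(max_bytes: u64, avg_row_bytes: u64) -> Option<u64> {
    max_bytes.checked_div(avg_row_bytes)
}
```

With a limit of 1,000,000 rows, `estimated_row_group_bytes(1_000_000, 16)` is 16,000,000 bytes for the narrow case, while `estimated_row_group_bytes(1_000_000, 400)` is 400,000,000 bytes for the wide one. A 128 MiB byte budget at 400 bytes per row would instead cap the row group at about 335,544 rows.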
   
   ### Describe the solution you'd like
   
   Add an optional `max_row_group_bytes` to `ParquetOptions`, wired to 
`WriterPropertiesBuilder::set_max_row_group_bytes`.
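
A rough sketch of the intended wiring, using stand-in types so that it compiles on its own. `ParquetOptionsSketch`, `WriterPropsSketch`, and `to_writer_props` are hypothetical names, and the setter signatures here are assumptions; the real `WriterPropertiesBuilder::set_max_row_group_bytes` signature is not shown in this issue and may differ.

```rust
/// Stand-in for the relevant part of `ParquetOptions` (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct ParquetOptionsSketch {
    /// Existing row-count limit.
    max_row_group_size: usize,
    /// Proposed optional byte limit; `None` means no byte limit is applied.
    max_row_group_bytes: Option<usize>,
}

/// Stand-in for the writer properties builder; NOT the real parquet type.
#[derive(Debug, Default, Clone, PartialEq)]
struct WriterPropsSketch {
    max_row_group_rows: usize,
    max_row_group_bytes: Option<usize>,
}

impl WriterPropsSketch {
    /// Assumed builder-style setter for the row-count limit.
    fn set_max_row_group_rows(mut self, rows: usize) -> Self {
        self.max_row_group_rows = rows;
        self
    }

    /// Assumed builder-style setter for the byte limit.
    fn set_max_row_group_bytes(mut self, bytes: usize) -> Self {
        self.max_row_group_bytes = Some(bytes);
        self
    }
}

/// Copy the limits from the options onto the builder. The byte limit is
/// forwarded only when the user configured one.
fn to_writer_props(opts: &ParquetOptionsSketch) -> WriterPropsSketch {
    let mut props =
        WriterPropsSketch::default().set_max_row_group_rows(opts.max_row_group_size);
    if let Some(bytes) = opts.max_row_group_bytes {
        props = props.set_max_row_group_bytes(bytes);
    }
    props
}
```

Keeping the new field an `Option` means existing configurations, which never set it, continue to produce writer properties with only the row-count limit.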
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   The underlying writer capability is already available to DataFusion main, 
so no dependency bump is required. I have an implementation ready (config 
field, `WriterPropertiesBuilder` wiring, round-trip tests, and docs) and can 
open a PR against this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

