StuartHadfield opened a new issue, #34264: URL: https://github.com/apache/arrow/issues/34264
### Describe the usage question you have. Please include as many useful details as possible.

Hello! I was wondering if it is possible to control the file size of the resultant files when writing a parquet dataset? This is naturally quite important - controlling the size of individual files is a key part of optimising a parquet dataset. In Dask you can use https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.repartition.html to repartition a dataframe easily - I don't think we have anything similar in PyArrow, but I know I can get somewhere by doing:

```py
import pyarrow.parquet as pq

# `table` is an existing pyarrow.Table that gets written out repeatedly.
bytes_written = 0
index = 0
writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)

for i in range(300000):
    writer.write_table(table)
    bytes_written = bytes_written + table.nbytes
    if bytes_written >= 500000000:  # 500MB, start a new file
        writer.close()
        index = index + 1
        writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)
        bytes_written = 0

writer.close()
```

This limits the bytes fed to the writer, which is cool - I would have to establish the compression ratio I expect - but I don't know how to achieve something similar when writing a **dataset**, beyond partitioning on some column. If there's a recommended strategy for this, I'd be really appreciative if you could share!

### Component(s)

Python
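For reference (not part of the original question): a minimal sketch of one way to cap file sizes at the dataset level, assuming a pyarrow version where `pyarrow.dataset.write_dataset` accepts `max_rows_per_file` and `max_rows_per_group`. The example table, target size, and row-count estimate below are illustrative only.

```py
# A sketch, not a confirmed recommendation: bound dataset file sizes by
# limiting rows per file, with the per-file row count estimated from the
# in-memory size of the data (on-disk size also depends on compression).
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": list(range(1_000_000))})  # hypothetical example table

target_bytes = 500_000_000  # aim for ~500 MB of in-memory data per file
rows_per_file = max(1, int(len(table) * target_bytes / table.nbytes))

ds.write_dataset(
    table,
    "output_dataset",
    format="parquet",
    max_rows_per_file=rows_per_file,
    # row groups must not exceed the per-file row limit
    max_rows_per_group=min(rows_per_file, 1_048_576),
)
```

Because the limit is expressed in rows rather than bytes, the same compression-ratio estimate mentioned in the question still applies, and it can be combined with column-based partitioning if desired.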
