StuartHadfield opened a new issue, #34264:
URL: https://github.com/apache/arrow/issues/34264

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Hello!
   
   I was wondering whether it is possible to control the file size of the resulting files when writing a Parquet dataset? This is naturally quite important - controlling the size of individual files is a key part of optimising a Parquet dataset.
   
   In Dask you can use https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.repartition.html to repartition a dataframe easily. I don't think we have anything similar in PyArrow, but I know I can get somewhere by doing:
   
   ```py
   import pyarrow.parquet as pq

   bytes_written = 0
   index = 0
   writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)

   for i in range(300000):
       writer.write_table(table)
       bytes_written += table.nbytes  # uncompressed, in-memory size
       if bytes_written >= 500_000_000:  # ~500MB fed to the writer, start a new file
           writer.close()
           index += 1
           writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)
           bytes_written = 0

   writer.close()
   ```
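
   In case it helps frame the question: I think I could track the compressed size on disk, rather than the in-memory size, by opening the sink myself and checking its position with `tell()` after each write. A rough sketch of what I have in mind (I'm not sure this is the intended way to use `ParquetWriter` with an explicit sink):

   ```py
   import pyarrow as pa
   import pyarrow.parquet as pq

   index = 0
   sink = pa.OSFile(f'output_{index:03}.parquet', 'wb')
   writer = pq.ParquetWriter(sink, table.schema)

   for i in range(300000):
       writer.write_table(table)
       # sink.tell() reflects the compressed bytes written so far (the footer
       # is only written on close, so this is an approximation)
       if sink.tell() >= 500_000_000:  # ~500MB on disk, start a new file
           writer.close()
           sink.close()
           index += 1
           sink = pa.OSFile(f'output_{index:03}.parquet', 'wb')
           writer = pq.ParquetWriter(sink, table.schema)

   writer.close()
   sink.close()
   ```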
   
   This limits the uncompressed bytes fed to the writer, which is cool - I would still have to establish the compression ratio I expect - but I don't know how to achieve something similar when writing a **dataset**, beyond partitioning on some column. If there's a recommended strategy for this, I'd really appreciate it if you could share!
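
   For what it's worth, the closest I've found so far is the row-count knobs on `pyarrow.dataset.write_dataset` (`max_rows_per_file`, `min_rows_per_group`, `max_rows_per_group`), but those work in rows rather than bytes, so I'd still be converting my size target via an estimated row width. A sketch of what I mean (the row counts are placeholders, not recommendations):

   ```py
   import pyarrow.dataset as ds

   # Cap each output file at ~1M rows and keep row groups a consistent size.
   # I'd still need to translate a 500MB target into a row count using an
   # estimated bytes-per-row / compression ratio.
   ds.write_dataset(
       table,
       'output_dataset',
       format='parquet',
       max_rows_per_file=1_000_000,
       min_rows_per_group=100_000,
       max_rows_per_group=100_000,
   )
   ```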
   
   ### Component(s)
   
   Python

