mtofano opened a new issue, #45054: URL: https://github.com/apache/arrow/issues/45054
### Describe the usage question you have. Please include as many useful details as possible.

Hi there, I am using `pyarrow.dataset` to repartition a dataset. My code looks like this:

```py
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

source_dataset = ds.dataset(
    source=files,  # a list of file paths
    filesystem=filesystem,  # an S3FileSystem object
    format="parquet",
    partitioning=ds.partitioning(
        schema=pa.schema(
            fields=[
                ("date", pa.date32()),
                ("ulsym", pa.string())
            ]
        ),
        flavor="hive"
    )
)

# Project and rename columns while scanning the source dataset.
scanner: ds.Scanner = source_dataset.scanner(
    columns={
        "symbol": pc.field("sym"),
        "as_of_time": pc.field("asofTime"),
        "event_time": pc.field("eventTime"),
        "bid": pc.field("bid"),
        "ask": pc.field("ask"),
        "bid_size": pc.field("bsize"),
        "ask_size": pc.field("asize"),
        "ds": pc.field("date"),
        "root_symbol": pc.field("ulsym")
    },
    batch_readahead=64,
    batch_size=1_000_000
)

# Write the scanned batches back out, repartitioned by date only.
ds.write_dataset(
    data=scanner,
    base_dir=out_path,
    filesystem=out_filesystem,
    format="parquet",
    partitioning=ds.partitioning(
        schema=pa.schema(fields=[("ds", pa.date32())]),
        flavor="hive"
    ),
    file_options=ds.ParquetFileFormat().make_write_options(compression="zstd"),
    existing_data_behavior="delete_matching"
)
```

The source dataset is ~1 TB of data, and my machine has 24 cores and 300 GB of RAM. At the moment it takes ~1 hr to write the entire dataset out. Below is a snapshot of my htop output:

*(htop screenshot from the original issue; not reproduced here)*

I find it strange that I am not utilizing more of the CPU and RAM on my machine. Is that expected? How can I optimize this in order to improve I/O performance? Any insights at all are much appreciated. Thank you.

### Component(s)

Parquet, Python
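
---

A minimal sketch of the knobs that commonly govern `ds.write_dataset` throughput, offered as a hedged starting point rather than a confirmed fix. It reuses `scanner`, `out_path`, and `out_filesystem` from the snippet above, and every numeric value below is an illustrative assumption to be tuned, not a recommendation:

```py
import pyarrow as pa
import pyarrow.dataset as ds

# CPU and I/O parallelism in Arrow are controlled by global thread pools.
# The defaults usually track the core count; raising the I/O pool can help
# when S3 latency, not CPU, is the bottleneck. (Values are assumptions.)
pa.set_cpu_count(24)
pa.set_io_thread_count(32)

ds.write_dataset(
    data=scanner,
    base_dir=out_path,
    filesystem=out_filesystem,
    format="parquet",
    partitioning=ds.partitioning(
        schema=pa.schema(fields=[("ds", pa.date32())]),
        flavor="hive"
    ),
    file_options=ds.ParquetFileFormat().make_write_options(compression="zstd"),
    existing_data_behavior="delete_matching",
    # Fewer, larger row groups and files tend to improve object-store
    # throughput; these limits are illustrative guesses for a ~1 TB dataset.
    min_rows_per_group=1_000_000,
    max_rows_per_group=4_000_000,
    max_rows_per_file=64_000_000,
    # Cap concurrently open output files so the writer is not starved by
    # many small partition writers.
    max_open_files=240,
)
```

Whether these settings move the needle depends on where the pipeline is actually stalled (source reads, decode/encode CPU, or upload bandwidth), so it is worth profiling each stage before committing to specific values.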