mtofano opened a new issue, #45054:
URL: https://github.com/apache/arrow/issues/45054

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Hi there,
   
   I am using `pyarrow.dataset` to repartition a dataset. My code looks like this:
   
   ```py
   import pyarrow as pa
   import pyarrow.compute as pc
   import pyarrow.dataset as ds

   source_dataset = ds.dataset(
       source=files,  # a list of file paths
       filesystem=filesystem,  # an S3FileSystem object
       format="parquet",
       partitioning=ds.partitioning(
           schema=pa.schema(
               fields=[
                   ("date", pa.date32()),
                   ("ulsym", pa.string())
               ]
           ),
           flavor="hive"
       )
   )
   
   scanner: ds.Scanner = (
        source_dataset.scanner(
           columns={
               "symbol": pc.field("sym"),
               "as_of_time": pc.field("asofTime"),
               "event_time": pc.field("eventTime"),
               "bid": pc.field("bid"),
               "ask": pc.field("ask"),
               "bid_size": pc.field("bsize"),
               "ask_size": pc.field("asize"),
               "ds": pc.field("date"),
               "root_symbol": pc.field("ulsym")
           }, 
           batch_readahead=64, 
           batch_size=1_000_000
       )
   )
   
   ds.write_dataset(
       data=scanner,
       base_dir=out_path,
       filesystem=out_filesystem,
       format="parquet",
       partitioning=ds.partitioning(
           schema=pa.schema(fields=[("ds", pa.date32())]),
           flavor="hive"
        ),
        file_options=ds.ParquetFileFormat().make_write_options(compression="zstd"),
       existing_data_behavior="delete_matching"
   )
   ```
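
   One knob I have been experimenting with on the write side is row-group sizing, since many small incoming batches can produce many small row groups. Below is a minimal local sketch (toy data and illustrative values, not my real schema or tuning) of passing `min_rows_per_group` / `max_rows_per_group` to `ds.write_dataset`:

   ```py
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.dataset as ds

   # Toy stand-in for the real dataset: two hive partitions on "ds".
   table = pa.table({"ds": [1, 1, 2, 2], "bid": [1.0, 2.0, 3.0, 4.0]})

   out_path = tempfile.mkdtemp()
   ds.write_dataset(
       data=table,
       base_dir=out_path,
       format="parquet",
       partitioning=ds.partitioning(pa.schema([("ds", pa.int64())]), flavor="hive"),
       # Coalesce small batches into larger row groups before writing
       # (these values are illustrative, not tuned):
       min_rows_per_group=2,
       max_rows_per_group=1_000_000,
   )

   print(sorted(os.listdir(out_path)))  # one hive directory per "ds" value
   ```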
   
   The source dataset is ~1TB of data. 
   
   I have 24 cores and 300 GB of RAM on my machine. How can I optimize this to
   improve I/O performance? At the moment it takes ~1 hour to write the entire
   dataset out. Below is a snapshot of my htop output:
   
   
![image](https://github.com/user-attachments/assets/5b45d70b-a06f-41b0-a110-dfdc617e3abb)
   
   I find it strange that I am not utilizing more CPU and RAM on my machine. Is
   that expected? How can I tune this in order to improve I/O performance?
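
   For context, here is how I am inspecting Arrow's thread-pool sizes; this is a small sketch assuming the global CPU and I/O pools are what drive scan parallelism here (the value 32 is just an example, not a recommendation):

   ```py
   import pyarrow as pa

   # Arrow maintains two global thread pools: one for compute, one for I/O.
   print(pa.cpu_count())        # compute threads (defaults to hardware cores)
   print(pa.io_thread_count())  # I/O threads used for filesystem/S3 reads

   # For S3-bound scans, raising the I/O pool size may help (an assumption
   # I have not verified on this workload):
   pa.set_io_thread_count(32)
   ```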
   
   Any insights at all are much appreciated! Thank you.
   
   
   ### Component(s)
   
   Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
