andygrove commented on PR #1600:
URL: https://github.com/apache/datafusion-ballista/pull/1600#issuecomment-4331623661

   Results from running on a Linux workstation:
   
   ### sort-based shuffle
   
   ```
   andy@woody:~/git/apache/datafusion-ballista$ ./target/release/shuffle_bench \
       --input /mnt/bigdata/tpch/sf100/lineitem.parquet/ \
       --writer sort --partitioning hash --partitions 200 --hash-columns 0,3 \
       --memory-limit 8589934592 --limit 10000000 --warmup 1 --iterations 1
   === Ballista Shuffle Benchmark ===
   Writer:         Sort
   Partitioning:   Hash
   Input:          /mnt/bigdata/tpch/sf100/lineitem.parquet/
   Schema:         16 cols (3xdate, 4xdecimal, 4xint, 5xstring)
   Total rows:     10000000
   Partitions:     200
   Batch size:     8192
   Memory limit:   8589934592 bytes
   Iterations:     1 (warmup 1)
   
     [warmup 1/1] write: 10.105s
     [iter 1/1] write: 7.998s
   
   === Results ===
   avg time: 7.998s
   throughput: 1250288 rows/s (total across 1 tasks)
   
   Shuffle metrics (last iteration):
     output_rows: 10000000
     write_time: 3.065s (38.3%)
     input_rows: 10000000
     repart_time: 1.561s (19.5%)
     spill_time: 3.053s (38.2%)
     spill_count: 5863
     spill_bytes: 9195242224
   ```
   
   ### hash-based shuffle
   
   ```
   andy@woody:~/git/apache/datafusion-ballista$ ./target/release/shuffle_bench \
       --input /mnt/bigdata/tpch/sf100/lineitem.parquet/ \
       --writer hash --partitioning hash --partitions 200 --hash-columns 0,3 \
       --memory-limit 8589934592 --limit 10000000 --warmup 1 --iterations 1
   === Ballista Shuffle Benchmark ===
   Writer:         Hash
   Partitioning:   Hash
   Input:          /mnt/bigdata/tpch/sf100/lineitem.parquet/
   Schema:         16 cols (3xdate, 4xdecimal, 4xint, 5xstring)
   Total rows:     10000000
   Partitions:     200
   Batch size:     8192
   Memory limit:   8589934592 bytes
   Iterations:     1 (warmup 1)
   
     [warmup 1/1] write: 508.480s
   ```
   
   I killed this run because the warmup alone took about 50x longer than with sort-based shuffle (508.480s vs. 10.105s).
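
For anyone who wants to sanity-check the figures, here is a minimal sketch that recomputes the slowdown, the throughput, and the metric percentages from the numbers printed in the logs. The helper names (`slowdown`, `throughput`, `share`) are made up for illustration and are not part of `shuffle_bench`; the small gap between the recomputed and reported throughput is presumably because the benchmark divides by the unrounded elapsed time.

```python
# Figures copied from the benchmark output; nothing here is measured anew.
ROWS = 10_000_000
SORT_WARMUP_S = 10.105
SORT_ITER_S = 7.998
HASH_WARMUP_S = 508.480
REPORTED_THROUGHPUT = 1_250_288  # rows/s, as printed by the sort-based run


def slowdown(slow_s: float, fast_s: float) -> float:
    """How many times longer the slow run took than the fast one."""
    return slow_s / fast_s


def throughput(rows: int, seconds: float) -> float:
    """Rows processed per second."""
    return rows / seconds


def share(part_s: float, total_s: float) -> float:
    """A metric's time as a percentage of the iteration time."""
    return 100.0 * part_s / total_s


if __name__ == "__main__":
    print(f"hash vs. sort warmup: {slowdown(HASH_WARMUP_S, SORT_WARMUP_S):.1f}x")
    print(f"sort throughput: {throughput(ROWS, SORT_ITER_S):.0f} rows/s "
          f"(reported: {REPORTED_THROUGHPUT})")
    for name, secs in [("write_time", 3.065),
                       ("repart_time", 1.561),
                       ("spill_time", 3.053)]:
        print(f"{name}: {share(secs, SORT_ITER_S):.1f}%")
```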


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

