andygrove commented on PR #1600:
URL:
https://github.com/apache/datafusion-ballista/pull/1600#issuecomment-4331623661
Results from running on a Linux workstation:
### sort-based shuffle
```
andy@woody:~/git/apache/datafusion-ballista$ ./target/release/shuffle_bench --input /mnt/bigdata/tpch/sf100/lineitem.parquet/ --writer sort --partitioning hash --partitions 200 --hash-columns 0,3 --memory-limit 8589934592 --limit 10000000 --warmup 1 --iterations 1
=== Ballista Shuffle Benchmark ===
Writer: Sort
Partitioning: Hash
Input: /mnt/bigdata/tpch/sf100/lineitem.parquet/
Schema: 16 cols (3xdate, 4xdecimal, 4xint, 5xstring)
Total rows: 10000000
Partitions: 200
Batch size: 8192
Memory limit: 8589934592 bytes
Iterations: 1 (warmup 1)
[warmup 1/1] write: 10.105s
[iter 1/1] write: 7.998s
=== Results ===
avg time: 7.998s
throughput: 1250288 rows/s (total across 1 tasks)
Shuffle metrics (last iteration):
output_rows: 10000000
write_time: 3.065s (38.3%)
input_rows: 10000000
repart_time: 1.561s (19.5%)
spill_time: 3.053s (38.2%)
spill_count: 5863
spill_bytes: 9195242224
```
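As a sanity check on the sort-based numbers, the derived figures can be recomputed from the raw values in the output above. This is a minimal sketch, not part of `shuffle_bench`: the helper names are hypothetical, and the percentages are assumed to be each metric's share of the 7.998s average time.

```python
# Values copied from the sort-based run above.
rows = 10_000_000
avg_time_s = 7.998
memory_limit_bytes = 8_589_934_592
spill_bytes = 9_195_242_224
spill_count = 5863
metrics_s = {"write_time": 3.065, "repart_time": 1.561, "spill_time": 3.053}


def throughput(row_count, seconds):
    """Rows processed per second."""
    return row_count / seconds


def share_of_total(part_s, total_s):
    """Percentage of total time spent in one metric (assumed definition)."""
    return 100.0 * part_s / total_s


def bytes_to_gib(n):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n / 2**30


# The displayed 7.998s is rounded, so this only approximates the reported rows/s.
print(f"throughput: {throughput(rows, avg_time_s):.0f} rows/s")
for name, seconds in metrics_s.items():
    print(f"{name}: {share_of_total(seconds, avg_time_s):.1f}%")
print(f"memory limit: {bytes_to_gib(memory_limit_bytes):.2f} GiB")
print(f"spilled: {bytes_to_gib(spill_bytes):.2f} GiB over {spill_count} spills")
print(f"average spill size: {spill_bytes / spill_count / 2**20:.2f} MiB")
```

The recomputed throughput differs slightly from the reported 1250288 rows/s because the printed average time is rounded to milliseconds. The memory limit works out to exactly 8 GiB, while the run spilled roughly 8.56 GiB in total across 5863 spills.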
### hash-based shuffle
```
andy@woody:~/git/apache/datafusion-ballista$ ./target/release/shuffle_bench --input /mnt/bigdata/tpch/sf100/lineitem.parquet/ --writer hash --partitioning hash --partitions 200 --hash-columns 0,3 --memory-limit 8589934592 --limit 10000000 --warmup 1 --iterations 1
=== Ballista Shuffle Benchmark ===
Writer: Hash
Partitioning: Hash
Input: /mnt/bigdata/tpch/sf100/lineitem.parquet/
Schema: 16 cols (3xdate, 4xdecimal, 4xint, 5xstring)
Total rows: 10000000
Partitions: 200
Batch size: 8192
Memory limit: 8589934592 bytes
Iterations: 1 (warmup 1)
[warmup 1/1] write: 508.480s
```
I killed this run because the warmup alone took about 50x longer than the sort-based shuffle warmup (508.480s vs. 10.105s).
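The comparison between the two warmup passes can be sketched as follows. Only the two warmup times and the row limit come from the output above; the variable names are illustrative. Since the hash-based run was killed before any measured iteration, this compares warmup to warmup only.

```python
# Warmup times copied from the two runs above.
rows = 10_000_000
sort_warmup_s = 10.105
hash_warmup_s = 508.480

# Ratio of the hash-based warmup to the sort-based warmup.
slowdown = hash_warmup_s / sort_warmup_s

# Rows per second implied by each warmup pass.
sort_rows_per_s = rows / sort_warmup_s
hash_rows_per_s = rows / hash_warmup_s

print(f"hash warmup is {slowdown:.1f}x slower than sort warmup")
print(f"sort warmup: {sort_rows_per_s:.0f} rows/s")
print(f"hash warmup: {hash_rows_per_s:.0f} rows/s")
```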