andygrove commented on PR #1600:
URL: https://github.com/apache/datafusion-ballista/pull/1600#issuecomment-4331681814
Comet run for comparison (same workstation):
```
$ ./target/release/shuffle_bench --input /mnt/bigdata/tpch/sf100/lineitem.parquet/ \
    --partitioning hash --partitions 200 --hash-columns 0,3 \
    --memory-limit 8589934592 --limit 10000000 --warmup 1 --iterations 1
=== Shuffle Benchmark ===
Input: /mnt/bigdata/tpch/sf100/lineitem.parquet/
Schema: 16 columns (3xdate, 4xdecimal, 4xint, 5xstring)
Total rows: 10,000,000
Batch size: 8,192
Partitioning: hash
Partitions: 200
Codec: Lz4Frame
Hash columns: [0, 3]
Memory limit: 8.00 GiB
Iterations: 1 (warmup: 1)
[warmup 1/1] write: 5.310s output: 660.89 MiB
[iter 1/1] write: 5.479s output: 660.90 MiB
=== Results ===
Write:
  avg time: 5.479s
  throughput: 1,825,306 rows/s (total across 1 tasks)
  output size: 660.90 MiB
Input Metrics (last iteration):
  elapsed_compute: 0.000s
  output_rows: 20,534,912
  output_bytes: 6.43 GiB
  time_elapsed_scanning_total: 10.473s
  output_batches: 2,507
  metadata_load_time: 0.002s
  page_index_eval_time: 0.000s
  bytes_scanned: 1.22 GiB
  time_elapsed_scanning_until_data: 4.651s
  time_elapsed_opening: 0.032s
  time_elapsed_processing: 3.020s
  bloom_filter_eval_time: 0.000s
  row_pushdown_eval_time: 0.000s
  statistics_eval_time: 0.000s
Shuffle Metrics (last iteration):
  input batches: 1,221
  repart time: 0.112s (2.0%)
  encode time: 2.091s (38.2%)
  write time: 0.278s (5.1%)
  data size: 3.14 GiB
```
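For context, the hash partitioning exercised by `--partitioning hash --hash-columns 0,3 --partitions 200` can be sketched in plain Rust: hash the values of the key columns for each row and reduce the hash modulo the partition count, so that rows with equal keys always land in the same shuffle partition. This is a minimal illustration using the standard library's `DefaultHasher`, not the actual Comet or Ballista implementation (both operate on Arrow batches and use their own hash functions); the function and column names below are assumptions for the sketch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assign a row to one of `num_partitions` buckets by hashing its key
/// columns (analogous to `--hash-columns 0,3` selecting two columns of
/// each lineitem row). Equal keys always map to the same partition.
fn partition_for_row(key_values: &[&str], num_partitions: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    for v in key_values {
        v.hash(&mut hasher);
    }
    hasher.finish() % num_partitions
}

fn main() {
    // Two rows with identical key columns must land in the same partition,
    // and every partition index must be in range [0, 200).
    let p1 = partition_for_row(&["orderkey=1", "linenumber=4"], 200);
    let p2 = partition_for_row(&["orderkey=1", "linenumber=4"], 200);
    assert_eq!(p1, p2);
    assert!(p1 < 200);
    println!("partition = {p1}");
}
```

In the benchmarked path this per-row assignment is what the "repart time" metric covers, while "encode time" (the dominant 38.2% above) is spent serializing the partitioned batches with the Lz4Frame codec before they are written out.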
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]