andygrove opened a new pull request, #3909: URL: https://github.com/apache/datafusion-comet/pull/3909
## Which issue does this PR close?

Relates to #3882.

## Rationale for this change

Issue #3882 reports that Comet shuffle files can be significantly larger than Spark shuffle files due to per-batch Arrow IPC format overhead. To investigate and measure this, we need a benchmark that compares actual shuffle write bytes between Spark and Comet.

## What changes are included in this PR?

Adds a `shuffle-size` PySpark benchmark that:

- Runs a scan → repartition → write pipeline
- Queries the Spark REST API to report shuffle write bytes and bytes/record
- Integrates with the existing benchmark framework (`run_benchmark.py`)
- Includes a convenience shell script (`run_shuffle_size_benchmark.sh`) that runs the benchmark in both Spark and Comet native modes for easy comparison

Usage:

```sh
# Generate test data
$SPARK_HOME/bin/spark-submit benchmarks/pyspark/generate_data.py --output /tmp/data --rows 200000000

# Run comparison
./benchmarks/pyspark/run_shuffle_size_benchmark.sh /tmp/data
```

## How are these changes tested?

This is a benchmark script, not production code. Tested manually by running the benchmark with 204M rows (7 string + 1 timestamp columns) and comparing Spark vs Comet shuffle write sizes.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
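For readers unfamiliar with how shuffle write bytes can be collected, the REST-API measurement the benchmark performs can be sketched as follows. This is a minimal illustration, not the PR's actual code: it assumes the Spark monitoring REST API's `/api/v1/applications/{app-id}/stages` endpoint, whose stage records include `shuffleWriteBytes` and `shuffleWriteRecords` fields; the helper names are hypothetical.

```python
import json
import urllib.request


def summarize_stages(stages):
    """Aggregate shuffle write metrics from a list of Spark stage records.

    Each record is a dict as returned by the Spark monitoring REST API;
    stages with no shuffle output simply contribute zero.
    """
    total_bytes = sum(s.get("shuffleWriteBytes", 0) for s in stages)
    total_records = sum(s.get("shuffleWriteRecords", 0) for s in stages)
    bytes_per_record = total_bytes / total_records if total_records else 0.0
    return total_bytes, total_records, bytes_per_record


def shuffle_write_stats(ui_url, app_id):
    """Fetch all stages for an application from the Spark UI's REST API
    (e.g. ui_url = "http://localhost:4040") and summarize shuffle writes."""
    url = f"{ui_url}/api/v1/applications/{app_id}/stages"
    with urllib.request.urlopen(url) as resp:
        return summarize_stages(json.load(resp))
```

Comparing the `bytes_per_record` figure between a plain Spark run and a Comet run is what surfaces the per-batch Arrow IPC overhead described in #3882.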
