andygrove opened a new pull request, #3909:
URL: https://github.com/apache/datafusion-comet/pull/3909

   ## Which issue does this PR close?
   
   Relates to #3882.
   
   ## Rationale for this change
   
   Issue #3882 reports that Comet shuffle files can be significantly larger 
than Spark shuffle files due to per-batch Arrow IPC format overhead. To 
investigate and measure this, we need a benchmark that compares actual shuffle 
write bytes between Spark and Comet.
   
   ## What changes are included in this PR?
   
   Adds a `shuffle-size` PySpark benchmark that:
   - Runs a scan → repartition → write pipeline
   - Queries the Spark REST API to report shuffle write bytes and bytes/record
   - Integrates with the existing benchmark framework (`run_benchmark.py`)
   - Includes a convenience shell script (`run_shuffle_size_benchmark.sh`) that 
runs the benchmark in both Spark and Comet native modes for easy comparison
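
   For reference, the Spark REST API stage payload (`/api/v1/applications/{app-id}/stages`) exposes `shuffleWriteBytes` and `shuffleWriteRecords` per stage. A minimal sketch of how the metric collection might aggregate them; the helper name and the sample payload below are invented for illustration and may not match the actual benchmark code:

```python
def shuffle_write_stats(stages):
    """Aggregate shuffle write metrics across stage records shaped like
    the Spark REST API response from
    /api/v1/applications/{app-id}/stages (fetch with urllib against the
    driver UI, e.g. http://localhost:4040, in a live run)."""
    total_bytes = sum(s.get("shuffleWriteBytes", 0) for s in stages)
    total_records = sum(s.get("shuffleWriteRecords", 0) for s in stages)
    bytes_per_record = total_bytes / total_records if total_records else 0.0
    return total_bytes, total_records, bytes_per_record

# Sample payload with invented values, shaped like the REST API response
stages = [
    {"stageId": 0, "shuffleWriteBytes": 1048576, "shuffleWriteRecords": 10000},
    {"stageId": 1, "shuffleWriteBytes": 0, "shuffleWriteRecords": 0},
]
total, records, bpr = shuffle_write_stats(stages)
print(total, records, round(bpr, 4))  # 1048576 10000 104.8576
```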
   
   Usage:
   ```sh
   # Generate test data
   $SPARK_HOME/bin/spark-submit benchmarks/pyspark/generate_data.py \
     --output /tmp/data --rows 200000000
   
   # Run comparison
   ./benchmarks/pyspark/run_shuffle_size_benchmark.sh /tmp/data
   ```
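
   The wrapper script essentially runs the same benchmark twice, once with plain Spark and once with Comet's native shuffle enabled. A dry-run sketch of that logic is below; the benchmark script name is invented for illustration, and the Comet config keys are the standard ones (`spark.plugins`, `spark.comet.enabled`, `spark.comet.exec.shuffle.enabled`, `spark.comet.exec.shuffle.mode`), which the actual script may set differently:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a wrapper like run_shuffle_size_benchmark.sh.
# Prints the two spark-submit invocations (dry run) rather than executing.
set -euo pipefail

DATA_PATH="${1:-/tmp/data}"
SPARK_SUBMIT="${SPARK_HOME:-/opt/spark}/bin/spark-submit"

# Baseline: plain Spark shuffle (shuffle_size.py is an invented name)
SPARK_CMD=("$SPARK_SUBMIT"
  benchmarks/pyspark/shuffle_size.py --data "$DATA_PATH")

# Comet: enable the plugin and native shuffle
COMET_CMD=("$SPARK_SUBMIT"
  --conf spark.plugins=org.apache.spark.CometPlugin
  --conf spark.comet.enabled=true
  --conf spark.comet.exec.shuffle.enabled=true
  --conf spark.comet.exec.shuffle.mode=native
  benchmarks/pyspark/shuffle_size.py --data "$DATA_PATH")

echo "spark: ${SPARK_CMD[*]}"
echo "comet: ${COMET_CMD[*]}"
```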
   
   ## How are these changes tested?
   
   This is a benchmark script, not production code. Tested manually by running 
the benchmark with 204M rows (7 string + 1 timestamp columns) and comparing 
Spark vs Comet shuffle write sizes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

