andygrove opened a new pull request, #3869: URL: https://github.com/apache/datafusion-comet/pull/3869
## Which issue does this PR close?

Related to investigating off-heap memory usage in Comet vs Spark.

## Rationale for this change

When running TPC-H at 1TB scale, Comet requires significantly more off-heap memory than expected (32GB+ vs 2GB for Gluten). We need tooling to measure and isolate the cause.

## What changes are included in this PR?

- `benchmarks/tpc/memory-profile.sh`: Script that runs each TPC-H query individually under different configurations (Spark-only baseline, Comet with varying offHeap sizes) in local mode, wrapping each run with `/usr/bin/time -l` to capture peak RSS. Outputs a CSV for easy comparison.
- `docs/memory-analysis.md`: Analysis document investigating why Comet needs more off-heap memory, covering memory pool architecture, untracked memory sources, comparison with Gluten's approach, and proposed fixes.

## How are these changes tested?

These are developer tools and documentation, not production code. The script has been validated locally against TPC-H SF100.
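For readers who want to post-process the captured timings, here is a minimal Python sketch of pulling peak RSS out of `/usr/bin/time -l` output and emitting CSV rows. It is not the code in `memory-profile.sh`: the function names and the CSV column names (`query`, `config`, `peak_rss`) are hypothetical, and it assumes the BSD/macOS output format, where the relevant line reads `<number>  maximum resident set size`. The value is returned as printed, and its unit depends on the platform.

```python
import csv
import io
import re
from typing import Iterable, Optional, Tuple

# Assumed BSD/macOS `/usr/bin/time -l` line, e.g.:
# "           104857600  maximum resident set size"
_RSS_LINE = re.compile(r"^\s*(\d+)\s+maximum resident set size\s*$", re.MULTILINE)


def parse_peak_rss(time_output: str) -> Optional[int]:
    """Return the peak RSS value found in `time -l` output, or None if absent."""
    match = _RSS_LINE.search(time_output)
    return int(match.group(1)) if match else None


def rows_to_csv(rows: Iterable[Tuple[str, str, int]]) -> str:
    """Render (query, config, peak_rss) rows as CSV text with a header line.

    The column names are illustrative and not taken from the PR's script.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["query", "config", "peak_rss"])
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()
```

A caller would run one query per configuration, feed the captured stderr to `parse_peak_rss`, and collect the results with `rows_to_csv` so the Spark-only and Comet runs can be compared side by side.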
