Omega359 commented on PR #21707: URL: https://github.com/apache/datafusion/pull/21707#issuecomment-4296727752
> One way to potentially make this easier to review is by narrowing it down to only targeting the basic framework and tpch q1 (maybe in a follow on PR) and splitting documentation into current and follow on work, 🤷🏻‍♂️

Most of the logic is in sql_benchmark.rs, and almost 2k lines of that file are tests; I can't really cut that down.

> ## What is the end state
>
> My biggest question is how we see the end state of this project. Do we envision all benchmarks going through this new benchmark binary?

Think of it as the equivalent of .slt tests for benchmarks, with better statistics because of the use of criterion. All of the existing benchmarks that can be redone as completely SQL-based will be. I've got ports of clickbench, clickbench_extended, tpch, h2o, imdb, hj, nlj, smj, and tpcds (even taxi, which df doesn't currently have) waiting in https://github.com/Omega359/arrow-datafusion/tree/sql_benchmark/benchmarks/sql_benchmarks. Some benchmarks cannot easily be converted and will remain (cancellation, for example).

> ## Usability Suggestions
>
> While I think the `cargo bench` / criterion integration is nice for integrating into benchmark tools, I think the API in this PR that mixes environment variables and criterion will be hard to use (e.g. there is no `--help`)
>
> I would love to see at least some more "standard" CLI that took arguments -- similar to DuckDB's `benchmark_runner` binary -- maybe we could share the engine between that binary and the one that has a cargo interface

So ... bench.sh? It's the current entrypoint, and I don't see this PR changing that.

> I also suggest making the .benchmark files more self contained (even if it means more repetition) -- see suggestions below

Sure. These files are fairly direct ports of what DuckDB has; they didn't seem all that complex to me.

> ## Lots of new (duplicate) code / tests
>
> This seems like this PR adds _A LOT_ of new code. If this is a temporary state (aka we can remove the other dfbench implementations after porting) this is fine, but I do think we should track the work necessary to remove the duplication
>
> For example, do we plan to remove the current `dfbench` binary? Can we remove that (after we have ported over the other benchmarks)? Also, this is like the 3rd copy of the TPCH queries we have in the repo. Can we remove the others as well?

Yes, if this PR is accepted I plan on creating a number of follow-up PRs: adding migrations for the other benchmarks, removing the current dfbench benchmarks after each one is ported, improvements to bench.sh, etc. I see this PR as the initial step in migrating the benchmarks from being code-based to being primarily SQL(-ish) based.

It's a lot of code, yes, but it's the minimum needed to parse and run a DuckDB-style benchmark. Much of it is actually just tests.
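For context, the env-var-plus-criterion shape being discussed above looks roughly like the following. This is only a minimal sketch with made-up names -- `BENCHMARK_FILE` and the `sql_benchmarks` target here are illustrative, not the exact API in sql_benchmark.rs:

```rust
// Minimal sketch of a criterion bench target driven by an environment
// variable. Names (BENCHMARK_FILE, sql_benchmarks) are illustrative only.
use std::env;

use criterion::{criterion_group, criterion_main, Criterion};

fn sql_benchmarks(c: &mut Criterion) {
    // The .benchmark file to run is selected via an env var rather than a
    // CLI flag, which is the usability concern raised above (no `--help`,
    // no argument parsing).
    let benchmark_file = env::var("BENCHMARK_FILE")
        .unwrap_or_else(|_| "sql_benchmarks/tpch/q01.benchmark".to_string());

    c.bench_function(&format!("sql_benchmark/{benchmark_file}"), |b| {
        b.iter(|| {
            // Parse the .benchmark file and execute its queries here
            // (elided; in the PR this logic lives in sql_benchmark.rs).
        });
    });
}

criterion_group!(benches, sql_benchmarks);
criterion_main!(benches);
```

Invocation would then be something along the lines of `BENCHMARK_FILE=... cargo bench --bench sql_benchmark`, which is what bench.sh wraps today.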
