Omega359 commented on PR #21707:
URL: https://github.com/apache/datafusion/pull/21707#issuecomment-4296727752

   > One way to potentially make this easier to review is by narrowing it down 
to only targeting the basic framework and tpch q1 (maybe in a follow on PR) and 
splitting documentation into current and follow on work,
   
   🤷🏻‍♂️ Most of the logic is in sql_benchmark.rs, and almost 2k lines of that file are tests. I can't really cut that down much.
   
   > ## What is the end state
   > 
   > My biggest question is how we see the end state of this project. Do we 
envision all benchmarks go through this new benchmark binary?
   
   Think of it as the equivalent of .slt tests for benchmarks, with better statistics thanks to criterion. All the existing benchmarks that can be redone to be completely SQL-based will be. I've got ports of clickbench, clickbench_extended, tpch, h2o, imdb, hj, nlj, smj, and tpcds (and even taxi, which DataFusion doesn't currently have) waiting in 
https://github.com/Omega359/arrow-datafusion/tree/sql_benchmark/benchmarks/sql_benchmarks. Some benchmarks cannot easily be converted and will remain as they are (cancellation, for example).
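   
   To make that comparison concrete: the PR's actual harness code isn't reproduced here, but below is a minimal, hypothetical sketch of what timing a DataFusion SQL query with criterion can look like. The parquet path, the simplified stand-in for TPC-H Q1, and the benchmark id are illustrative assumptions, not code from the PR:
   
   ```rust
   use criterion::{criterion_group, criterion_main, Criterion};
   use datafusion::prelude::*;
   use tokio::runtime::Runtime;
   
   fn bench_sql(c: &mut Criterion) {
       let rt = Runtime::new().unwrap();
   
       // Hypothetical setup: register a parquet file as a table before timing.
       let ctx = SessionContext::new();
       rt.block_on(ctx.register_parquet(
           "lineitem",
           "data/lineitem.parquet",
           ParquetReadOptions::default(),
       ))
       .unwrap();
   
       // criterion collects many samples and reports mean/median/outliers,
       // which is the "better statistics" advantage mentioned above.
       c.bench_function("tpch_q1_sql", |b| {
           b.iter(|| {
               rt.block_on(async {
                   ctx.sql("SELECT l_returnflag, count(*) FROM lineitem GROUP BY l_returnflag")
                       .await
                       .unwrap()
                       .collect()
                       .await
                       .unwrap()
               })
           })
       });
   }
   
   criterion_group!(benches, bench_sql);
   criterion_main!(benches);
   ```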
   
   > ## Usability Suggestions
   > 
   > While I think the `cargo bench` / criterion integration is nice for 
integrating into benchmark tools, I think the API in this PR that mixes 
environment variables and criterion will be hard to use (e.g. there is no 
`--help`)
   > 
   > I would love to see at least some more "standard" CLI that took arguments 
-- similar to DuckDB's `benchmark_runner` binary -- maybe we could share the
engine between that binary and the one that has a cargo interface
   
   So ... bench.sh? It's the current entry point, and I don't see this PR changing that.
   
   > I also suggest making the .benchmark files more self contained (even if it means more repetition) -- see suggestions below
   
   Sure. These files are fairly direct ports of what DuckDB has; they didn't seem all that complex to me.
   
   > ## Lots of new (duplicate) code / tests
   > 
   > This seems like this PR adds _A LOT_ of new code. If this is a temporary 
state (aka we can remove the other dfbench implementations after porting) this 
is fine, but I do think we should track the work necessary to remove the 
duplication
   > 
   > For example, do we plan to remove the current `dfbench` binary? Can we 
remove that (after we have ported over the other benchmarks)? Also this is like 
the 3rd copy of the TPCH queries we have in the repo. Can we remove the others as well?
   
   Yes, if this PR is accepted I plan on creating a number of follow-up PRs: adding migrations for the other benchmarks, removing the current dfbench implementation for each benchmark once its port is merged, improvements to bench.sh, etc.
   
   I see this PR as the initial step in migrating the benchmarks from being code based to being primarily SQL (ish) based. It's a lot of code, yes, but it's the minimum needed to parse and run a DuckDB-style benchmark, and much of it is actually just tests.
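   
   For anyone unfamiliar with the format, a DuckDB-style .benchmark file is essentially a small set of named sections of SQL. The PR's exact grammar isn't reproduced here; the following is a rough parsing sketch assuming hypothetical `load`/`run` section keywords and ignoring details such as metadata, templating, and expected results:
   
   ```rust
   /// Hypothetical, simplified model of a DuckDB-style .benchmark file:
   /// '#' lines are metadata/comments, and bare keywords such as `load` and
   /// `run` introduce the SQL statements that follow them. The real grammar
   /// in this PR may differ.
   #[derive(Debug, Default)]
   struct BenchmarkFile {
       load: Vec<String>, // SQL executed once during setup
       run: Vec<String>,  // SQL executed and timed by the harness
   }
   
   #[derive(Clone, Copy)]
   enum Section {
       None,
       Load,
       Run,
   }
   
   fn parse_benchmark(text: &str) -> BenchmarkFile {
       let mut file = BenchmarkFile::default();
       let mut section = Section::None;
   
       for raw in text.lines() {
           let line = raw.trim();
           if line.is_empty() || line.starts_with('#') {
               continue; // skip blanks and comment/metadata lines
           }
           match line {
               "load" => section = Section::Load,
               "run" => section = Section::Run,
               sql => match section {
                   Section::Load => file.load.push(sql.to_string()),
                   Section::Run => file.run.push(sql.to_string()),
                   Section::None => {} // ignore anything before the first section
               },
           }
       }
       file
   }
   
   fn main() {
       // Tiny, made-up example input.
       let text = "# name: hypothetical_q1\nload\nCREATE TABLE t AS SELECT 1 AS a;\nrun\nSELECT count(*) FROM t;\n";
       let parsed = parse_benchmark(text);
       println!("{} load stmt(s), {} run stmt(s)", parsed.load.len(), parsed.run.len());
   }
   ```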
   
   
   

