Omega359 commented on code in PR #21707: URL: https://github.com/apache/datafusion/pull/21707#discussion_r3127122854
########## benchmarks/sql_benchmarks/README.md: ########## @@ -0,0 +1,355 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# SQL Benchmarks + +This directory contains a collection of benchmarks each driven by a simple '.benchmark' text file and sql queries +that exercise the DataFusion execution engine against a variety of benchmark suites. The sql benchmark framework +is intentionally simple so that new benchmarks and queries can be added without touching the core engine. + +The sql benchmarks are organized in sub‑directories that correspond to the benchmark suites that are commonly used +in the community: + +| Benchmark Suite | Description | +|-----------------------|--------------------------------------------------------------------| +| `clickbench` | ClickBench benchmark | +| `clickbench extended` | 12 additional, more complex queries against the Clickbench dataset | +| `clickbench_sorted` | ClickBench benchmark using a pre-sorted hits file. | +| `h2o` | The `h2o` benchmark | +| `hj` | Hash join benchmark | +| `imdb` | IMDb benchmark | +| `nlj` | Nested‑loop join benchmark | +| `smj` | Sort‑merge join benchmark | +| `sort tpch` | Sorting benchmarks against the TPC-H lineitem table | +| `taxi` | NYC taxi dataset benchmark | +| `tpcds` | TPC‑DS queries | +| `tpch` | TPC‑H queries | + +## How it works + +SQL benchmarks are run via cargo's bench command using [criterion](https://docs.rs/criterion/latest/criterion/) +for running and gathering statistics of each sql being benchmarked. For simplicity the benchmarks/bench.sh can +be used to execute the supported benchmarks. + +Each individual benchmark is represented by a `<name>.benchmark` file that contains a number of directives instructing +the tool on how to load data, run initializations, run assertions, run the benchmark, optionally persist and +validate results, and finally run any cleanup if required. + +Variables are supported in two forms: + +* string substitution based on environment variables (with default values if unset): \${ENV_VAR} and + \${ENV_VAR:-default}. +* if / else based on whether an environment variable is true or not + (\${ENV_VAR:-default|true value|false value}). In this form only the value `true` (case-insensitive) selects the + true branch; any other set value selects the false branch. If ENV_VAR is unset, `default` is used to select the + branch. + +Comments in files are supported with lines starting with # or --. + +Many if not most of the benchmarks are set up using templates to reduce duplication across the .benchmark files. For +example here is one of the benchmark files for the h2o benchmark suite: + +``` +subgroup groupby + +template sql_benchmarks/h2o/h2o.benchmark.template +QUERY_NUMBER=1 +QUERY_NUMBER_PADDED=01 +``` + +The template directive above defines the subgroup the benchmark is part of, sets two variables (`QUERY_NUMBER` and +`QUERY_NUMBER_PADDED`) and points to a file containing more directives that are shared across the benchmark suite. + +``` +load sql_benchmarks/h2o/init/load_${BENCH_SUBGROUP:-groupby}_${BENCH_SIZE:-small}_${BENCH_FILE_TYPE:-csv}.sql + +name Q${QUERY_NUMBER_PADDED} +group h2o + +run sql_benchmarks/h2o/queries/${BENCH_SUBGROUP:-groupby}/q${QUERY_NUMBER_PADDED}.sql + +result sql_benchmarks/h2o/results/${BENCH_SUBGROUP:-groupby}/${BENCH_SIZE:-small}/q${QUERY_NUMBER_PADDED}.csv +``` + +The above showcases the use of defaults for variables: `${NAME:-default}` + +# Directives + +<table> +<tr><th>Directive</th><th>Description</th></tr> +<tr> +<td>name</td> +<td> + +The name of the benchmark. This will be used as part of the display name used by criterion.<br/><br/>Example:<br/> +<blockquote>name Q${QUERY_NUMBER_PADDED}</blockquote> + +The `name` directive also makes the value available to benchmark-file replacements as `BENCH_NAME`. This is separate +from the `BENCH_NAME` environment variable used to select which benchmark group to run. + +</td> +</tr> +<tr> +<td>group</td> +<td> + +The group name of the benchmark used for grouping benchmarks together.<br/><br/>Example:<br/> +<blockquote>group imdb</blockquote> + +</td> +</tr> +<tr> +<td>subgroup</td> +<td> + +The sub group name of the benchmark used for filtering to a specific sub group.<br/><br/>Example:<br/> +<blockquote>subgroup window</blockquote> + +</td> +</tr> +<tr> +<td>load</td> +<td> + +The load directive called during initialization of the benchmark. If a path to a file is provided on the same +line as the load directive that path will be parsed and any sql statements in that file will be executed during +initialization. If no path is specified the next line is required to be the sql statement to execute. <br/> <br/> The +load directive (including any following sql statement) must be followed by a blank line. <br/><br/>Example:<br/> +<blockquote>load sql_benchmarks/h2o/init/load_${BENCH_SUBGROUP:-groupby}_${BENCH_SIZE:-small}_${BENCH_FILE_TYPE:-csv}.sql</blockquote> +or +<blockquote> +load<br/> +CREATE TABLE test AS (SELECT value as key FROM range(1000000) ORDER BY value); +</blockquote> + +</td> +</tr> +<tr> +<td>init</td> +<td> + +The init directive is called after the load directive prior to benchmark execution. If a path to a file is +provided on the same line as the init directive that path will be parsed and any sql statements in that file will be +executed during the benchmark initialization. If no path is specified the next line is required to be the sql statement +to execute.<br/><br/> The init directive (including any following sql statement) must be followed by a blank +line.<br/><br/>Example:<br/> +<blockquote> +init<br/> +set datafusion.execution.parquet.binary_as_string = true; +</blockquote> + +</td> +</tr> +<tr> +<td>run</td> +<td> + +The run directive called during execution of the benchmark. If a path to a file is provided on the same line as +the run directive that path will be parsed and any sql statements in that file will be executed during the benchmark +run. If no path is specified the next line is required to be the sql statement to execute. <br/><br/> Multiple +statements are allowed within a single run directive, however a benchmark file may contain only one run directive. When +running with `BENCH_PERSIST_RESULTS` or `BENCH_VALIDATE`, only the last `SELECT` or `WITH` statement from that run +directive will be used for comparison. <br/><br/> The run directive (including any following sql statement) must be +followed by a blank line.<br/><br/>Example:<br/> +<blockquote>run sql_benchmarks/imdb/queries/${QUERY_NUMBER_PADDED}.sql</blockquote> + +</td> +</tr> +<tr> +<td>cleanup</td> +<td> + +The cleanup directive is called after all other directives and can be used to cleanup after the benchmark - +e.g. to drop tables. If a path to a file is provided on the same line as the cleanup directive that path will be parsed +and any sql statements in that file will be executed during cleanup. If no path is specified the next line is +required to be the sql statement to execute. <br/> <br/> The cleanup directive (including any following sql statement) +must be followed by a blank line. <br/><br/>Example:<br/> +<blockquote> +cleanup<br/> +DROP TABLE test; +</blockquote> + +</td> +</tr> +<tr> +<td>expect_plan</td> +<td> + +The expect_plan directive will check the physical plan for the string provided on the same line. This +can be used to validate that a particular join was used. <br/> <br/> Example:<br/> +<blockquote>expect_plan NestedLoopJoinExec</blockquote> + +</td> +</tr> +<tr> +<td>assert</td> +<td> + +The assert directive is run between the init and run directives and can be used to validate system state correctness +prior to running the benchmark sql. The format is +<blockquote> +assert II<br/> +SELECT name, value = 3 FROM information_schema.df_settings WHERE name IN ('datafusion.execution.target_partitions', 'datafusion.execution.planning_concurrency');<br/> +----<br/> +datafusion.execution.planning_concurrency true<br/> +datafusion.execution.target_partitions true<br/> +</blockquote> + +The number of I's corresponds to the number of columns in the result. The expected results can be either tab delimited +or pipe delimited. + +</td> +</tr> +<tr> +<td>result_query</td> +<td> + +The result_query directive is run during the verify phase and can be used to verify a different set of results than any +that might come from queries executed from the `run` directive. The format is the same as the `assert` directive +above.<br/><br/>Example: +<blockquote> +result_query III<br/> +SELECT COUNT(DISTINCT id2), SUM(r4), COUNT(*) FROM answer;<br/> +----<br/> +123 345 45 +</blockquote> + + +Note that the results of the run query are not automatically stored into a table in datafusion. If you want to +verify a result from queries executed from the `run` directive those queries will have to be saved to a table directly +using `CREATE TABLE AS (..)` or similar. + +</td> +</tr> +<tr> +<td>result</td> +<td> + +The result directive declares the expected result file used during verification. A path to a file is required on the +same line as the result directive. The file is parsed only during verification, and must be a pipe-delimited CSV file +with a header row. During verification, these expected rows are compared with the rows produced by the last saved +`SELECT` or `WITH` statement from the `run` directive. <br/><br/>Example:<br/> +<blockquote> +result sql_benchmarks/imdb/results/${QUERY_NUMBER_PADDED}.csv +</blockquote> + +</td> +</tr> +<tr> +<td>template</td> +<td> + +The template directive allows for inclusion of another file in a benchmark file. A path to a file is +required on the same line as the template directive which will be parsed as a benchmark file. Parameters can be passed +to the template file using the format `KEY=value`, one per line after the template directive followed by a blank line. +<br/><br/>Example:<br/> +<blockquote> +template sql_benchmarks/smj/smj.benchmark.template<br/> +QUERY_NUMBER=1<br/> +QUERY_NUMBER_PADDED=01 +</blockquote> + +</td> +</tr> +<tr> +<td>include</td> +<td>The include directive is similar to the template directive except that it does not support parameters.</td> +</tr> +<tr> +<td>echo</td> +<td> + +The echo directive allows for echoing a string to stdout during the execution of the benchmark and may be useful for +debugging.<br/><br/>Example:<br/> +<blockquote> +echo The value for batch size is ${BATCH_SIZE:-8192} +</blockquote> + +</td> +</tr> +</table> + +# Run a single benchmark suite + +```shell +BENCH_NAME=tpch cargo bench --bench sql +``` + +As you can see above the actual benchmark suite to run is set via an environment variable. Cargo's bench command and +criterion have an unfortunate limitation in that custom command arguments cannot be passed +into a benchmark. The alternative is to use environment variables to pass in arguments which is what is used here. +The SQL benchmarking tool uses the following environment variables: + +| Environment Variable | Description | +|---------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| BENCH_NAME | The name of the benchmark suite to run. For example 'imdb'. This should correspond to a directory name in the `sql_benchmarks` directory. | +| BENCH_SUBGROUP | The subgroup with the benchmark suite to run. For example 'window' to run the window subgroup of the h2o benchmark. | +| BENCH_QUERY | A query number to run. | +| BENCH_PERSIST_RESULTS | true/false to persist benchmark results. Results will be persisted in csv format so be cognizant of the size of the results. | +| BENCH_VALIDATE | true/false to validate benchmark results against persisted results or result_query's. If both `BENCH_PERSIST_RESULTS` and `BENCH_VALIDATE` are true, persist mode runs and validation is skipped. | +| SIMULATE_LATENCY | Simulate object store latency to mimic remote storage (e.g. S3). Adds random latency in the range 20-200ms to each object store operation. | +| PARTITIONS | Number of partitions to process in parallel. Defaults to number of available cores. | +| BATCH_SIZE | Batch size when reading CSV or Parquet files. | +| MEM_POOL_TYPE | The memory pool type to use, should be one of "fair" or "greedy". | +| MEMORY_LIMIT | Memory limit (e.g. '100M', '1.5G'). If not specified, run all pre-defined memory limits for given query if there's any, otherwise run with no memory limit. | +| DATAFUSION_RUNTIME_MEMORY_LIMIT | Used if MEMORY_LIMIT is not set. | +| SORT_SPILL_RESERVATION_BYTES | The amount of memory to reserve for sort spill operations. DataFusion's default value will be used if not specified. | + +Example – Run the H2O window benchmarks on the 'small' sized CSV data files: + +``` bash +export BENCH_NAME=h2o Review Comment: I redid the readme.md a bit to hopefully address this. It's better but I suspect someone better at writing can help a lot here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
