This is an automated email from the ASF dual-hosted git repository.

zykkk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
     new 89fff98ced [docs](docs)Update spark_load.md (#22428)
89fff98ced is described below

commit 89fff98ced631a681408b0114df45ae9390b7595
Author: Hu Yanjun <100749531+httpshir...@users.noreply.github.com>
AuthorDate: Fri Aug 4 19:22:52 2023 +0800

    [docs](docs)Update spark_load.md (#22428)
---
 docs/en/community/design/spark_load.md             | 220 +++++-
 .../developer-guide/regression-testing.md          | 836 ++++++++++++++++++++-
 2 files changed, 1053 insertions(+), 3 deletions(-)

diff --git a/docs/en/community/design/spark_load.md b/docs/en/community/design/spark_load.md
index 9e630f4393..7e5b9b155a 100644
--- a/docs/en/community/design/spark_load.md
+++ b/docs/en/community/design/spark_load.md
@@ -24,6 +24,222 @@ specific language governing permissions and limitations under the License. -->

# Implementation Design for Spark Load

## Background

Doris supports various data ingestion methods, including Broker Load, Routine Load, Stream Load, and Mini Batch Load.

Spark Load is the optimal choice for first-time data ingestion into Doris. It is designed for moving large volumes of data and can speed up ingestion considerably.

## Basic Concepts

* FE: Frontend, the frontend nodes of Doris. They process requests from clients, manage metadata and the cluster, and generate query execution plans.
* BE: Backend, the backend nodes of Doris. They store and manage data and execute query plans.
* Tablet: A table in Doris is divided into tablets.
* DPP: Data preprocessing. Transforming, cleaning, partitioning, sorting, and aggregating data using external computing resources (Hadoop, Spark).

## Design

### Objective

Doris itself is relatively slow at batch imports of 100 GB or larger, and since it has not implemented read/write separation, such batch imports can drive CPU usage high. For users who need to migrate large data volumes, Doris provides the Spark Load option. It leverages the data processing and concurrency capabilities of Spark clusters to perform ETL, sorting, and aggregation upon data ingestion. In this way, users can migrate huge datasets faster and at lower cost.

The Spark Load method must be compatible with the multiple deployment modes of Spark. The first step is to support YARN clusters. Meanwhile, considering the variety of data formats, it needs to support multiple kinds of data files, including CSV, Parquet, and ORC.

### Implementation Plans

When designing the implementation of Spark Load, we take our existing data loading framework into account.

#### Plan A

For ease of development, we hope to reuse the existing data loading framework as much as possible, so this is the overall plan for implementing Spark Load:

The Spark Load statement input by the user will be parsed to produce a LoadStmt. An identifier field `isSparkLoad` will be added to the LoadStmt. If it is true, the system will create a SparkLoadJob (similar to how a BrokerLoadJob is created). The job will be executed by a state machine mechanism. In the PENDING stage, a SparkLoadPendingTask will be created; in the LOADING stage, a LoadLoadingTask will be created for data loading. Doris reuses its existing data loading framework and executes the [...]

A few considerations for this plan:

##### Syntax

The Spark Load statement is similar to that of Broker Load.
This is an example:

```
LOAD LABEL example_db.label1
(
    DATA INFILE("hdfs://hdfs_host:hdfs_port/user/palo/data/input/file")
    NEGATIVE
    INTO TABLE `my_table`
    PARTITION (p1, p2)
    COLUMNS TERMINATED BY ","
    columns(k1,k2,k3,v1,v2)
    set (
        v3 = v1 + v2,
        k4 = hll_hash(k2)
    )
    where k1 > 20
)
with spark.cluster_name
PROPERTIES
(
    "spark.master" = "yarn",
    "spark.executor.cores" = "5",
    "spark.executor.memory" = "10g",
    "yarn.resourcemanager.address" = "xxx.tc:8032",
    "max_filter_ratio" = "0.1"
);
```

`spark.cluster_name` is the name of the Spark cluster used for the load. It can be set via SET PROPERTY, and users may refer to the configurations of the original Hadoop cluster. The Spark cluster configurations in PROPERTIES will overwrite those in `spark.cluster_name`.

Explanation of the properties:

- spark.master: the Spark cluster deployment mode. Supported modes include yarn/standalone/local/k8s. We will prioritize support for yarn and implement the yarn-cluster mode first. (The yarn-client mode is mainly used in interactive scenarios.)
- spark.executor.cores: the number of CPU cores per executor
- spark.executor.memory: the memory size per executor
- yarn.resourcemanager.address: the address of the ResourceManager of the specified YARN cluster
- max_filter_ratio: the upper limit of the filter ratio

##### SparkLoadJob

After a user sends a Spark Load statement, the statement will be parsed and a SparkLoadJob will be created.

```
SparkLoadJob:
         +-------+-------+
         |    PENDING    |-----------------|
         +-------+-------+                 |
                 | SparkLoadPendingTask    |
                 v                         |
         +-------+-------+                 |
         |    LOADING    |-----------------|
         +-------+-------+                 |
                 | LoadLoadingTask         |
                 v                         |
         +-------+-------+                 |
         |   COMMITTED   |-----------------|
         +-------+-------+                 |
                 |                         |
                 v                         v
         +-------+-------+         +-------+-------+
         |   FINISHED    |         |   CANCELLED   |
         +-------+-------+         +-------+-------+
                 |                         Λ
                 +-------------------------+
```

The above is the execution process of a SparkLoadJob.

##### SparkLoadPendingTask

A SparkLoadPendingTask commits a Spark ETL job to the Spark cluster. Since Spark supports various deployment modes (local, standalone, yarn, k8s), we need a generic interface abstraction for Spark ETL. The main tasks include:

- Commit a Spark ETL job
- Cancel a Spark ETL job
- Obtain the status of a Spark ETL job

A prototype of the interface:

```
class SparkEtlJob {
    // Commit a Spark ETL job and return the job id
    String submitJob(TBrokerScanRangeParams params);

    // Cancel a Spark ETL job
    boolean cancelJob(String jobId);

    // Obtain the status of a Spark ETL job
    JobStatus getJobStatus(String jobId);

    private List<DataDescription> dataDescriptions;
}
```

We plan to implement different sub-classes for the different cluster deployment modes, starting with SparkEtlJobForYarn to support Spark Load on YARN clusters. The JobId in the above snippet is the appid assigned by the YARN cluster. One way to obtain the appid is to commit the Spark job via the spark-submit client, analyze the standard error output, and retrieve the appid by text matching.
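To illustrate that approach, here is a minimal sketch of scraping the appid from spark-submit's standard error. The class name, the process handling, and the exact log wording are assumptions for demonstration, not the actual Doris implementation; YARN application ids do follow the `application_<clusterTimestamp>_<sequence>` pattern:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: launch spark-submit and scrape the YARN appid
// (e.g. "application_1690000000000_0001") from its standard error output.
public class SparkSubmitAppIdScraper {
    // YARN application ids follow the pattern "application_<clusterTimestamp>_<sequence>"
    private static final Pattern APP_ID = Pattern.compile("(application_\\d+_\\d+)");

    public static String submitAndGetAppId(String... sparkSubmitCmd) throws Exception {
        Process proc = new ProcessBuilder(sparkSubmitCmd).start();
        try (BufferedReader stderr = new BufferedReader(
                new InputStreamReader(proc.getErrorStream()))) {
            String line;
            while ((line = stderr.readLine()) != null) {
                Matcher m = APP_ID.matcher(line);
                if (m.find()) {
                    return m.group(1);   // first appid seen in stderr
                }
            }
        }
        throw new IllegalStateException("no YARN appid found in spark-submit output");
    }
}
```

Such text matching is fragile against log-format changes, so hiding it behind the `SparkEtlJob` abstraction keeps the rest of the system insulated.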
However, based on lessons drawn from Hadoop DPP jobs, we need to take care of job queuing issues: due to data volume or cluster queue reasons, the number of concurrent loading jobs might hit the limit, and subsequent job commits might fail. One solution is to separately set a limit on the number of concurrent Spark Load jobs and a per-user concurrency limit, so as to avoid interference between jobs of different users.

Additionally, implementing a job scheduling system that prioritizes jobs based on their importance and allocates resources accordingly can ensure that critical jobs are processed first and that resources are used efficiently.
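A minimal sketch of such a per-user limit follows; the class, its fields, and the limit-checking policy are hypothetical illustrations of the idea, not part of the Doris code base:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: gate Spark Load job submission by a global and a per-user limit.
class SparkLoadJobScheduler {
    private final int maxConcurrentJobs;   // cluster-wide limit
    private final int maxJobsPerUser;      // per-user limit
    private final AtomicInteger runningJobs = new AtomicInteger(0);
    private final Map<String, AtomicInteger> runningJobsByUser = new ConcurrentHashMap<>();

    SparkLoadJobScheduler(int maxConcurrentJobs, int maxJobsPerUser) {
        this.maxConcurrentJobs = maxConcurrentJobs;
        this.maxJobsPerUser = maxJobsPerUser;
    }

    // Returns true if the job may be submitted now; otherwise the caller should queue or reject it.
    boolean tryAcquire(String user) {
        AtomicInteger userCount = runningJobsByUser.computeIfAbsent(user, u -> new AtomicInteger(0));
        int global = runningJobs.incrementAndGet();
        int perUser = userCount.incrementAndGet();
        if (global > maxConcurrentJobs || perUser > maxJobsPerUser) {
            // over either limit: roll back both counters and refuse
            userCount.decrementAndGet();
            runningJobs.decrementAndGet();
            return false;
        }
        return true;
    }

    // Must be called when a job finishes or is cancelled.
    void release(String user) {
        runningJobsByUser.get(user).decrementAndGet();
        runningJobs.decrementAndGet();
    }
}
```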
Key steps of executing a Spark ETL job include:

1. Type conversion (extraction/transformation)

   Convert the fields in the source file into the proper column types (check whether the fields are legal, prepare for function computation, etc.)

2. Function computation (transformation), including negative computation

   Complete the computation of the specified functions. Function list: "strftime", "time_format", "alignment_timestamp", "default_value", "md5sum", "replace_value", "now", "hll_hash", "substitute".

3. "columns from path" extraction

4. "where" filtering

5. Data partitioning and bucketing

6. Data sorting and pre-aggregation

   OlapTableSink already performs data sorting and aggregation, so logically we don't need to implement them separately, **but we have decided to include them in the Spark ETL job so that BE can skip these steps during loading and thus work more efficiently.** We might change that based on future performance tests.

   Another nut to crack is the global dictionary for bitmap columns, since bitmap columns of string type require a global dictionary.

   To indicate whether data sorting and aggregation have been completed, we plan to generate a job.json description file upon job completion, which includes the following properties:

   ```
   {
       "is_segment_file" : "false",
       "is_sort" : "true",
       "is_agg" : "true"
   }
   ```

   Explanation:
   - is_sort: whether the data is sorted
   - is_agg: whether the data is aggregated
   - is_segment_file: whether the generated files are segment files

7. Since the computation of a Rollup is based on the base table, to optimize the generation of Rollups, we need to look into the index hierarchy.

The tricky part is supporting expression calculation for columns.

Lastly, for the storage format of the output files of a finished Spark ETL job, we plan to support CSV, Parquet, and ORC, and make Parquet the default option for its higher storage efficiency.
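To make the key steps above concrete, here is a minimal sketch of the type-conversion, filtering, bucketing, sorting, and pre-aggregation stages using the Spark DataFrame API. The schema, bucket count, aggregation function, and output path are assumptions borrowed from the earlier LOAD example; this is an illustration, not the actual Doris ETL job:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

// Hypothetical sketch of the Spark ETL stages: read, convert types,
// compute derived columns, filter, bucket by the distribution key,
// then sort and pre-aggregate within buckets.
public class SparkEtlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("doris-spark-load-etl").getOrCreate();

        Dataset<Row> src = spark.read()
                .option("sep", ",")
                .csv("hdfs://hdfs_host:hdfs_port/user/palo/data/input/file")  // path from the LOAD example
                .toDF("k1", "k2", "k3", "v1", "v2");

        Dataset<Row> transformed = src
                .withColumn("k1", col("k1").cast("int"))          // 1. type conversion
                .withColumn("v1", col("v1").cast("bigint"))
                .withColumn("v2", col("v2").cast("bigint"))
                .withColumn("v3", col("v1").plus(col("v2")))      // 2. function computation (the set clause)
                .filter(col("k1").gt(20));                        // 4. "where" filtering

        // 5./6. bucket by the distribution key, pre-aggregate, and sort within buckets
        Dataset<Row> result = transformed
                .withColumn("bucket", pmod(hash(col("k1")), lit(32)))  // assume 32 buckets
                .groupBy("bucket", "k1", "k2", "k3")
                .agg(sum("v3").as("v3"))                          // assume a SUM aggregate column
                .repartition(col("bucket"))
                .sortWithinPartitions("bucket", "k1");

        result.write().partitionBy("bucket").parquet("hdfs://.../output");  // Parquet as the default output
        spark.stop();
    }
}
```

In the real job, step 6 would also need the global dictionary for bitmap columns of string type, which a plain `groupBy` like this does not address.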
##### LoadLoadingTask

For the LoadLoadingTask, we can reuse our existing logic. What's different from BrokerLoadJob is that a LoadLoadingTask doesn't need to include column mapping, function computation, negative import, filtering, or aggregation, because they have all been done by the SparkEtlTask. A LoadLoadingTask only needs simple column mapping and type conversion.

##### Data loading to BE

For this part, we can fully reuse the existing framework.

#### Plan B

Plan A can reuse our data loading framework to a great extent and realize support for large data imports as soon as possible. The downside is that the data processed by the Spark ETL job is already partitioned into tablets, but our existing Broker Load framework will still partition and bucket the data, serialize it, and then send it to the target BE nodes via RPC. This process generates extra overhead.

So we have a second plan. Plan B is to have the SparkEtlJob generate segment files in the Doris storage format, and then have the three replicas complete file loading via the add_rowset interface under a clone-like mechanism. The difference between Plan B and Plan A is that, in Plan B, you need to:

1. Add a tablet id suffix to the generated files.
2. Add an interface to the SparkLoadPendingTask class: `Map<Long, Pair<String, Long>> getFilePathMap()`, which returns the mapping between tablet ids and files.
3. Add a "spark_push" interface to the BE RPC service, in order to pull the transformed files after ETL to the local machine (read via Broker) and then load the data via the "add_rowset" interface using clone-like logic.
4. Generate a new loading task: SparkLoadLoadingTask. The task reads job.json files, parses the properties, and uses the properties as RPC parameters. It then calls the "spark_push" interface and sends a data loading request to the BE where the tablet is located. The "spark_push" interface in BE decides what to do based on the "is_segment_file" parameter: if it is true, it downloads the segment files and adds a rowset; if it is false, it follows the pusher logic and executes data loading.

Plan B also moves the segment file generation workload to the Spark cluster, which can greatly reduce the burden on the Doris cluster and thus improve efficiency. However, Plan B relies on packaging the underlying rowset and segment v2 interfaces into independent shared object (SO) files, and on using Spark to call these interfaces to transform data into segment files.
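The job.json contract described in step 4 could be consumed on the loading side roughly as follows. This is a hypothetical sketch with made-up class names (using the Gson library for brevity), not the actual SparkLoadLoadingTask:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import com.google.gson.Gson;

// Hypothetical sketch: read job.json and choose between the segment path
// (add_rowset) and the pusher path, as Plan B describes.
public class SparkLoadJobDescriptor {
    static class JobDesc {
        String is_segment_file;
        String is_sort;
        String is_agg;
    }

    public static void main(String[] args) throws Exception {
        String json = new String(Files.readAllBytes(Paths.get("job.json")));
        JobDesc desc = new Gson().fromJson(json, JobDesc.class);

        if (Boolean.parseBoolean(desc.is_segment_file)) {
            // Segment files were produced by the Spark ETL job:
            // download them and register them via an add_rowset-style interface.
            System.out.println("download segment files, then add_rowset");
        } else {
            // Plain columnar files (e.g. Parquet): fall back to the pusher logic,
            // which converts formats on the BE side during loading.
            System.out.println("follow pusher logic to convert and load");
        }
    }
}
```

The `is_sort` and `is_agg` flags would analogously tell BE whether it can skip sorting and aggregation during loading.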
## Conclusion

Comparing the two plans, Plan A requires the least development work but leaves unnecessary workloads on the BE, while for Plan B we can make use of our existing Hadoop data loading framework. Therefore, we plan to implement Spark Load in two steps.

Step 1: Go with Plan B: complete data partitioning, sorting, and aggregation in Spark and generate Parquet files; then follow the Hadoop pusher process and convert data formats in the BE.

Step 2: Encapsulate the segment-writing library, generate Doris storage formats directly, and add an RPC interface to implement the clone-like data loading logic.

diff --git a/docs/en/community/developer-guide/regression-testing.md b/docs/en/community/developer-guide/regression-testing.md
index c5981fd3ca..46acaba921 100644
--- a/docs/en/community/developer-guide/regression-testing.md
+++ b/docs/en/community/developer-guide/regression-testing.md
@@ -27,4 +27,838 @@ under the License.

# Regression Testing

## Basic Concepts

1. `Suite`: a test case (the file name)
2. `Group`: a test set (the directory that the test case belongs to)
3. `Action`: an encapsulated test action, such as `sql_action` for SQL execution, `test_action` for result verification, and `streamLoad_action` for data ingestion

## Steps

1. Get the clusters ready.
2. Modify the configuration file `${DORIS_HOME}/regression-test/conf/regression-conf.groovy`: set items such as the JDBC URL and user.
3. Create the test case files and write the test cases.
4. If a test case includes a `qt` Action, you also need to create the relevant data files. For example, the case `suites/demo/qt_action.groovy` requires a TSV file `data/demo/qt_action.out` for output verification.
5. Run `${DORIS_HOME}/run-regression-test.sh` to test all cases, or run `${DORIS_HOME}/run-regression-test.sh --run <suiteName>` to test a few cases. For more examples, please refer to the "Startup Script Examples" section on this page.

## Directory Structure

Key files and directories to pay attention to:

1. `run-regression-test.sh`: the startup script
2. `regression-conf.groovy`: the default configurations for regression testing
3. `data`: where the input and output data are
4. `suites`: where the test cases are

```
./${DORIS_HOME}
    |-- **run-regression-test.sh**           Startup script for regression testing
    |-- regression-test
    |   |-- plugins                          Plug-ins
    |   |-- conf
    |   |   |-- logback.xml                  Log configurations
    |   |   |-- **regression-conf.groovy**   Default configurations
    |   |
    |   |-- framework                        Framework source code for regression testing
    |   |-- **data**                         Input and output data files of test cases
    |   |   |-- demo                         Input and output data files of demo
    |   |   |-- correctness                  Input and output data files of correctness test cases
    |   |   |-- performance                  Input and output data files of performance test cases
    |   |   |-- utils                        Input and output data files of other utilities
    |   |
    |   |-- **suites**                       Test cases for regression testing
    |       |-- demo                         Demo test cases
    |       |-- correctness                  Correctness test cases
    |       |-- performance                  Performance test cases
    |       |-- utils                        Other utilities
    |
    |-- output
        |-- regression-test
            |-- log                          Logs of regression testing
```

## Default Configurations of the Framework

Modify the JDBC and FE configurations based on your setup.

```groovy
/* ============ Key parts to pay attention to ============ */
// The default database, which will be created if it does not exist
defaultDb = "regression_test"

// JDBC configurations
jdbcUrl = "jdbc:mysql://127.0.0.1:9030/?"
jdbcUser = "root"
jdbcPassword = ""

// FE address configurations, used for Stream Load
feHttpAddress = "127.0.0.1:8030"
feHttpUser = "root"
feHttpPassword = ""

/* ============ Configurations that usually do not require modifications ============ */

// DORIS_HOME is loaded by run-regression-test.sh
// which is java -DDORIS_HOME=./

// Set the path for the test cases
suitePath = "${DORIS_HOME}/regression-test/suites"
// Set the path for the input and output data
dataPath = "${DORIS_HOME}/regression-test/data"
// Set the path for the plug-ins
pluginPath = "${DORIS_HOME}/regression-test/plugins"

// By default, all groups will be read. Groups are separated by commas, e.g. "demo,performance".
// This can be dynamically specified and overwritten via run-regression-test.sh --run -g
testGroups = ""
// By default, all suites will be read. This can be dynamically specified and overwritten via run-regression-test.sh --run -s
testSuites = ""
// Directories specified in this parameter will be loaded by default. It can be dynamically specified and overwritten via run-regression-test.sh --run -d
testDirectories = ""

// Groups specified in this parameter will be excluded. It can be dynamically specified and overwritten via run-regression-test.sh --run -xg
excludeGroups = ""
// Suites specified in this parameter will be excluded. It can be dynamically specified and overwritten via run-regression-test.sh --run -xs
excludeSuites = ""
// Directories specified in this parameter will be excluded. It can be dynamically specified and overwritten via run-regression-test.sh --run -xd
excludeDirectories = ""

// Other user-defined configurations
customConf1 = "test_custom_conf_value"
```

## Steps to Write Test Cases

1. Enter the `${DORIS_HOME}/regression-test` directory.
2. Choose a directory based on the purpose of the test: place correctness tests in `suites/correctness` and performance tests in `suites/performance`.
3. Create a Groovy test case file and add a few `Action`s for testing.

## Action

An `Action` is a test action defined via a DSL and provided by the test framework.

### sql action

A `sql_action` is used to commit SQL and obtain results. If the query fails, an error will be thrown.

Parameters:

- String sql: the input SQL string
- `return List<List<Object>>`: the query result. For DDL/DML, a result of one row and one column will be returned, and the value is the updated row count.

You can find the following sample code in `${DORIS_HOME}/regression-test/suites/demo/sql_action.groovy`:

```groovy
suite("sql_action", "demo") {
    // execute sql and ignore result
    sql "show databases"

    // execute sql and get result, outer List denotes rows, inner List denotes columns in a single row
    List<List<Object>> tables = sql "show tables"

    // assertXxx() will invoke junit5's Assertions.assertXxx() dynamically
    assertTrue(tables.size() >= 0) // test rowCount >= 0

    // syntax error
    try {
        sql "a b c d e"
        throw new IllegalStateException("Should be syntax error")
    } catch (java.sql.SQLException t) {
        assertTrue(true)
    }

    def testTable = "test_sql_action1"

    try {
        sql "DROP TABLE IF EXISTS ${testTable}"

        // multi-line sql
        def result1 = sql """
            CREATE TABLE IF NOT EXISTS ${testTable} (
                id int
            )
            DISTRIBUTED BY HASH(id) BUCKETS 1
            PROPERTIES (
              "replication_num" = "1"
            )
            """

        // DDL/DML return 1 row and 1 column, the only value is update row count
        assertTrue(result1.size() == 1)
        assertTrue(result1[0].size() == 1)
        assertTrue(result1[0][0] == 0, "Create table should update 0 rows")

        def result2 = sql "INSERT INTO test_sql_action1 values(1), (2), (3)"
        assertTrue(result2.size() == 1)
        assertTrue(result2[0].size() == 1)
        assertTrue(result2[0][0] == 3, "Insert should update 3 rows")
    } finally {
        /**
         * try_xxx(args) means:
         *
         * try {
         *     return xxx(args)
         * } catch (Throwable t) {
         *     // do nothing
         *     return null
         * }
         */
        try_sql("DROP TABLE IF EXISTS ${testTable}")

        // you can see the error sql will not throw exception and return
        try {
            def errorSqlResult = try_sql("a b c d e f g")
            assertTrue(errorSqlResult == null)
        } catch (Throwable t) {
            assertTrue(false, "Never catch exception")
        }
    }

    // order_sql(sqlStr) equals to sql(sqlStr, isOrder=true)
    // sort result by string dict
    def list = order_sql """
                select 2
                union all
                select 1
                union all
                select null
                union all
                select 15
                union all
                select 3
                """

    assertEquals(null, list[0][0])
    assertEquals(1, list[1][0])
    assertEquals(15, list[2][0])
    assertEquals(2, list[3][0])
    assertEquals(3, list[4][0])
}
```

### qt action

A `qt_action` is used to commit SQL and verify the output against the corresponding `.out` TSV file.

- String sql: the input SQL string
- return void

You can find the following sample code in `${DORIS_HOME}/regression-test/suites/demo/qt_action.groovy`:

```groovy
suite("qt_action", "demo") {
    /**
     * qt_xxx sql equals to quickTest(xxx, sql), in which xxx is the tag.
     * The result will be compared to the relevant file: ${DORIS_HOME}/regression_test/data/qt_action.out.
     *
     * If you want to generate the .out tsv file, you can run with -genOut or -forceGenOut.
     * e.g.
     *   ${DORIS_HOME}/run-regression-test.sh --run qt_action -genOut
     *   ${DORIS_HOME}/run-regression-test.sh --run qt_action -forceGenOut
     */
    qt_select "select 1, 'beijing' union all select 2, 'shanghai'"

    qt_select2 "select 2"

    // Order result by string dict then compare to .out file.
    // order_qt_xxx sql equals to quickTest(xxx, sql, true).
    order_qt_union_all """
                select 2
                union all
                select 1
                union all
                select null
                union all
                select 15
                union all
                select 3
                """
}
```

### test action

A `test_action` is used for tests with more complicated verification rules, such as verifying the number of rows, the execution time, and whether an error is thrown.

Parameters:

- String sql: the input SQL string
- `List<List<Object>> result`: provide a list to check whether the output is equal to that list
- `Iterator<Object> resultIterator`: provide an iterator to check whether the output is equal to that iterator
- String resultFile: provide a file URI (a relative path to a local file, or an http(s) path) to check whether the output is equal to the file content. The format is similar to that of the `.out` file, but without header or comments.
- String exception: check whether the thrown error contains certain strings
- long rowNum: check the number of rows
- long time: check whether the execution time is shorter than this value, measured in milliseconds
- `Closure<List<List<Object>>, Throwable, Long, Long> check`: a user-defined callback validation function, which receives the query result, the thrown error, and the start and end times. When a callback validation function is declared, the other verification methods are ignored.

You can find the following sample code in `${DORIS_HOME}/regression-test/suites/demo/test_action.groovy`:

```groovy
suite("test_action", "demo") {
    test {
        sql "abcdefg"
        // check exception message contains
        exception "errCode = 2, detailMessage = Syntax error"
    }

    test {
        sql """
            select *
            from (
                select 1 id
                union all
                select 2
            ) a
            order by id"""

        // multi check condition

        // check return 2 rows
        rowNum 2
        // execute time must <= 5000 millisecond
        time 5000
        // check result, must be 2 rows and 1 column, the first row is 1, second is 2
        result(
            [[1], [2]]
        )
    }

    test {
        sql "a b c d e f g"

        // other check will not work because already declared a check callback
        exception "aaaaaaaaa"

        // callback
        check { result, exception, startTime, endTime ->
            // assertXxx() will invoke junit5's Assertions.assertXxx() dynamically
            assertTrue(exception != null)
        }
    }

    test {
        sql """
                select 2
                union all
                select 1
                union all
                select null
                union all
                select 15
                union all
                select 3
                """

        check { result, ex, startTime, endTime ->
            // same as order_sql(sqlStr)
            result = sortRows(result)

            assertEquals(null, result[0][0])
            assertEquals(1, result[1][0])
            assertEquals(15, result[2][0])
            assertEquals(2, result[3][0])
            assertEquals(3, result[4][0])
        }
    }

    // execute sql and order query result, then compare to iterator
    def selectValues = [1, 2, 3, 4]
    test {
        order true
        sql selectUnionAll(selectValues)
        resultIterator(selectValues.iterator())
    }

    // compare to data/demo/test_action.csv
    test {
        order true
        sql selectUnionAll(selectValues)

        // you can set to http://xxx or https://xxx
        // and compare to http response body
        resultFile "test_action.csv"
    }
}
```

### explain action

An `explain_action` is used to check whether the string returned by `EXPLAIN` contains certain strings.
Parameters:

- String sql: the SQL to be explained
- String contains: check whether certain strings are included. You can check for multiple strings at a time.
- String notContains: check whether certain strings are not included. You can check for multiple strings at a time.
- `Closure<String> check`: a user-defined callback validation function, by which you can obtain the returned string. When a callback validation function is declared, the other verification methods are ignored.
- `Closure<String, Throwable, Long, Long> check`: a user-defined callback validation function, by which you can additionally obtain the thrown error and the start and end times

You can find the following sample code in `${DORIS_HOME}/regression-test/suites/demo/explain_action.groovy`:

```groovy
suite("explain_action", "demo") {
    explain {
        sql("select 100")

        // contains("OUTPUT EXPRS:<slot 0> 100\n") && contains("PARTITION: UNPARTITIONED\n")
        contains "OUTPUT EXPRS:<slot 0> 100\n"
        contains "PARTITION: UNPARTITIONED\n"
    }

    explain {
        sql("select 100")

        // contains(" 100\n") && !contains("abcdefg") && !contains("1234567")
        contains " 100\n"
        notContains "abcdefg"
        notContains "1234567"
    }

    explain {
        sql("select 100")
        // simple callback
        check { explainStr -> explainStr.contains("abcdefg") || explainStr.contains(" 100\n") }
    }

    explain {
        sql("a b c d e")
        // callback with exception and time
        check { explainStr, exception, startTime, endTime ->
            // assertXxx() will invoke junit5's Assertions.assertXxx() dynamically
            assertTrue(exception != null)
        }
    }
}
```

### streamLoad action

A `streamLoad_action` is used to ingest data.

Parameters:

- String db: the database; defaults to the defaultDb set in regression-conf.groovy
- String table: the table name
- String file: the path of the file to be loaded. It can be a relative path under the data directory, or an HTTP URL.
- `Iterator<List<Object>> inputIterator`: the iterator to be loaded
- String inputText: the text to be loaded (rarely used)
- InputStream inputStream: the stream to be loaded (rarely used)
- long time: check whether the execution time is shorter than this value, measured in milliseconds
- void set(String key, String value): set an HTTP request header for the Stream Load, such as label or columnSeparator
- `Closure<String, Throwable, Long, Long> check`: a user-defined callback validation function, by which you can obtain the returned result, the thrown error, and the start and end times. When a callback validation function is declared, the other verification methods are ignored.

You can find the following sample code in `${DORIS_HOME}/regression-test/suites/demo/streamLoad_action.groovy`:

```groovy
suite("streamLoad_action", "demo") {

    def tableName = "test_streamload_action1"

    sql """
        CREATE TABLE IF NOT EXISTS ${tableName} (
            id int,
            name varchar(255)
        )
        DISTRIBUTED BY HASH(id) BUCKETS 1
        PROPERTIES (
          "replication_num" = "1"
        )
        """

    streamLoad {
        // you can skip declaring db, because a default db is already specified in ${DORIS_HOME}/conf/regression-conf.groovy
        // db 'regression_test'
        table tableName

        // default label is UUID:
        // set 'label' UUID.randomUUID().toString()

        // default column_separator is specified in doris fe config, usually '\t'.
        // this line changes it to ','
        set 'column_separator', ','

        // relates to ${DORIS_HOME}/regression-test/data/demo/streamload_input.csv.
        // also, you can stream load an http stream, e.g. http://xxx/some.csv
        file 'streamload_input.csv'

        time 10000 // limit inflight 10s

        // stream load action will check the result, including Success status and NumberTotalRows == NumberLoadedRows
    }


    // stream load 100 rows
    def rowCount = 100
    // range: [0, rowCount)
    // or rangeClosed: [0, rowCount]
    def rowIt = range(0, rowCount)
            .mapToObj({i -> [i, "a_" + i]}) // map a Long to a List of [Long, String]
            .iterator()

    streamLoad {
        table tableName
        // also, you can upload a memory iterator
        inputIterator rowIt

        // if a check callback is declared, the default check condition will be ignored,
        // so you must check all conditions yourself
        check { result, exception, startTime, endTime ->
            if (exception != null) {
                throw exception
            }
            log.info("Stream load result: ${result}".toString())
            def json = parseJson(result)
            assertEquals("success", json.Status.toLowerCase())
            assertEquals(json.NumberTotalRows, json.NumberLoadedRows)
            assertTrue(json.NumberLoadedRows > 0 && json.LoadBytes > 0)
        }
    }
}
```

### Other Actions

Other actions include thread, lazyCheck, events, connect, and selectUnionAll. You can find examples for them in `${DORIS_HOME}/regression-test/suites/demo`.

## Startup Script Examples

```shell
# View parameter descriptions for the script
./run-regression-test.sh -h

# View parameter descriptions for the framework
./run-regression-test.sh --run -h

# Test all cases
./run-regression-test.sh

# Delete test framework compilation results and test logs
./run-regression-test.sh --clean

# Test the suite named "sql_action". Currently, the test case file name (without suffix) is the same as the suite name, so the corresponding test case file here is sql_action.groovy.
./run-regression-test.sh --run sql_action

# Test the suites whose names contain 'sql'. **Remember to use single quotation marks.**
./run-regression-test.sh --run '*sql*'

# Test the "demo" and "performance" groups
./run-regression-test.sh --run -g 'demo,performance'

# Test the suite "sql_action" in the "demo" group
./run-regression-test.sh --run -g demo -s sql_action

# Test the suite "sql_action" in the demo directory
./run-regression-test.sh --run -d demo -s sql_action

# Test the suites under the demo directory except the "sql_action" suite
./run-regression-test.sh --run -d demo -xs sql_action

# Exclude the test cases in the demo directory
./run-regression-test.sh --run -xd demo

# Exclude the test cases in the "demo" group
./run-regression-test.sh --run -xg demo

# User-defined configurations
./run-regression-test.sh --run -conf a=b

# Parallel execution
./run-regression-test.sh --run -parallel 5 -suiteParallel 10 -actionParallel 20
```

## Generate `.out` File Based on the Query Result

```shell
# Automatically generate an .out file for the sql_action test case based on the query result. Skipped if the .out file already exists.
./run-regression-test.sh --run sql_action -genOut

# Automatically generate an .out file for the sql_action test case based on the query result. If an .out file already exists, it will be overwritten.
./run-regression-test.sh --run sql_action -forceGenOut
```

## Suite Plug-in

Sometimes there is a need to extend the Suite class, but it is not easy to change its source code. In this case, you can use plug-ins. The default directory for plug-ins is `${DORIS_HOME}/regression-test/plugins`, in which you can extend the Suite class with a Groovy script.
For example, this is how `plugin_example.groovy` adds a `testPlugin` function for log printing:

```groovy
import org.apache.doris.regression.suite.Suite

// Register `testPlugin` function to Suite,
// and invoke it in ${DORIS_HOME}/regression-test/suites/demo/test_plugin.groovy
Suite.metaClass.testPlugin = { String info /* param */ ->

    // which suite invokes the current function?
    Suite suite = delegate as Suite

    // function body
    suite.getLogger().info("Test plugin: suiteName: ${suite.name}, info: ${info}".toString())

    // optional return value
    return "OK"
}

logger.info("Added 'testPlugin' function to Suite")
```

After adding the `testPlugin` function, you can use it in a regular test case. For example, in `${DORIS_HOME}/regression-test/suites/demo/test_plugin.groovy`:

```groovy
suite("test_plugin", "demo") {
    // testPlugin function is registered in ${DORIS_HOME}/regression-test/plugins/plugin_example.groovy
    def result = testPlugin("message from suite")
    assertEquals("OK", result)
}
```

## CI/CD Support

### TeamCity

TeamCity can recognize service messages via stdout. If you start the regression test framework with the `--teamcity` parameter, TeamCity service messages will be printed to stdout. TeamCity will automatically read the event logs in stdout and show `Tests` in the current pipeline, including the tests and their logs. Therefore, you only need the following command to start the regression test framework. In the following snippet, `-Dteamcity.enableStdErr=fal [...]

```shell
JAVA_OPTS="-Dteamcity.enableStdErr=${enableStdErr}" ./run-regression-test.sh --teamcity --run
```

## E2E Test with External Data Sources

Doris supports queries on external data sources, so the regression testing framework allows users to build external data sources with Docker Compose and run end-to-end tests against them.

0. Preparation

   To begin with, modify `CONTAINER_UID` in `docker/thirdparties/custom_settings.env`, for example, to `doris-10002-18sda1-`. The startup scripts will substitute this value into the corresponding names in the Docker Compose files to keep container names consistent and distinct when multiple container environments coexist.

1. Start the Containers

   So far, Doris supports Docker Compose setups for the following data sources: Elasticsearch, MySQL, PostgreSQL, Hive, SQLServer, Oracle, Iceberg, Hudi, and Trino. The relevant files can be found in the directory `docker/thirdparties/docker-compose`.

   By default, you can use the following command to start the Docker containers for all external data sources:

   (Note that for the Hive and Hudi containers, you also need to download the pre-built data files first. See the relevant sections on Hive and Hudi below.)

   ```
   cd docker/thirdparties && sh run-thirdparties-docker.sh
   ```

   Executing this command requires root or sudo privileges. If the command returns successfully, all containers are started. You can check on them by running `docker ps -a`.

   To stop all containers, use the following command:

   ```
   cd docker/thirdparties && sh run-thirdparties-docker.sh --stop
   ```

   To start or stop specific components only, use the following commands:

   ```
   cd docker/thirdparties
   # Start MySQL
   sh run-thirdparties-docker.sh -c mysql
   # Start MySQL, PostgreSQL, Iceberg
   sh run-thirdparties-docker.sh -c mysql,pg,iceberg
   # Stop MySQL, PostgreSQL, Iceberg
   sh run-thirdparties-docker.sh -c mysql,pg,iceberg --stop
   ```
   1. MySQL

      You can find the MySQL-related Docker Compose files in `docker/thirdparties/docker-compose/mysql`.

      * `mysql-5.7.yaml.tpl`: Docker Compose file template, no modification required. The default username and password are root/123456.
      * `mysql-5.7.env`: Configuration file, in which you can configure the external port of the MySQL container (default: 3316).
      * `init/`: SQL files in this directory will be executed automatically once the container is created.
      * `data/`: This local data directory will be mounted into the container after startup. The `run-thirdparties-docker.sh` script clears and rebuilds this directory on every start.

   2. PostgreSQL

      You can find the PostgreSQL-related Docker Compose files in `docker/thirdparties/docker-compose/postgresql`.

      * `postgresql-14.yaml.tpl`: Docker Compose file template, no modification required. The default username and password are postgres/123456.
      * `postgresql-14.env`: Configuration file, in which you can configure the external port of the PostgreSQL container (default: 5442).
      * `init/`: SQL files in this directory will be executed automatically once the container is created. By default, that includes the creation of databases and tables and some data input.
      * `data/`: This local data directory will be mounted into the container after startup. The `run-thirdparties-docker.sh` script clears and rebuilds this directory on every start.

   3. Hive

      You can find the Hive-related Docker Compose files in `docker/thirdparties/docker-compose/hive`.

      * `hive-2x.yaml.tpl`: Docker Compose file template, no modification required.
      * `hadoop-hive.env.tpl`: Configuration file template, no modification required.
      * `gen_env.sh`: The script for initializing the configuration file. In this script, you can modify the two external ports: `FS_PORT` for defaultFs and `HMS_PORT` for the Hive metastore (defaults: 8120 and 9183, respectively). This script is called whenever `run-thirdparties-docker.sh` starts.
      * The `scripts/` directory will be mounted into the container after startup. Files in this directory require no modification. Note that you need to download the pre-built files before starting the container:

        Download the file from `https://doris-build-hk-1308700295.cos.ap-hongkong.myqcloud.com/regression/load/tpch1_parquet/tpch1.db.tar.gz` to the `scripts/` directory and decompress it.

   4. Elasticsearch

      You can find three Docker images (for Elasticsearch 6, Elasticsearch 7, and Elasticsearch 8) in `docker/thirdparties/docker-compose/elasticsearch/`.

      * `es.yaml.tpl`: Docker Compose file template covering all three versions, no modification required.
      * `es.env`: Configuration file, in which you need to configure the port numbers for Elasticsearch.
      * `scripts/`: This directory contains the initialization scripts run after the images start.

   5. Oracle

      You can find the image for Oracle 11 in `docker/thirdparties/docker-compose/oracle/`.

      * `oracle-11.yaml.tpl`: Docker Compose file template, no modification required.
      * `oracle-11.env`: Configure the external port for Oracle (default: 1521).

   6. SQLServer

      You can find the image for SQLServer 2022 in `docker/thirdparties/docker-compose/sqlserver/`.

      * `sqlserver.yaml.tpl`: Docker Compose file template, no modification required.
      * `sqlserver.env`: Configure the external port for SQLServer (default: 1433).
   7. ClickHouse

      You can find the image for ClickHouse 22 in `docker/thirdparties/docker-compose/clickhouse/`.

      * `clickhouse.yaml.tpl`: Docker Compose file template, no modification required.
      * `clickhouse.env`: Configure the external port for ClickHouse (default: 8123).

   8. Iceberg

      You can find the image for Iceberg + Spark + Minio in `docker/thirdparties/docker-compose/iceberg/`.

      * `iceberg.yaml.tpl`: Docker Compose file template, no modification required.
      * `entrypoint.sh.tpl`: Template of the initialization script run after the image starts, no modification required.
      * `spark-defaults.conf.tpl`: Configuration file template for Spark, no modification required.
      * `iceberg.env`: Configuration file for the external ports. You need to modify every external port to avoid conflicts.

      After the image is started, execute the following command to start spark-sql:

      `docker exec -it doris-xx-spark-iceberg spark-sql`

      In this command, `doris-xx-spark-iceberg` is the container name.

      Execution examples for spark-sql with Iceberg:

      ```
      create database db1;
      show databases;
      create table db1.test1(k1 bigint, k2 bigint, k3 string) partitioned by (k1);
      insert into db1.test1 values(1,2,'abc');
      select * from db1.test1;
      quit;
      ```

      You can also access it via spark-shell:

      ```
      docker exec -it doris-xx-spark-iceberg spark-shell

      spark.sql(s"create database db1")
      spark.sql(s"show databases").show()
      spark.sql(s"create table db1.test1(k1 bigint, k2 bigint, k3 string) partitioned by (k1)").show()
      spark.sql(s"show tables from db1").show()
      spark.sql(s"insert into db1.test1 values(1,2,'abc')").show()
      spark.sql(s"select * from db1.test1").show()
      :q
      ```

      For a more detailed usage guide, see the [Tabular documentation](https://tabular.io/blog/docker-spark-and-iceberg/).

   9. Hudi

      You can find the Hudi-related Docker Compose file in `docker/thirdparties/docker-compose/hudi`.

      * `hudi.yaml.tpl`: Docker Compose file template, no modification required.
      * `hadoop.env`: Configuration file template, no modification required.
      * The `scripts/` directory will be mounted into the container after startup. Files in this directory require no modification. Note that you need to download the pre-built files before starting the container:

        Download the file from `https://doris-build-hk-1308700295.cos.ap-hongkong.myqcloud.com/regression/load/hudi/hudi_docker_compose_attached_file.zip` to the `scripts/` directory and decompress it.
      * Before starting, add the following entries to `/etc/hosts` to avoid `UnknownHostException` errors:

        ```
        127.0.0.1 adhoc-1
        127.0.0.1 adhoc-2
        127.0.0.1 namenode
        127.0.0.1 datanode1
        127.0.0.1 hiveserver
        127.0.0.1 hivemetastore
        127.0.0.1 sparkmaster
        ```

      After starting, you can execute the following commands to run a Hive query:

      ```
      docker exec -it adhoc-2 /bin/bash

      beeline -u jdbc:hive2://hiveserver:10000 \
        --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
        --hiveconf hive.stats.autogather=false

      show tables;
      show partitions stock_ticks_mor_rt;
      select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
      select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
      exit;
      ```

      You can also access it via spark-shell:

      ```
      docker exec -it adhoc-1 /bin/bash

      $SPARK_INSTALL/bin/spark-shell \
        --jars /var/scripts/hudi_docker_compose_attached_file/jar/hoodie-hive-sync-bundle.jar \
        --master local[2] \
        --driver-class-path $HADOOP_CONF_DIR \
        --conf spark.sql.hive.convertMetastoreParquet=false \
        --deploy-mode client \
        --driver-memory 1G \
        --executor-memory 3G \
        --num-executors 1

      spark.sql("show tables").show(100, false)
      spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
      spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG'").show(100, false)
      spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
      spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
      spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG'").show(100, false)
      :q
      ```

      For a more detailed usage guide, see the [Hudi documentation](https://hudi.apache.org/docs/docker_demo).

   10. Trino

       You can find the Trino-related Docker Compose files in `docker/thirdparties/docker-compose/trino`.

       Template files:

       * `gen_env.sh.tpl`: Generates the HDFS-related port numbers; no modification required, but you can change the port numbers in the case of port conflicts.
       * `hive.properties.tpl`: Configures the Trino catalog; no modification required.
       * `trino_hive.env.tpl`: Environment configurations for Hive; no modification required.
       * `trino_hive.yaml.tpl`: Docker Compose file; no modification required.

       After the Trino Docker environment is started, a Trino + Hive catalog will be configured, so Trino will have two catalogs:

       1. Hive
       2. TPCH (built into the Trino Docker image)

       For a more detailed usage guide, see the [Trino documentation](https://trino.io/docs/current/installation/containers.html).

2. Run the regression tests

   Regression tests for external data sources are disabled by default. You can enable them by modifying the following configurations in `regression-test/conf/regression-conf.groovy`:

   * `enableJdbcTest`: Enables tests for JDBC external tables. This requires the MySQL and PostgreSQL containers to be started.
     * `mysql_57_port` and `pg_14_port`: The external ports of MySQL and PostgreSQL, respectively. Defaults: 3316 and 5442.
   * `enableHiveTest`: Enables tests for Hive external tables. This requires the Hive container to be started.
     * `hms_port`: The external port for the Hive metastore. Default: 9183.
   * `enableEsTest`: Enables tests for Elasticsearch external tables. This requires the Elasticsearch container to be started.
     * `es_6_port`: Port for Elasticsearch 6.
     * `es_7_port`: Port for Elasticsearch 7.
     * `es_8_port`: Port for Elasticsearch 8.