This is an automated email from the ASF dual-hosted git repository.

zykkk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
     new 89fff98ced [docs](docs)Update spark_load.md (#22428)
89fff98ced is described below

commit 89fff98ced631a681408b0114df45ae9390b7595
Author: Hu Yanjun <100749531+httpshir...@users.noreply.github.com>
AuthorDate: Fri Aug 4 19:22:52 2023 +0800

    [docs](docs)Update spark_load.md (#22428)
---
 docs/en/community/design/spark_load.md             | 220 +++++-
 .../developer-guide/regression-testing.md          | 836 ++++++++++++++++++++-
 2 files changed, 1053 insertions(+), 3 deletions(-)

diff --git a/docs/en/community/design/spark_load.md b/docs/en/community/design/spark_load.md
index 9e630f4393..7e5b9b155a 100644
--- a/docs/en/community/design/spark_load.md
+++ b/docs/en/community/design/spark_load.md
@@ -24,6 +24,222 @@ specific language governing permissions and limitations under the License. -->

# Implementation Design for Spark Load

## Background

Doris supports various data ingestion methods, including Broker Load, Routine Load, Stream Load, and Mini Batch Load.

Spark Load is the optimal choice for first-time data ingestion into Doris. It is designed for moving large volumes of data and can speed up ingestion considerably.

## Basic Concepts

* FE: Frontend, the frontend nodes of Doris. They process requests from clients, manage metadata and the cluster, and generate query execution plans.
* BE: Backend, the backend nodes of Doris. They store and manage data and execute query plans.
* Tablet: A table in Doris is divided into tablets.
* DPP: Data preprocessing. Transforming, cleaning, partitioning, sorting, and aggregating data using external computing resources (Hadoop, Spark).

## Design

### Objective

Doris itself is relatively slow at batch imports of 100 GB or larger, and since it has not implemented read/write separation, such batch imports can drive CPU usage high. For users who need to migrate large data volumes, Doris provides the Spark Load option. It leverages the data processing and concurrency capabilities of Spark clusters to perform ETL, sorting, and aggregation upon data ingestion. In this way, users can migrate huge datasets faster and at lower cost.

The Spark Load method must be compatible with the multiple deployment modes of Spark. The first step is to support YARN clusters. Meanwhile, considering the variety of data formats, it needs to support multiple kinds of data files, including CSV, Parquet, and ORC.

### Implementation Plans

When designing the implementation of Spark Load, we take our existing data loading framework into account.

#### Plan A

For ease of development, we hope to reuse the existing data loading framework as much as possible, so this is the overall plan for implementing Spark Load:

The Spark Load statement input by the user will be parsed to produce a LoadStmt. An identifier field `isSparkLoad` will be added to the LoadStmt. If it is true, the system will create a SparkLoadJob (similar to how a BrokerLoadJob is created). The job will be executed by a state machine mechanism. In the PENDING stage, a SparkLoadPendingTask will be created; in the LOADING stage, a LoadLoadingTask will be created for data loading. Doris reuses its existing data loading framework and executes the [...]

A few considerations for this plan:

##### Syntax

The Spark Load statement is similar to that of Broker Load.
This is an example:

```
LOAD LABEL example_db.label1
(
    DATA INFILE("hdfs://hdfs_host:hdfs_port/user/palo/data/input/file")
    NEGATIVE
    INTO TABLE `my_table`
    PARTITION (p1, p2)
    COLUMNS TERMINATED BY ","
    columns(k1,k2,k3,v1,v2)
    set (
        v3 = v1 + v2,
        k4 = hll_hash(k2)
    )
    where k1 > 20
)
with spark.cluster_name
PROPERTIES
(
    "spark.master" = "yarn",
    "spark.executor.cores" = "5",
    "spark.executor.memory" = "10g",
    "yarn.resourcemanager.address" = "xxx.tc:8032",
    "max_filter_ratio" = "0.1"
);
```

`spark.cluster_name` is the name of the Spark cluster used for the load. It can be set via SET PROPERTY, and users may refer to the configurations of the original Hadoop cluster. The Spark cluster configurations in PROPERTIES will overwrite those in `spark.cluster_name`.

Explanation of the properties:

- spark.master: the Spark cluster deployment mode. Supported modes include yarn/standalone/local/k8s. We will prioritize support for yarn and implement the yarn-cluster mode first. (The yarn-client mode is mainly used in interactive scenarios.)
- spark.executor.cores: the number of CPU cores per executor
- spark.executor.memory: the memory size per executor
- yarn.resourcemanager.address: the address of the ResourceManager of the specified YARN cluster
- max_filter_ratio: the upper limit of the filter ratio

##### SparkLoadJob

After a user sends a Spark Load statement, the statement will be parsed and a SparkLoadJob will be created.

```
SparkLoadJob:
         +-------+-------+
         |    PENDING    |-----------------|
         +-------+-------+                 |
                 | SparkLoadPendingTask    |
                 v                         |
         +-------+-------+                 |
         |    LOADING    |-----------------|
         +-------+-------+                 |
                 | LoadLoadingTask         |
                 v                         |
         +-------+-------+                 |
         |   COMMITTED   |-----------------|
         +-------+-------+                 |
                 |                         |
                 v                         v
         +-------+-------+         +-------+-------+
         |   FINISHED    |         |   CANCELLED   |
         +-------+-------+         +-------+-------+
                 |                         Λ
                 +-------------------------+
```

The above is the execution process of a SparkLoadJob.

##### SparkLoadPendingTask

A SparkLoadPendingTask commits a Spark ETL job to the Spark cluster. Since Spark supports various deployment modes (local, standalone, yarn, k8s), we need a generic interface abstraction for Spark ETL. The main tasks include:

- Commit a Spark ETL job
- Cancel a Spark ETL job
- Obtain the status of a Spark ETL job

A prototype of the interface:

```
class SparkEtlJob {
    // Commit a Spark ETL job and return the job id
    String submitJob(TBrokerScanRangeParams params);

    // Cancel a Spark ETL job
    boolean cancelJob(String jobId);

    // Obtain the status of a Spark ETL job
    JobStatus getJobStatus(String jobId);

    private List<DataDescription> dataDescriptions;
}
```

We plan to implement different sub-classes for the different cluster deployment modes, starting with SparkEtlJobForYarn to support Spark Load on YARN clusters. The JobId in the above snippet is the appid assigned by the YARN cluster. One way to obtain the appid is to commit the Spark job via the spark-submit client, analyze the standard error output, and retrieve the appid by text matching.
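To illustrate that approach, here is a minimal sketch of scraping the appid from spark-submit's standard error. The class name, the process handling, and the exact log wording are assumptions for demonstration, not the actual Doris implementation; YARN application ids do follow the `application_<clusterTimestamp>_<sequence>` pattern:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: launch spark-submit and scrape the YARN appid
// (e.g. "application_1690000000000_0001") from its standard error output.
public class SparkSubmitAppIdScraper {
    // YARN application ids follow the pattern "application_<clusterTimestamp>_<sequence>"
    private static final Pattern APP_ID = Pattern.compile("(application_\\d+_\\d+)");

    public static String submitAndGetAppId(String... sparkSubmitCmd) throws Exception {
        Process proc = new ProcessBuilder(sparkSubmitCmd).start();
        try (BufferedReader stderr = new BufferedReader(
                new InputStreamReader(proc.getErrorStream()))) {
            String line;
            while ((line = stderr.readLine()) != null) {
                Matcher m = APP_ID.matcher(line);
                if (m.find()) {
                    return m.group(1);   // first appid seen in stderr
                }
            }
        }
        throw new IllegalStateException("no YARN appid found in spark-submit output");
    }
}
```

Such text matching is fragile against log-format changes, so hiding it behind the `SparkEtlJob` abstraction keeps the rest of the system insulated.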
However, based on lessons drawn from Hadoop DPP jobs, we need to take care of job queuing issues: due to data volume or cluster queue reasons, the number of concurrent loading jobs might hit the limit, and subsequent job commits might fail. One solution is to separately set a limit on the number of concurrent Spark Load jobs and a per-user concurrency limit, so as to avoid interference between jobs of different users.

Additionally, implementing a job scheduling system that prioritizes jobs based on their importance and allocates resources accordingly can ensure that critical jobs are processed first and that resources are used efficiently.
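A minimal sketch of such a per-user limit follows; the class, its fields, and the limit-checking policy are hypothetical illustrations of the idea, not part of the Doris code base:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: gate Spark Load job submission by a global and a per-user limit.
class SparkLoadJobScheduler {
    private final int maxConcurrentJobs;   // cluster-wide limit
    private final int maxJobsPerUser;      // per-user limit
    private final AtomicInteger runningJobs = new AtomicInteger(0);
    private final Map<String, AtomicInteger> runningJobsByUser = new ConcurrentHashMap<>();

    SparkLoadJobScheduler(int maxConcurrentJobs, int maxJobsPerUser) {
        this.maxConcurrentJobs = maxConcurrentJobs;
        this.maxJobsPerUser = maxJobsPerUser;
    }

    // Returns true if the job may be submitted now; otherwise the caller should queue or reject it.
    boolean tryAcquire(String user) {
        AtomicInteger userCount = runningJobsByUser.computeIfAbsent(user, u -> new AtomicInteger(0));
        int global = runningJobs.incrementAndGet();
        int perUser = userCount.incrementAndGet();
        if (global > maxConcurrentJobs || perUser > maxJobsPerUser) {
            // over either limit: roll back both counters and refuse
            userCount.decrementAndGet();
            runningJobs.decrementAndGet();
            return false;
        }
        return true;
    }

    // Must be called when a job finishes or is cancelled.
    void release(String user) {
        runningJobsByUser.get(user).decrementAndGet();
        runningJobs.decrementAndGet();
    }
}
```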
Key steps of executing a Spark ETL job include:

1. Type conversion (extraction/transformation)

   Convert the fields in the source file into the proper column types (check whether the fields are legal, prepare for function computation, etc.)

2. Function computation (transformation), including negative computation

   Complete the computation of the specified functions. Function list: "strftime", "time_format", "alignment_timestamp", "default_value", "md5sum", "replace_value", "now", "hll_hash", "substitute".

3. "columns from path" extraction

4. "where" filtering

5. Data partitioning and bucketing

6. Data sorting and pre-aggregation

   OlapTableSink already performs data sorting and aggregation, so logically we don't need to implement them separately, **but we have decided to include them in the Spark ETL job so that BE can skip these steps during loading and thus work more efficiently.** We might change that based on future performance tests.

   Another nut to crack is the global dictionary for bitmap columns, since bitmap columns of string type require a global dictionary.

   To indicate whether data sorting and aggregation have been completed, we plan to generate a job.json description file upon job completion, which includes the following properties:

   ```
   {
       "is_segment_file" : "false",
       "is_sort" : "true",
       "is_agg" : "true"
   }
   ```

   Explanation:
   - is_sort: whether the data is sorted
   - is_agg: whether the data is aggregated
   - is_segment_file: whether the generated files are segment files

7. Since the computation of a Rollup is based on the base table, to optimize the generation of Rollups, we need to look into the index hierarchy.

The tricky part is supporting expression calculation for columns.

Lastly, for the storage format of the output files of a finished Spark ETL job, we plan to support CSV, Parquet, and ORC, and make Parquet the default option for its higher storage efficiency.
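To make the key steps above concrete, here is a minimal sketch of the type-conversion, filtering, bucketing, sorting, and pre-aggregation stages using the Spark DataFrame API. The schema, bucket count, aggregation function, and output path are assumptions borrowed from the earlier LOAD example; this is an illustration, not the actual Doris ETL job:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

// Hypothetical sketch of the Spark ETL stages: read, convert types,
// compute derived columns, filter, bucket by the distribution key,
// then sort and pre-aggregate within buckets.
public class SparkEtlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("doris-spark-load-etl").getOrCreate();

        Dataset<Row> src = spark.read()
                .option("sep", ",")
                .csv("hdfs://hdfs_host:hdfs_port/user/palo/data/input/file")  // path from the LOAD example
                .toDF("k1", "k2", "k3", "v1", "v2");

        Dataset<Row> transformed = src
                .withColumn("k1", col("k1").cast("int"))          // 1. type conversion
                .withColumn("v1", col("v1").cast("bigint"))
                .withColumn("v2", col("v2").cast("bigint"))
                .withColumn("v3", col("v1").plus(col("v2")))      // 2. function computation (the set clause)
                .filter(col("k1").gt(20));                        // 4. "where" filtering

        // 5./6. bucket by the distribution key, pre-aggregate, and sort within buckets
        Dataset<Row> result = transformed
                .withColumn("bucket", pmod(hash(col("k1")), lit(32)))  // assume 32 buckets
                .groupBy("bucket", "k1", "k2", "k3")
                .agg(sum("v3").as("v3"))                          // assume a SUM aggregate column
                .repartition(col("bucket"))
                .sortWithinPartitions("bucket", "k1");

        result.write().partitionBy("bucket").parquet("hdfs://.../output");  // Parquet as the default output
        spark.stop();
    }
}
```

In the real job, step 6 would also need the global dictionary for bitmap columns of string type, which a plain `groupBy` like this does not address.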
##### LoadLoadingTask

For the LoadLoadingTask, we can reuse our existing logic. What's different from BrokerLoadJob is that a LoadLoadingTask doesn't need to include column mapping, function computation, negative import, filtering, or aggregation, because they have all been done by the SparkEtlTask. A LoadLoadingTask only needs simple column mapping and type conversion.

##### Data loading to BE

For this part, we can fully reuse the existing framework.

#### Plan B

Plan A can reuse our data loading framework to a great extent and realize support for large data imports as soon as possible. The downside is that the data processed by the Spark ETL job is already partitioned into tablets, but our existing Broker Load framework will still partition and bucket the data, serialize it, and then send it to the target BE nodes via RPC. This process generates extra overhead.

So we have a second plan. Plan B is to have the SparkEtlJob generate segment files in the Doris storage format, and then have the three replicas complete file loading via the add_rowset interface under a clone-like mechanism. The difference between Plan B and Plan A is that, in Plan B, you need to:

1. Add a tablet id suffix to the generated files.
2. Add an interface to the SparkLoadPendingTask class: `Map<Long, Pair<String, Long>> getFilePathMap()`, which returns the mapping between tablet ids and files.
3. Add a "spark_push" interface to the BE RPC service, in order to pull the transformed files after ETL to the local machine (read via Broker) and then load the data via the "add_rowset" interface using clone-like logic.
4. Generate a new loading task: SparkLoadLoadingTask. The task reads job.json files, parses the properties, and uses the properties as RPC parameters. It then calls the "spark_push" interface and sends a data loading request to the BE where the tablet is located. The "spark_push" interface in BE decides what to do based on the "is_segment_file" parameter: if it is true, it downloads the segment files and adds a rowset; if it is false, it follows the pusher logic and executes data loading.

Plan B also moves the segment file generation workload to the Spark cluster, which can greatly reduce the burden on the Doris cluster and thus improve efficiency. However, Plan B relies on packaging the underlying rowset and segment v2 interfaces into independent shared object (SO) files, and on using Spark to call these interfaces to transform data into segment files.
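The job.json contract described in step 4 could be consumed on the loading side roughly as follows. This is a hypothetical sketch with made-up class names (using the Gson library for brevity), not the actual SparkLoadLoadingTask:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import com.google.gson.Gson;

// Hypothetical sketch: read job.json and choose between the segment path
// (add_rowset) and the pusher path, as Plan B describes.
public class SparkLoadJobDescriptor {
    static class JobDesc {
        String is_segment_file;
        String is_sort;
        String is_agg;
    }

    public static void main(String[] args) throws Exception {
        String json = new String(Files.readAllBytes(Paths.get("job.json")));
        JobDesc desc = new Gson().fromJson(json, JobDesc.class);

        if (Boolean.parseBoolean(desc.is_segment_file)) {
            // Segment files were produced by the Spark ETL job:
            // download them and register them via an add_rowset-style interface.
            System.out.println("download segment files, then add_rowset");
        } else {
            // Plain columnar files (e.g. Parquet): fall back to the pusher logic,
            // which converts formats on the BE side during loading.
            System.out.println("follow pusher logic to convert and load");
        }
    }
}
```

The `is_sort` and `is_agg` flags would analogously tell BE whether it can skip sorting and aggregation during loading.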
## Conclusion

Comparing the two plans, Plan A requires the least development work but leaves unnecessary workloads on the BE, while for Plan B we can make use of our existing Hadoop data loading framework. Therefore, we plan to implement Spark Load in two steps.

Step 1: Go with Plan B: complete data partitioning, sorting, and aggregation in Spark and generate Parquet files; then follow the Hadoop pusher process and convert data formats in the BE.

Step 2: Encapsulate the segment-writing library, generate Doris storage formats directly, and add an RPC interface to implement the clone-like data loading logic.

diff --git a/docs/en/community/developer-guide/regression-testing.md b/docs/en/community/developer-guide/regression-testing.md
index c5981fd3ca..46acaba921 100644
--- a/docs/en/community/developer-guide/regression-testing.md
+++ b/docs/en/community/developer-guide/regression-testing.md
@@ -27,4 +27,838 @@ under the License.

# Regression Testing

## Basic Concepts

1. `Suite`: a test case (the file name)
2. `Group`: a test set (the directory that the test case belongs to)
3. `Action`: an encapsulated test action, such as `sql_action` for SQL execution, `test_action` for result verification, and `streamLoad_action` for data ingestion

## Steps

1. Get the clusters ready.
2. Modify the configuration file `${DORIS_HOME}/regression-test/conf/regression-conf.groovy`: set items such as the JDBC URL and user.
3. Create the test case files and write the test cases.
4. If a test case includes a `qt` Action, you also need to create the relevant data files. For example, the case `suites/demo/qt_action.groovy` requires a TSV file `data/demo/qt_action.out` for output verification.
5. Run `${DORIS_HOME}/run-regression-test.sh` to test all cases, or run `${DORIS_HOME}/run-regression-test.sh --run <suiteName>` to test a few cases. For more examples, please refer to the "Startup Script Examples" section on this page.

## Directory Structure

Key files and directories to pay attention to:

1. `run-regression-test.sh`: the startup script
2. `regression-conf.groovy`: the default configurations for regression testing
3. `data`: where the input and output data are
4. `suites`: where the test cases are

```
./${DORIS_HOME}
    |-- **run-regression-test.sh**           Startup script for regression testing
    |-- regression-test
    |   |-- plugins                          Plug-ins
    |   |-- conf
    |   |   |-- logback.xml                  Log configurations
    |   |   |-- **regression-conf.groovy**   Default configurations
    |   |
    |   |-- framework                        Framework source code for regression testing
    |   |-- **data**                         Input and output data files of test cases
    |   |   |-- demo                         Input and output data files of demo
    |   |   |-- correctness                  Input and output data files of correctness test cases
    |   |   |-- performance                  Input and output data files of performance test cases
    |   |   |-- utils                        Input and output data files of other utilities
    |   |
    |   |-- **suites**                       Test cases for regression testing
    |       |-- demo                         Demo test cases
    |       |-- correctness                  Correctness test cases
    |       |-- performance                  Performance test cases
    |       |-- utils                        Other utilities
    |
    |-- output
        |-- regression-test
            |-- log                          Logs of regression testing
```

## Default Configurations of the Framework

Modify the JDBC and FE configurations based on your setup.

```groovy
/* ============ Key parts to pay attention to ============ */
// The default database, which will be created if it does not exist
defaultDb = "regression_test"

// JDBC configurations
jdbcUrl = "jdbc:mysql://127.0.0.1:9030/?"
jdbcUser = "root"
jdbcPassword = ""

// FE address configurations, used for Stream Load
feHttpAddress = "127.0.0.1:8030"
feHttpUser = "root"
feHttpPassword = ""

/* ============ Configurations that usually do not require modifications ============ */

// DORIS_HOME is loaded by run-regression-test.sh
// which is java -DDORIS_HOME=./

// Set the path for the test cases
suitePath = "${DORIS_HOME}/regression-test/suites"
// Set the path for the input and output data
dataPath = "${DORIS_HOME}/regression-test/data"
// Set the path for the plug-ins
pluginPath = "${DORIS_HOME}/regression-test/plugins"

// By default, all groups will be read. Groups are separated by commas, e.g. "demo,performance".
// This can be dynamically specified and overwritten via run-regression-test.sh --run -g
testGroups = ""
// By default, all suites will be read. This can be dynamically specified and overwritten via run-regression-test.sh --run -s
testSuites = ""
// Directories specified in this parameter will be loaded by default. It can be dynamically specified and overwritten via run-regression-test.sh --run -d
testDirectories = ""

// Groups specified in this parameter will be excluded. It can be dynamically specified and overwritten via run-regression-test.sh --run -xg
excludeGroups = ""
// Suites specified in this parameter will be excluded. It can be dynamically specified and overwritten via run-regression-test.sh --run -xs
excludeSuites = ""
// Directories specified in this parameter will be excluded. It can be dynamically specified and overwritten via run-regression-test.sh --run -xd
excludeDirectories = ""

// Other user-defined configurations
customConf1 = "test_custom_conf_value"
```

## Steps to Write Test Cases

1. Enter the `${DORIS_HOME}/regression-test` directory.
2. Choose a directory based on the purpose of the test: place correctness tests in `suites/correctness` and performance tests in `suites/performance`.
3. Create a Groovy test case file and add a few `Action`s for testing.

## Action

An `Action` is a test action defined via a DSL and provided by the test framework.

### sql action

A `sql_action` is used to commit SQL and obtain results. If the query fails, an error will be thrown.

Parameters:

- String sql: the input SQL string
- `return List<List<Object>>`: the query result. For DDL/DML, a result of one row and one column will be returned, and the value is the updated row count.

You can find the following sample code in `${DORIS_HOME}/regression-test/suites/demo/sql_action.groovy`:

```groovy
suite("sql_action", "demo") {
    // execute sql and ignore result
    sql "show databases"

    // execute sql and get result, outer List denotes rows, inner List denotes columns in a single row
    List<List<Object>> tables = sql "show tables"

    // assertXxx() will invoke junit5's Assertions.assertXxx() dynamically
    assertTrue(tables.size() >= 0) // test rowCount >= 0

    // syntax error
    try {
        sql "a b c d e"
        throw new IllegalStateException("Should be syntax error")
    } catch (java.sql.SQLException t) {
        assertTrue(true)
    }

    def testTable = "test_sql_action1"

    try {
        sql "DROP TABLE IF EXISTS ${testTable}"

        // multi-line sql
        def result1 = sql """
            CREATE TABLE IF NOT EXISTS ${testTable} (
                id int
            )
            DISTRIBUTED BY HASH(id) BUCKETS 1
            PROPERTIES (
              "replication_num" = "1"
            )
            """

        // DDL/DML return 1 row and 1 column, the only value is update row count
        assertTrue(result1.size() == 1)
        assertTrue(result1[0].size() == 1)
        assertTrue(result1[0][0] == 0, "Create table should update 0 rows")

        def result2 = sql "INSERT INTO test_sql_action1 values(1), (2), (3)"
        assertTrue(result2.size() == 1)
        assertTrue(result2[0].size() == 1)
        assertTrue(result2[0][0] == 3, "Insert should update 3 rows")
    } finally {
        /**
         * try_xxx(args) means:
         *
         * try {
         *     return xxx(args)
         * } catch (Throwable t) {
         *     // do nothing
         *     return null
         * }
         */
        try_sql("DROP TABLE IF EXISTS ${testTable}")

        // you can see the error sql will not throw exception and return
        try {
            def errorSqlResult = try_sql("a b c d e f g")
            assertTrue(errorSqlResult == null)
        } catch (Throwable t) {
            assertTrue(false, "Never catch exception")
        }
    }

    // order_sql(sqlStr) equals to sql(sqlStr, isOrder=true)
    // sort result by string dict
    def list = order_sql """
                select 2
                union all
                select 1
                union all
                select null
                union all
                select 15
                union all
                select 3
                """

    assertEquals(null, list[0][0])
    assertEquals(1, list[1][0])
    assertEquals(15, list[2][0])
    assertEquals(2, list[3][0])
    assertEquals(3, list[4][0])
}
```

### qt action

A `qt_action` is used to commit SQL and verify the output against the corresponding `.out` TSV file.

- String sql: the input SQL string
- return void

You can find the following sample code in `${DORIS_HOME}/regression-test/suites/demo/qt_action.groovy`:

```groovy
suite("qt_action", "demo") {
    /**
     * qt_xxx sql equals to quickTest(xxx, sql), in which xxx is the tag.
     * The result will be compared to the relevant file: ${DORIS_HOME}/regression_test/data/qt_action.out.
     *
     * If you want to generate the .out tsv file, you can run with -genOut or -forceGenOut.
     * e.g.
     *   ${DORIS_HOME}/run-regression-test.sh --run qt_action -genOut
     *   ${DORIS_HOME}/run-regression-test.sh --run qt_action -forceGenOut
     */
    qt_select "select 1, 'beijing' union all select 2, 'shanghai'"

    qt_select2 "select 2"

    // Order result by string dict then compare to .out file.
    // order_qt_xxx sql equals to quickTest(xxx, sql, true).
    order_qt_union_all """
                select 2
                union all
                select 1
                union all
                select null
                union all
                select 15
                union all
                select 3
                """
}
```

### test action

A `test_action` is used for tests with more complicated verification rules, such as verifying the number of rows, the execution time, and whether an error is thrown.

Parameters:

- String sql: the input SQL string
- `List<List<Object>> result`: provide a list to check whether the output is equal to that list
- `Iterator<Object> resultIterator`: provide an iterator to check whether the output is equal to that iterator
- String resultFile: provide a file URI (a relative path to a local file, or an http(s) path) to check whether the output is equal to the file content. The format is similar to that of the `.out` file, but without header or comments.
- String exception: check whether the thrown error contains certain strings
- long rowNum: check the number of rows
- long time: check whether the execution time is shorter than this value, measured in milliseconds
- `Closure<List<List<Object>>, Throwable, Long, Long> check`: a user-defined callback validation function, which receives the query result, the thrown error, and the start and end times. When a callback validation function is declared, the other verification methods are ignored.

You can find the following sample code in `${DORIS_HOME}/regression-test/suites/demo/test_action.groovy`:

```groovy
suite("test_action", "demo") {
    test {
        sql "abcdefg"
        // check exception message contains
        exception "errCode = 2, detailMessage = Syntax error"
    }

    test {
        sql """
            select *
            from (
                select 1 id
                union all
                select 2
            ) a
            order by id"""

        // multi check condition

        // check return 2 rows
        rowNum 2
        // execute time must <= 5000 millisecond
        time 5000
        // check result, must be 2 rows and 1 column, the first row is 1, second is 2
        result(
            [[1], [2]]
        )
    }

    test {
        sql "a b c d e f g"

        // other check will not work because already declared a check callback
        exception "aaaaaaaaa"

        // callback
        check { result, exception, startTime, endTime ->
            // assertXxx() will invoke junit5's Assertions.assertXxx() dynamically
            assertTrue(exception != null)
        }
    }

    test {
        sql """
                select 2
                union all
                select 1
                union all
                select null
                union all
                select 15
                union all
                select 3
                """

        check { result, ex, startTime, endTime ->
            // same as order_sql(sqlStr)
            result = sortRows(result)

            assertEquals(null, result[0][0])
            assertEquals(1, result[1][0])
            assertEquals(15, result[2][0])
            assertEquals(2, result[3][0])
            assertEquals(3, result[4][0])
        }
    }

    // execute sql and order query result, then compare to iterator
    def selectValues = [1, 2, 3, 4]
    test {
        order true
        sql selectUnionAll(selectValues)
        resultIterator(selectValues.iterator())
    }

    // compare to data/demo/test_action.csv
    test {
        order true
        sql selectUnionAll(selectValues)

        // you can set to http://xxx or https://xxx
        // and compare to http response body
        resultFile "test_action.csv"
    }
}
```

### explain action

An `explain_action` is used to check whether the string returned by `EXPLAIN` contains certain strings.
Parameters:

- String sql: the SQL to be explained
- String contains: check whether certain strings are included. You can check for multiple strings at a time.
- String notContains: check whether certain strings are not included. You can check for multiple strings at a time.
- `Closure<String> check`: a user-defined callback validation function, by which you can obtain the returned string. When a callback validation function is declared, the other verification methods are ignored.
- `Closure<String, Throwable, Long, Long> check`: a user-defined callback validation function, by which you can additionally obtain the thrown error and the start and end times

You can find the following sample code in `${DORIS_HOME}/regression-test/suites/demo/explain_action.groovy`:

```groovy
suite("explain_action", "demo") {
    explain {
        sql("select 100")

        // contains("OUTPUT EXPRS:<slot 0> 100\n") && contains("PARTITION: UNPARTITIONED\n")
        contains "OUTPUT EXPRS:<slot 0> 100\n"
        contains "PARTITION: UNPARTITIONED\n"
    }

    explain {
        sql("select 100")

        // contains(" 100\n") && !contains("abcdefg") && !contains("1234567")
        contains " 100\n"
        notContains "abcdefg"
        notContains "1234567"
    }

    explain {
        sql("select 100")
        // simple callback
        check { explainStr -> explainStr.contains("abcdefg") || explainStr.contains(" 100\n") }
    }

    explain {
        sql("a b c d e")
        // callback with exception and time
        check { explainStr, exception, startTime, endTime ->
            // assertXxx() will invoke junit5's Assertions.assertXxx() dynamically
            assertTrue(exception != null)
        }
    }
}
```

### streamLoad action

A `streamLoad_action` is used to ingest data.

Parameters:

- String db: the database; defaults to the defaultDb set in regression-conf.groovy
- String table: the table name
- String file: the path of the file to be loaded. It can be a relative path under the data directory, or an HTTP URL.
- `Iterator<List<Object>> inputIterator`: the iterator to be loaded
- String inputText: the text to be loaded (rarely used)
- InputStream inputStream: the stream to be loaded (rarely used)
- long time: check whether the execution time is shorter than this value, measured in milliseconds
- void set(String key, String value): set an HTTP request header for the Stream Load, such as label or columnSeparator
- `Closure<String, Throwable, Long, Long> check`: a user-defined callback validation function, by which you can obtain the returned result, the thrown error, and the start and end times. When a callback validation function is declared, the other verification methods are ignored.

You can find the following sample code in `${DORIS_HOME}/regression-test/suites/demo/streamLoad_action.groovy`:

```groovy
suite("streamLoad_action", "demo") {

    def tableName = "test_streamload_action1"

    sql """
        CREATE TABLE IF NOT EXISTS ${tableName} (
            id int,
            name varchar(255)
        )
        DISTRIBUTED BY HASH(id) BUCKETS 1
        PROPERTIES (
          "replication_num" = "1"
        )
        """

    streamLoad {
        // you can skip declaring db, because a default db is already specified in ${DORIS_HOME}/conf/regression-conf.groovy
        // db 'regression_test'
        table tableName

        // default label is UUID:
        // set 'label' UUID.randomUUID().toString()

        // default column_separator is specified in doris fe config, usually '\t'.
        // this line changes it to ','
        set 'column_separator', ','

        // relates to ${DORIS_HOME}/regression-test/data/demo/streamload_input.csv.
        // also, you can stream load an http stream, e.g. http://xxx/some.csv
        file 'streamload_input.csv'

        time 10000 // limit inflight 10s

        // stream load action will check the result, including Success status and NumberTotalRows == NumberLoadedRows
    }


    // stream load 100 rows
    def rowCount = 100
    // range: [0, rowCount)
    // or rangeClosed: [0, rowCount]
    def rowIt = range(0, rowCount)
            .mapToObj({i -> [i, "a_" + i]}) // map a Long to a List of [Long, String]
            .iterator()

    streamLoad {
        table tableName
        // also, you can upload a memory iterator
        inputIterator rowIt

        // if a check callback is declared, the default check condition will be ignored,
        // so you must check all conditions yourself
        check { result, exception, startTime, endTime ->
            if (exception != null) {
                throw exception
            }
            log.info("Stream load result: ${result}".toString())
            def json = parseJson(result)
            assertEquals("success", json.Status.toLowerCase())
            assertEquals(json.NumberTotalRows, json.NumberLoadedRows)
            assertTrue(json.NumberLoadedRows > 0 && json.LoadBytes > 0)
        }
    }
}
```

### Other Actions

Other actions include thread, lazyCheck, events, connect, and selectUnionAll. You can find examples for them in `${DORIS_HOME}/regression-test/suites/demo`.

## Startup Script Examples

```shell
# View parameter descriptions for the script
./run-regression-test.sh -h

# View parameter descriptions for the framework
./run-regression-test.sh --run -h

# Test all cases
./run-regression-test.sh

# Delete test framework compilation results and test logs
./run-regression-test.sh --clean

# Test the suite named "sql_action". Currently, the test case file name (without suffix) is the same as the suite name, so the corresponding test case file here is sql_action.groovy.
./run-regression-test.sh --run sql_action

# Test the suites whose names contain 'sql'. **Remember to use single quotation marks.**
./run-regression-test.sh --run '*sql*'

# Test the "demo" and "performance" groups
./run-regression-test.sh --run -g 'demo,performance'

# Test the suite "sql_action" in the "demo" group
./run-regression-test.sh --run -g demo -s sql_action

# Test the suite "sql_action" in the demo directory
./run-regression-test.sh --run -d demo -s sql_action

# Test the suites under the demo directory except the "sql_action" suite
./run-regression-test.sh --run -d demo -xs sql_action

# Exclude the test cases in the demo directory
./run-regression-test.sh --run -xd demo

# Exclude the test cases in the "demo" group
./run-regression-test.sh --run -xg demo

# User-defined configurations
./run-regression-test.sh --run -conf a=b

# Parallel execution
./run-regression-test.sh --run -parallel 5 -suiteParallel 10 -actionParallel 20
```

## Generate `.out` File Based on the Query Result

```shell
# Automatically generate an .out file for the sql_action test case based on the query result. Skipped if the .out file already exists.
./run-regression-test.sh --run sql_action -genOut

# Automatically generate an .out file for the sql_action test case based on the query result. If an .out file already exists, it will be overwritten.
./run-regression-test.sh --run sql_action -forceGenOut
```

## Suite Plug-in

Sometimes there is a need to extend the Suite class, but it is not easy to change its source code. In this case, you can use plug-ins. The default directory for plug-ins is `${DORIS_HOME}/regression-test/plugins`, in which you can extend the Suite class with a Groovy script.
For example, this is how `plugin_example.groovy` adds a `testPlugin` function for log printing:

```groovy
import org.apache.doris.regression.suite.Suite

// Register `testPlugin` function to Suite,
// and invoke it in ${DORIS_HOME}/regression-test/suites/demo/test_plugin.groovy
Suite.metaClass.testPlugin = { String info /* param */ ->

    // which suite invokes the current function?
    Suite suite = delegate as Suite

    // function body
    suite.getLogger().info("Test plugin: suiteName: ${suite.name}, info: ${info}".toString())

    // optional return value
    return "OK"
}

logger.info("Added 'testPlugin' function to Suite")
```

After adding the `testPlugin` function, you can use it in a regular test case. For example, in `${DORIS_HOME}/regression-test/suites/demo/test_plugin.groovy`:

```groovy
suite("test_plugin", "demo") {
    // testPlugin function is registered in ${DORIS_HOME}/regression-test/plugins/plugin_example.groovy
    def result = testPlugin("message from suite")
    assertEquals("OK", result)
}
```

## CI/CD Support

### TeamCity

TeamCity can recognize service messages via stdout. If you start the regression test framework with the `--teamcity` parameter, TeamCity service messages will be printed to stdout. TeamCity will automatically read the event logs in stdout and show `Tests` in the current pipeline, including the tests and their logs. Therefore, you only need the following command to start the regression test framework. In the following snippet, `-Dteamcity.enableStdErr=fal [...]

```shell
JAVA_OPTS="-Dteamcity.enableStdErr=${enableStdErr}" ./run-regression-test.sh --teamcity --run
```

## E2E Test with External Data Sources

Doris supports queries on external data sources, so the regression testing framework allows users to build external data sources with Docker Compose and run end-to-end tests against them.

0. Preparation

   To begin with, modify `CONTAINER_UID` in `docker/thirdparties/custom_settings.env`, for example, to `doris-10002-18sda1-`. The startup scripts will substitute this value into the corresponding names in the Docker Compose files to keep container names consistent and distinct when multiple container environments coexist.

1. Start the Containers

   So far, Doris supports Docker Compose setups for the following data sources: Elasticsearch, MySQL, PostgreSQL, Hive, SQLServer, Oracle, Iceberg, Hudi, and Trino. The relevant files can be found in the directory `docker/thirdparties/docker-compose`.

   By default, you can use the following command to start the Docker containers for all external data sources:

   (Note that for the Hive and Hudi containers, you also need to download the pre-built data files first. See the relevant sections on Hive and Hudi below.)

   ```
   cd docker/thirdparties && sh run-thirdparties-docker.sh
   ```

   Executing this command requires root or sudo privileges. If the command returns successfully, all containers are started. You can check on them by running `docker ps -a`.

   To stop all containers, use the following command:

   ```
   cd docker/thirdparties && sh run-thirdparties-docker.sh --stop
   ```

   To start or stop specific components only, use the following commands:

   ```
   cd docker/thirdparties
   # Start MySQL
   sh run-thirdparties-docker.sh -c mysql
   # Start MySQL, PostgreSQL, Iceberg
   sh run-thirdparties-docker.sh -c mysql,pg,iceberg
   # Stop MySQL, PostgreSQL, Iceberg
   sh run-thirdparties-docker.sh -c mysql,pg,iceberg --stop
   ```
   1. MySQL

      You can find the MySQL-related Docker Compose files in `docker/thirdparties/docker-compose/mysql`.

      * `mysql-5.7.yaml.tpl`: Docker Compose file template, no modification required. The default username and password are root/123456.
      * `mysql-5.7.env`: Configuration file, in which you can configure the external port of the MySQL container (default: 3316).
      * `init/`: SQL files in this directory will be executed automatically once the container is created.
      * `data/`: This local data directory will be mounted into the container after startup. The `run-thirdparties-docker.sh` script clears and rebuilds this directory on every start.

   2. PostgreSQL

      You can find the PostgreSQL-related Docker Compose files in `docker/thirdparties/docker-compose/postgresql`.

      * `postgresql-14.yaml.tpl`: Docker Compose file template, no modification required. The default username and password are postgres/123456.
      * `postgresql-14.env`: Configuration file, in which you can configure the external port of the PostgreSQL container (default: 5442).
      * `init/`: SQL files in this directory will be executed automatically once the container is created. By default, that includes the creation of databases and tables and some data input.
      * `data/`: This local data directory will be mounted into the container after startup. The `run-thirdparties-docker.sh` script clears and rebuilds this directory on every start.

   3. Hive

      You can find the Hive-related Docker Compose files in `docker/thirdparties/docker-compose/hive`.

      * `hive-2x.yaml.tpl`: Docker Compose file template, no modification required.
      * `hadoop-hive.env.tpl`: Configuration file template, no modification required.
      * `gen_env.sh`: The script for initializing the configuration file. In this script, you can modify the two external ports: `FS_PORT` for defaultFs and `HMS_PORT` for the Hive metastore (defaults: 8120 and 9183, respectively). This script is called whenever `run-thirdparties-docker.sh` starts.
      * The `scripts/` directory will be mounted into the container after startup. Files in this directory require no modification. Note that you need to download the pre-built files before starting the container:

        Download the file from `https://doris-build-hk-1308700295.cos.ap-hongkong.myqcloud.com/regression/load/tpch1_parquet/tpch1.db.tar.gz` to the `scripts/` directory and decompress it.

   4. Elasticsearch

      You can find three Docker images (for Elasticsearch 6, Elasticsearch 7, and Elasticsearch 8) in `docker/thirdparties/docker-compose/elasticsearch/`.

      * `es.yaml.tpl`: Docker Compose file template covering all three versions, no modification required.
      * `es.env`: Configuration file, in which you need to configure the port numbers for Elasticsearch.
      * `scripts/`: This directory contains the initialization scripts run after the images start.

   5. Oracle

      You can find the image for Oracle 11 in `docker/thirdparties/docker-compose/oracle/`.

      * `oracle-11.yaml.tpl`: Docker Compose file template, no modification required.
      * `oracle-11.env`: Configure the external port for Oracle (default: 1521).

   6. SQLServer

      You can find the image for SQLServer 2022 in `docker/thirdparties/docker-compose/sqlserver/`.

      * `sqlserver.yaml.tpl`: Docker Compose file template, no modification required.
      * `sqlserver.env`: Configure the external port for SQLServer (default: 1433).
   7. ClickHouse

      You can find the image for ClickHouse 22 in `docker/thirdparties/docker-compose/clickhouse/`.

      * `clickhouse.yaml.tpl`: Docker Compose file template, no modification required.
      * `clickhouse.env`: Configure the external port for ClickHouse (default: 8123).

   8. Iceberg

      You can find the image for Iceberg + Spark + Minio in `docker/thirdparties/docker-compose/iceberg/`.

      * `iceberg.yaml.tpl`: Docker Compose file template, no modification required.
      * `entrypoint.sh.tpl`: Template of the initialization script run after the image starts, no modification required.
      * `spark-defaults.conf.tpl`: Configuration file template for Spark, no modification required.
      * `iceberg.env`: Configuration file for the external ports. You need to modify every external port to avoid conflicts.

      After the image is started, execute the following command to start spark-sql:

      `docker exec -it doris-xx-spark-iceberg spark-sql`

      In this command, `doris-xx-spark-iceberg` is the container name.

      Execution examples for spark-sql with Iceberg:

      ```
      create database db1;
      show databases;
      create table db1.test1(k1 bigint, k2 bigint, k3 string) partitioned by (k1);
      insert into db1.test1 values(1,2,'abc');
      select * from db1.test1;
      quit;
      ```

      You can also access it via spark-shell:

      ```
      docker exec -it doris-xx-spark-iceberg spark-shell

      spark.sql(s"create database db1")
      spark.sql(s"show databases").show()
      spark.sql(s"create table db1.test1(k1 bigint, k2 bigint, k3 string) partitioned by (k1)").show()
      spark.sql(s"show tables from db1").show()
      spark.sql(s"insert into db1.test1 values(1,2,'abc')").show()
      spark.sql(s"select * from db1.test1").show()
      :q
      ```

      For a more detailed usage guide, see the [Tabular documentation](https://tabular.io/blog/docker-spark-and-iceberg/).

   9. Hudi

      You can find the Hudi-related Docker Compose file in `docker/thirdparties/docker-compose/hudi`.

      * `hudi.yaml.tpl`: Docker Compose file template, no modification required.
      * `hadoop.env`: Configuration file template, no modification required.
      * The `scripts/` directory will be mounted into the container after startup. Files in this directory require no modification. Note that you need to download the pre-built files before starting the container:

        Download the file from `https://doris-build-hk-1308700295.cos.ap-hongkong.myqcloud.com/regression/load/hudi/hudi_docker_compose_attached_file.zip` to the `scripts/` directory and decompress it.
      * Before starting, add the following entries to `/etc/hosts` to avoid `UnknownHostException` errors:

        ```
        127.0.0.1 adhoc-1
        127.0.0.1 adhoc-2
        127.0.0.1 namenode
        127.0.0.1 datanode1
        127.0.0.1 hiveserver
        127.0.0.1 hivemetastore
        127.0.0.1 sparkmaster
        ```

      After starting, you can execute the following commands to run a Hive query:

      ```
      docker exec -it adhoc-2 /bin/bash

      beeline -u jdbc:hive2://hiveserver:10000 \
        --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
        --hiveconf hive.stats.autogather=false

      show tables;
      show partitions stock_ticks_mor_rt;
      select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
      select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
      exit;
      ```

      You can also access it via spark-shell:

      ```
      docker exec -it adhoc-1 /bin/bash

      $SPARK_INSTALL/bin/spark-shell \
        --jars /var/scripts/hudi_docker_compose_attached_file/jar/hoodie-hive-sync-bundle.jar \
        --master local[2] \
        --driver-class-path $HADOOP_CONF_DIR \
        --conf spark.sql.hive.convertMetastoreParquet=false \
        --deploy-mode client \
        --driver-memory 1G \
        --executor-memory 3G \
        --num-executors 1

      spark.sql("show tables").show(100, false)
      spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
      spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG'").show(100, false)
      spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
      spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
      spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG'").show(100, false)
      :q
      ```

      For a more detailed usage guide, see the [Hudi documentation](https://hudi.apache.org/docs/docker_demo).

   10. Trino

       You can find the Trino-related Docker Compose files in `docker/thirdparties/docker-compose/trino`.

       Template files:

       * `gen_env.sh.tpl`: Generates the HDFS-related port numbers; no modification required, but you can change the port numbers in the case of port conflicts.
       * `hive.properties.tpl`: Configures the Trino catalog; no modification required.
       * `trino_hive.env.tpl`: Environment configurations for Hive; no modification required.
       * `trino_hive.yaml.tpl`: Docker Compose file; no modification required.

       After the Trino Docker environment is started, a Trino + Hive catalog will be configured, so Trino will have two catalogs:

       1. Hive
       2. TPCH (built into the Trino Docker image)

       For a more detailed usage guide, see the [Trino documentation](https://trino.io/docs/current/installation/containers.html).

2. Run the regression tests

   Regression tests for external data sources are disabled by default. You can enable them by modifying the following configurations in `regression-test/conf/regression-conf.groovy`:

   * `enableJdbcTest`: Enables tests for JDBC external tables. This requires the MySQL and PostgreSQL containers to be started.
     * `mysql_57_port` and `pg_14_port`: The external ports of MySQL and PostgreSQL, respectively. Defaults: 3316 and 5442.
   * `enableHiveTest`: Enables tests for Hive external tables. This requires the Hive container to be started.
     * `hms_port`: The external port for the Hive metastore. Default: 9183.
   * `enableEsTest`: Enables tests for Elasticsearch external tables. This requires the Elasticsearch container to be started.
     * `es_6_port`: Port for Elasticsearch 6.
     * `es_7_port`: Port for Elasticsearch 7.
     * `es_8_port`: Port for Elasticsearch 8.