This is an automated email from the ASF dual-hosted git repository.

zjffdu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/zeppelin.git
The following commit(s) were added to refs/heads/master by this push:
     new b09d57e  [ZEPPELIN-4839]. Update flink interpreter doc
b09d57e is described below

commit b09d57e99f4cbdc9d1a1d71f14b53998424b82fc
Author: Jeff Zhang <zjf...@apache.org>
AuthorDate: Thu May 28 10:39:25 2020 +0800

    [ZEPPELIN-4839]. Update flink interpreter doc
    
    ### What is this PR for?
    
    This PR updates the flink interpreter doc for the recent improvements in the flink interpreter and adds some screenshots.
    
    ### What type of PR is it?
    [Documentation]
    
    ### Todos
    * [ ] - Task
    
    ### What is the Jira issue?
    * https://issues.apache.org/jira/browse/ZEPPELIN-4839
    
    ### How should this be tested?
    * CI pass
    
    ### Screenshots (if appropriate)
    
    ### Questions:
    * Do the license files need to be updated? No
    * Are there breaking changes for older versions? No
    * Does this need documentation? No
    
    Author: Jeff Zhang <zjf...@apache.org>
    
    Closes #3779 from zjffdu/ZEPPELIN-4839 and squashes the following commits:
    
    ddc774078 [Jeff Zhang] [ZEPPELIN-4839]. Update flink interpreter doc
---
 .../zeppelin/img/docs-img/flink_append_mode.gif    | Bin 0 -> 294307 bytes
 .../zeppelin/img/docs-img/flink_single_mode.gif    | Bin 0 -> 58198 bytes
 .../zeppelin/img/docs-img/flink_update_mode.gif    | Bin 0 -> 131055 bytes
 .../zeppelin/img/docs-img/flink_z_batch_table.png  | Bin 0 -> 189710 bytes
 .../zeppelin/img/docs-img/flink_z_dataset.png      | Bin 0 -> 160627 bytes
 .../zeppelin/img/docs-img/flink_z_stream_table.gif | Bin 0 -> 226356 bytes
 docs/interpreter/flink.md                          | 174 ++++++++++++++++-----
 7 files changed, 137 insertions(+), 37 deletions(-)

diff --git a/docs/assets/themes/zeppelin/img/docs-img/flink_append_mode.gif b/docs/assets/themes/zeppelin/img/docs-img/flink_append_mode.gif
new file mode 100644
index 0000000..3c827f4
Binary files /dev/null and b/docs/assets/themes/zeppelin/img/docs-img/flink_append_mode.gif differ
diff --git a/docs/assets/themes/zeppelin/img/docs-img/flink_single_mode.gif b/docs/assets/themes/zeppelin/img/docs-img/flink_single_mode.gif
new file mode 100644
index 0000000..91b49ed
Binary files /dev/null and b/docs/assets/themes/zeppelin/img/docs-img/flink_single_mode.gif differ
diff --git a/docs/assets/themes/zeppelin/img/docs-img/flink_update_mode.gif b/docs/assets/themes/zeppelin/img/docs-img/flink_update_mode.gif
new file mode 100644
index 0000000..fe7e2e9
Binary files /dev/null and b/docs/assets/themes/zeppelin/img/docs-img/flink_update_mode.gif differ
diff --git a/docs/assets/themes/zeppelin/img/docs-img/flink_z_batch_table.png b/docs/assets/themes/zeppelin/img/docs-img/flink_z_batch_table.png
new file mode 100644
index 0000000..216675e
Binary files /dev/null and b/docs/assets/themes/zeppelin/img/docs-img/flink_z_batch_table.png differ
diff --git a/docs/assets/themes/zeppelin/img/docs-img/flink_z_dataset.png b/docs/assets/themes/zeppelin/img/docs-img/flink_z_dataset.png
new file mode 100644
index 0000000..052b74d
Binary files /dev/null and b/docs/assets/themes/zeppelin/img/docs-img/flink_z_dataset.png differ
diff --git a/docs/assets/themes/zeppelin/img/docs-img/flink_z_stream_table.gif b/docs/assets/themes/zeppelin/img/docs-img/flink_z_stream_table.gif
new file mode 100644
index 0000000..fb52e99
Binary files /dev/null and b/docs/assets/themes/zeppelin/img/docs-img/flink_z_stream_table.gif differ
diff --git a/docs/interpreter/flink.md b/docs/interpreter/flink.md
index 40cf058..108d7b5 100644
--- a/docs/interpreter/flink.md
+++ b/docs/interpreter/flink.md
@@ -26,7 +26,7 @@ limitations under the License.
 ## Overview
 
 [Apache Flink](https://flink.apache.org) is an open source platform for distributed stream and batch data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink also builds batch processing on top of the streaming engine, overlaying native iteration support, managed memory, and program optimization.
 
-In Zeppelin 0.9, we refactor the Flink interpreter in Zeppelin to support the latest version of Flink. **Only Flink 1.10+ is supported, old version of flink may not work.**
+In Zeppelin 0.9, we refactored the Flink interpreter to support the latest version of Flink. **Only Flink 1.10+ is supported; older versions of flink won't work.**
 
 Apache Flink is supported in Zeppelin with Flink interpreter group which consists of below five interpreters.
@@ -65,11 +65,10 @@ Apache Flink is supported in Zeppelin with Flink interpreter group which consist
 ## Prerequisites
 
 * Download Flink 1.10 for scala 2.11 (Only scala-2.11 is supported, scala-2.12 is not supported yet in Zeppelin)
-* Download [flink-hadoop-shaded](https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-hadoop-2/2.8.3-10.0/flink-shaded-hadoop-2-2.8.3-10.0.jar) and put it under lib folder of flink (flink interpreter need that to support yarn mode)
 
 ## Configuration
 
 The Flink interpreter can be configured with properties provided by Zeppelin (as following table).
-You can also set other flink properties which are not listed in the table. For a list of additional properties, refer to [Flink Available Properties](https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html).
+You can also add and set other flink properties that are not listed in the table. For a list of additional properties, refer to [Flink Available Properties](https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html).
 
 <table class="table-configuration">
   <tr>
     <th>Property</th>
     <th>Default</th>
     <th>Description</th>
   </tr>
@@ -191,11 +190,7 @@ You can also set other flink properties which are not listed in the table. For a
     <td>true</td>
     <td>Whether display scala shell output in colorful format</td>
   </tr>
-  <tr>
-    <td>zeppelin.flink.enableHive</td>
-    <td>false</td>
-    <td>Whether enable hive</td>
-  </tr>
+
   <tr>
     <td>zeppelin.flink.enableHive</td>
     <td>false</td>
@@ -249,6 +244,16 @@ And will create 6 variables as pyflink (`%flink.pyflink` or `%flink.ipyflink`) e
 * `st_env_2` (StreamTableEnvironment for flink planner)
 * `bt_env_2` (BatchTableEnvironment for flink planner)
 
+## Blink/Flink Planner
+
+There are 2 planners supported by Flink's table api: `flink` & `blink`.
+
+* If you want to use the DataSet api and convert a DataSet to a flink table, please use the flink planner (`btenv_2` and `stenv_2`).
+* In all other cases, we recommend the `blink` planner. This is also what the flink batch/streaming sql interpreters (`%flink.bsql` & `%flink.ssql`) use.
+
+Check this [page](https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/common.html#main-differences-between-the-two-planners) for the differences between the flink planner and the blink planner.
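+
+Below is a minimal sketch of the difference, written against the scala variables the interpreter creates (`benv`, `btenv`, `btenv_2`); the sample data and the `people` table name are made up for illustration:
+
+```scala
+%flink
+
+// flink planner: required when you start from a DataSet
+val ds = benv.fromElements((1, "jeff"), (2, "andy"))
+val table = btenv_2.fromDataSet(ds)
+
+// blink planner: recommended for everything else, e.g. plain SQL on a
+// table that has already been registered under the name `people`
+val result = btenv.sqlQuery("select name from people")
+```
+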
 ## Execution mode (Local/Remote/Yarn)
 
 Flink in Zeppelin supports 3 execution modes (`flink.execution.mode`):
 
@@ -274,15 +279,7 @@ In order to run flink in Yarn mode, you need to make the following settings:
 
 * Set `flink.execution.mode` to `yarn`
 * Set `HADOOP_CONF_DIR` in flink's interpreter setting.
-* Make sure `hadoop` command is your PATH. Because internally flink will call command `hadoop classpath` and load all the hadoop related jars in the flink interpreter process
-
-
-## Blink/Flink Planner
-
-There're 2 planners supported by Flink's table api: `flink` & `blink`.
-
-* If you want to use DataSet api, and convert it to flink table then please use flink planner (`btenv_2` and `stenv_2`).
-* In other cases, we would always recommend you to use `blink` planner. This is also what flink batch/streaming sql interpreter use (`%flink.bsql` & `%flink.ssql`)
+* Make sure the `hadoop` command is on your PATH. Internally, flink calls `hadoop classpath` and loads all the hadoop related jars into the flink interpreter process.
 
 ## How to use Hive
 
 In order to use Hive in Flink, you have to make the following setting.
 
 * Set `zeppelin.flink.enableHive` to be true
 * Set `zeppelin.flink.hive.version` to be the hive version you are using.
-* Set `HIVE_CONF_DIR` to be the location where `hive-site.xml` is located. Make sure hive metastore is started and you have configure `hive.metastore.uris` in `hive-site.xml`
+* Set `HIVE_CONF_DIR` to be the location where `hive-site.xml` is located. Make sure the hive metastore is started and you have configured `hive.metastore.uris` in `hive-site.xml`.
 * Copy the following dependencies to the lib folder of flink installation.
     * flink-connector-hive_2.11–1.10.0.jar
    * flink-hadoop-compatibility_2.11–1.10.0.jar
    * hive-exec-2.x.jar (for hive 1.x, you need to copy hive-exec-1.x.jar, hive-metastore-1.x.jar, libfb303–0.9.2.jar and libthrift-0.9.2.jar)
 
-After these settings, you will be able to query hive table via either table api `%flink` or batch sql `%flink.bsql`
 
 ## Flink Batch SQL
 
-`%flink.bsql` is used for flink's batch sql. You just type `help` to get all the available commands.
+`%flink.bsql` is used for flink's batch sql. You can type `help` to get all the available commands.
+It supports all flink sql statements, including DML/DDL/DQL.
 
 * Use `insert into` statement for batch ETL
-* Use `select` statement for exploratory data analytics
+* Use `select` statement for batch data analytics
 
 ## Flink Streaming SQL
 
-`%flink.ssql` is used for flink's streaming sql. You just type `help` to get all the available commands. Mainlly there're 2 cases:
+`%flink.ssql` is used for flink's streaming sql. You can type `help` to get all the available commands.
+It supports all flink sql statements, including DML/DDL/DQL.
 
-* Use `insert into` statement for streaming processing
+* Use `insert into` statement for streaming ETL
 * Use `select` statement for streaming data analytics
 
+## Streaming Data Visualization
+
+Zeppelin supports 3 types of streaming data analytics:
+* Single
+* Update
+* Append
+
+### type=single
+Single mode is for the case when the result of the sql statement is always one row, as in the following example. The output format is HTML,
+and you can specify the paragraph local property `template` for the final output content template,
+using `{i}` as a placeholder for the i-th column of the result.
+
+ <center>
+  
+ </center>
+
+### type=update
+Update mode is suitable for the case when the output is more than one row and is updated continuously.
+Here's one example where we use group by.
+
+ <center>
+  
+ </center>
+
+### type=append
+Append mode is suitable for the scenario where output data is always appended, e.g. the following example which uses a tumble window.
+
+ <center>
+  
+ </center>
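+
+As a rough sketch of how such a paragraph looks (the source table `log` with a `url` column is hypothetical; adjust the names to your own tables):
+
+```sql
+%flink.ssql(type=update)
+
+-- one continuously updated count per url
+select url, count(1) as pv from log group by url
+```
+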
 ## Flink UDF
 
-You can use Flink scala UDF or Python UDF in sql. UDF for batch and streaming sql is the same. Here's 2 examples.
+You can use Flink scala UDF or Python UDF in sql. UDF for batch and streaming sql is the same. Here are 2 examples.
 
 * Scala UDF
 
@@ -343,29 +372,100 @@ bt_env.register_function("python_upper", udf(PythonUpper(), DataTypes.STRING(),
 ```
 
-Besides defining udf in Zeppelin, you can also load udfs in jars via `flink.udf.jars`. For example, you can create
-udfs in intellij and then build these udfs in one jar. After that you can specify `flink.udf.jars` to this jar, and flink
+The flink interpreter in Zeppelin only supports scala and python. If you want to write a java udf, or the udf is complicated enough that it is not suitable to write in Zeppelin,
+then you can write the udf in an IDE and build a udf jar.
+In Zeppelin you just need to set `flink.udf.jars` to this jar, and the flink
 interpreter will detect all the udfs in this jar and register all the udfs to TableEnvironment, the udf name is the class name.
 
-## ZeppelinContext
-Zeppelin automatically injects `ZeppelinContext` as variable `z` in your Scala/Python environment. `ZeppelinContext` provides some additional functions and utilities.
-See [Zeppelin-Context](../usage/other_features/zeppelin_context.html) for more details.
+## PyFlink(%flink.pyflink)
+In order to use PyFlink in Zeppelin, you just need to do the following configuration.
+* Install apache-flink (e.g. pip install apache-flink)
+* Set `zeppelin.pyflink.python` to the python executable where apache-flink is installed, in case you have multiple pythons installed.
+* Copy flink-python_2.11-1.10.0.jar from the flink opt folder to the flink lib folder
+
+And PyFlink will create 6 variables for you:
+
+* `s_env` (StreamExecutionEnvironment)
+* `b_env` (ExecutionEnvironment)
+* `st_env` (StreamTableEnvironment for blink planner)
+* `bt_env` (BatchTableEnvironment for blink planner)
+* `st_env_2` (StreamTableEnvironment for flink planner)
+* `bt_env_2` (BatchTableEnvironment for flink planner)
 
-## IPython Support
+### IPython Support(%flink.ipyflink)
 
 By default, zeppelin would use IPython in `%flink.pyflink` when IPython is available, Otherwise it would fall back to the original python implementation.
 For the IPython features, you can refer doc[Python Interpreter](python.html)
 
-## Tutorial Notes
+## ZeppelinContext
+Zeppelin automatically injects `ZeppelinContext` as variable `z` in your Scala/Python environment. `ZeppelinContext` provides some additional functions and utilities.
+See [Zeppelin-Context](../usage/other_features/zeppelin_context.html) for more details. You can use `z` to display both flink DataSets and batch/stream tables.
+
+* Display DataSet
+ <center>
+  
+ </center>
+
+* Display Batch Table
+ <center>
+  
+ </center>
+
+* Display Stream Table
+ <center>
+  
+ </center>
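+
+For example, a minimal sketch of displaying a DataSet (the sample values are made up; `z.show` renders the result with Zeppelin's built-in table visualization):
+
+```scala
+%flink
+
+// a small DataSet with made-up values, displayed via ZeppelinContext
+val data = benv.fromElements((1, "jeff"), (2, "andy"), (3, "james"))
+z.show(data)
+```
+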
+## Paragraph local properties
+
+In the section `Streaming Data Visualization`, we demonstrated the different visualization types via the paragraph local property `type`.
+In this section, we list and explain all the local properties supported by the flink interpreter.
 
-Zeppelin is shipped with several Flink tutorial notes which may be helpful for you. Except the first one, the below 4 notes cover the 4 main scenarios of flink.
+<table class="table-configuration">
+  <tr>
+    <th>Property</th>
+    <th>Default</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>type</td>
+    <td></td>
+    <td>Used in %flink.ssql to specify the streaming visualization type (single, update, append)</td>
+  </tr>
+  <tr>
+    <td>refreshInterval</td>
+    <td>3000</td>
+    <td>Used in `%flink.ssql` to specify the frontend refresh interval (in milliseconds) for streaming data visualization.</td>
+  </tr>
+  <tr>
+    <td>template</td>
+    <td>{0}</td>
+    <td>Used in `%flink.ssql` to specify the html template for the `single` type of streaming data visualization. You can use `{i}` as a placeholder for the i-th column of the result.</td>
+  </tr>
+  <tr>
+    <td>parallelism</td>
+    <td></td>
+    <td>Used in %flink.ssql & %flink.bsql to specify the flink sql job parallelism</td>
+  </tr>
+  <tr>
+    <td>maxParallelism</td>
+    <td></td>
+    <td>Used in %flink.ssql & %flink.bsql to specify the flink sql job max parallelism, in case you want to change the parallelism later. For more details, refer to this [link](https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/parallel.html#setting-the-maximum-parallelism)</td>
+  </tr>
+  <tr>
+    <td>savepointDir</td>
+    <td></td>
+    <td>If you specify it, then when you cancel your flink job in Zeppelin, a savepoint is also taken and the job state is stored in this directory; when you resume the job, it resumes from this savepoint.</td>
+  </tr>
+  <tr>
+    <td>runAsOne</td>
+    <td>false</td>
+    <td>If true, all the insert into sql statements in the paragraph will run in a single flink job.</td>
+  </tr>
+</table>
+
+A sketch that combines several of these properties is shown at the end of this page.
 
-* Flink Basic
-* Batch ETL
-* Exploratory Data Analytics
-* Streaming ETL
-* Streaming Data Analytics
+## Tutorial Notes
+
+Zeppelin is shipped with several Flink tutorial notes which may be helpful for you. You can check out more features in the tutorial notes.
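+
+As promised above, here is a rough sketch that combines several of the paragraph local properties. The table names `source_t`, `sink_a` and `sink_b` are hypothetical; they would have to be created via DDL first:
+
+```sql
+%flink.ssql(runAsOne=true, parallelism=2)
+
+-- with runAsOne=true, both inserts below are compiled into one flink job
+insert into sink_a select * from source_t;
+insert into sink_b select * from source_t;
+```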