This is an automated email from the ASF dual-hosted git repository.

zjffdu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/zeppelin.git


The following commit(s) were added to refs/heads/master by this push:
     new 2276573  [ZEPPELIN-5483] Update spark doc
2276573 is described below

commit 227657365193793515bd64aa1678f45fe3502b0f
Author: Jeff Zhang <zjf...@apache.org>
AuthorDate: Mon Aug 9 11:53:55 2021 +0800

    [ZEPPELIN-5483] Update spark doc
    
    ### What is this PR for?
    
    * Update spark doc
    * Other docs like install.md are updated in this PR as well
    * Add one new tutorial note for how to run pyspark with customized python 
runtime in yarn
    
    ### What type of PR is it?
    [Documentation ]
    
    ### Todos
    * [ ] - Task
    
    ### What is the Jira issue?
    * https://issues.apache.org/jira/browse/ZEPPELIN-5483
    
    ### How should this be tested?
    * no tests needed
    
    ### Screenshots (if appropriate)
    
    ### Questions:
    * Do the license files need updating? no
    * Are there breaking changes for older versions? no
    * Does this need documentation? no
    
    Author: Jeff Zhang <zjf...@apache.org>
    
    Closes #4203 from zjffdu/ZEPPELIN-5483 and squashes the following commits:
    
    42b6213d47 [Jeff Zhang] [ZEPPELIN-5483] Update spark doc
---
 docs/README.md                                     |    2 +-
 docs/_includes/themes/zeppelin/_navigation.html    |    4 +-
 docs/interpreter/ksql.md                           |    6 +-
 docs/interpreter/spark.md                          |  304 +++--
 docs/quickstart/flink_with_zeppelin.md             |   42 +
 docs/quickstart/install.md                         |   30 +-
 docs/quickstart/python_with_zeppelin.md            |    5 +-
 docs/quickstart/r_with_zeppelin.md                 |   42 +
 docs/quickstart/spark_with_zeppelin.md             |   13 +-
 docs/quickstart/sql_with_zeppelin.md               |    2 +
 docs/setup/deployment/flink_and_spark_cluster.md   |    2 +
 docs/usage/interpreter/dependency_management.md    |    5 +-
 .... PySpark Conda Env in Yarn Mode_2GE79Y5FV.zpln | 1397 ++++++++++++++++++++
 13 files changed, 1738 insertions(+), 116 deletions(-)

diff --git a/docs/README.md b/docs/README.md
index 7ca822e..c9646e1 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -42,7 +42,7 @@ bundle exec jekyll serve --watch
 
 **Run locally using docker**
 ```
-docker run --rm -it \                                                  
+docker run --rm -it \
        -v $PWD:/docs \
        -w /docs \
        -p '4000:4000' \
diff --git a/docs/_includes/themes/zeppelin/_navigation.html 
b/docs/_includes/themes/zeppelin/_navigation.html
index 8bbf6b0..3459856 100644
--- a/docs/_includes/themes/zeppelin/_navigation.html
+++ b/docs/_includes/themes/zeppelin/_navigation.html
@@ -34,8 +34,10 @@
                 <li><a href="{{BASE_PATH}}/quickstart/yarn.html">Yarn</a></li>
                 <li role="separator" class="divider"></li>
                 <li><a 
href="{{BASE_PATH}}/quickstart/spark_with_zeppelin.html">Spark with 
Zeppelin</a></li>
+                <li><a 
href="{{BASE_PATH}}/quickstart/flink_with_zeppelin.html">Flink with 
Zeppelin</a></li>
                 <li><a 
href="{{BASE_PATH}}/quickstart/sql_with_zeppelin.html">SQL with 
Zeppelin</a></li>
                 <li><a 
href="{{BASE_PATH}}/quickstart/python_with_zeppelin.html">Python with 
Zeppelin</a></li>
+                <li><a href="{{BASE_PATH}}/quickstart/r_with_zeppelin.html">R 
with Zeppelin</a></li>
               </ul>
             </li>
 
@@ -131,6 +133,7 @@
                 <li><a 
href="{{BASE_PATH}}/usage/interpreter/overview.html">Overview</a></li>
                 <li role="separator" class="divider"></li>
                 <li><a 
href="{{BASE_PATH}}/interpreter/spark.html">Spark</a></li>
+                <li><a 
href="{{BASE_PATH}}/interpreter/flink.html">Flink</a></li>
                 <li><a href="{{BASE_PATH}}/interpreter/jdbc.html">JDBC</a></li>
                 <li><a 
href="{{BASE_PATH}}/interpreter/python.html">Python</a></li>
                 <li><a href="{{BASE_PATH}}/interpreter/r.html">R</a></li>
@@ -140,7 +143,6 @@
                 <li><a 
href="{{BASE_PATH}}/interpreter/bigquery.html">BigQuery</a></li>
                 <li><a 
href="{{BASE_PATH}}/interpreter/cassandra.html">Cassandra</a></li>
                 <li><a 
href="{{BASE_PATH}}/interpreter/elasticsearch.html">Elasticsearch</a></li>
-                <li><a 
href="{{BASE_PATH}}/interpreter/flink.html">Flink</a></li>
                 <li><a 
href="{{BASE_PATH}}/interpreter/geode.html">Geode</a></li>
                 <li><a 
href="{{BASE_PATH}}/interpreter/groovy.html">Groovy</a></li>
                 <li><a 
href="{{BASE_PATH}}/interpreter/hazelcastjet.html">Hazelcast Jet</a></li>
diff --git a/docs/interpreter/ksql.md b/docs/interpreter/ksql.md
index bc91ade..2a308be 100644
--- a/docs/interpreter/ksql.md
+++ b/docs/interpreter/ksql.md
@@ -57,7 +57,7 @@ Following some examples:
 PRINT 'orders';
 ```
 
-![PRINT image]({{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/ksql.1.gif)
+![PRINT image]({{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/ksql.1.png)
 
 ```
 %ksql
@@ -66,7 +66,7 @@ CREATE STREAM ORDERS WITH
    KAFKA_TOPIC ='orders');
 ```
 
-![CREATE image]({{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/ksql.1.gif)
+![CREATE image]({{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/ksql.2.png)
 
 ```
 %ksql
@@ -75,4 +75,4 @@ FROM ORDERS
 LIMIT 10
 ```
 
-![LIMIT image]({{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/ksql.3.gif)
\ No newline at end of file
+![LIMIT image]({{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/ksql.3.png)
\ No newline at end of file
diff --git a/docs/interpreter/spark.md b/docs/interpreter/spark.md
index fd0356d..12e0560 100644
--- a/docs/interpreter/spark.md
+++ b/docs/interpreter/spark.md
@@ -26,7 +26,7 @@ limitations under the License.
 ## Overview
 [Apache Spark](http://spark.apache.org) is a fast and general-purpose cluster 
computing system.
 It provides high-level APIs in Java, Scala, Python and R, and an optimized 
engine that supports general execution graphs.
-Apache Spark is supported in Zeppelin with Spark interpreter group which 
consists of below six interpreters.
+Apache Spark is supported in Zeppelin with the Spark interpreter group, which consists of the following interpreters.
 
 <table class="table-configuration">
   <tr>
@@ -52,7 +52,17 @@ Apache Spark is supported in Zeppelin with Spark interpreter 
group which consist
   <tr>
     <td>%spark.r</td>
     <td>SparkRInterpreter</td>
-    <td>Provides an R environment with SparkR support</td>
+    <td>Provides a vanilla R environment with SparkR support</td>
+  </tr>
+  <tr>
+    <td>%spark.ir</td>
+    <td>SparkIRInterpreter</td>
+    <td>Provides an R environment with SparkR support based on Jupyter 
IRKernel</td>
+  </tr>
+  <tr>
+    <td>%spark.shiny</td>
+    <td>SparkShinyInterpreter</td>
+    <td>Used to create an R Shiny app with SparkR support</td>
   </tr>
   <tr>
     <td>%spark.sql</td>
@@ -66,6 +76,69 @@ Apache Spark is supported in Zeppelin with Spark interpreter 
group which consist
   </tr>
 </table>
 
+## Main Features
+
+<table class="table-configuration">
+  <tr>
+    <th>Feature</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>Support multiple versions of Spark</td>
+    <td>You can run different versions of Spark in one Zeppelin instance</td>
+  </tr>
+  <tr>
+    <td>Support multiple versions of Scala</td>
+    <td>You can run different Scala versions (2.10/2.11/2.12) of Spark in one Zeppelin instance</td>
+  </tr>
+  <tr>
+    <td>Support multiple languages</td>
+    <td>Scala, SQL, Python and R are supported. Besides that, you can also collaborate across languages, e.g. write a Scala UDF and use it in PySpark (see the example after this table)</td>
+  </tr>
+  <tr>
+    <td>Support multiple execution modes</td>
+    <td>Local | Standalone | Yarn | K8s </td>
+  </tr>
+  <tr>
+    <td>Interactive development</td>
+    <td>An interactive development user experience increases your productivity</td>
+  </tr>
+
+  <tr>
+    <td>Inline Visualization</td>
+    <td>You can visualize Spark Dataset/DataFrame via Python/R's plotting libraries, and you can even build a SparkR Shiny app in Zeppelin</td>
+  </tr>
+
+  <tr>
+    <td>Multi-tenancy</td>
+    <td>Multiple users can work in one Zeppelin instance without affecting each other.</td>
+  </tr>
+
+  <tr>
+    <td>Rest API Support</td>
+    <td>You can not only submit Spark jobs via the Zeppelin notebook UI, but also do that via its REST API (you can use Zeppelin as a Spark job server).</td>
+  </tr>
+</table>
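+
+As a minimal sketch of the cross-language collaboration mentioned in the table above (the UDF name is only an illustration), you can register a Scala UDF on the shared SparkSession:
+
+```scala
+%spark
+
+// Register a Scala UDF on the shared SparkSession. Because the Spark
+// interpreters in one interpreter group share this session, the UDF is
+// also visible from %spark.sql and %spark.pyspark.
+spark.udf.register("toUpper", (s: String) => s.toUpperCase)
+```
+
+It can then be called from the other languages, e.g. in a `%spark.sql` paragraph via `select toUpper('zeppelin')`.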
+
+## Play Spark in Zeppelin docker
+
+For beginners, we suggest you play with Spark in the Zeppelin docker image.
+In the Zeppelin docker image, we have already installed
+miniconda and lots of [useful python and R libraries](https://github.com/apache/zeppelin/blob/branch-0.10/scripts/docker/zeppelin/bin/env_python_3_with_R.yml)
+including IPython and IRkernel prerequisites, so `%spark.pyspark` uses IPython and `%spark.ir` is enabled.
+Without any extra configuration, you can run most of the tutorial notes under the folder `Spark Tutorial` directly.
+
+First you need to download Spark, because no Spark binary distribution is shipped with Zeppelin.
+e.g. here we download Spark 3.1.2 to `/mnt/disk1/spark-3.1.2`,
+mount it into the Zeppelin docker container, and run the following command to start the container.
+
+```bash
+docker run -u $(id -u) -p 8080:8080 --rm -v /mnt/disk1/spark-3.1.2:/opt/spark 
-e SPARK_HOME=/opt/spark  --name zeppelin apache/zeppelin:0.10.0
+```
+
+After running the above command, you can open `http://localhost:8080` to play with Spark in Zeppelin. We only verify Spark local mode in the Zeppelin docker image; other modes may not work due to network issues.
+
+
 ## Configuration
 The Spark interpreter can be configured with properties provided by Zeppelin.
 You can also set other Spark properties which are not listed in the table. For 
a list of additional properties, refer to [Spark Available 
Properties](http://spark.apache.org/docs/latest/configuration.html#available-properties).
@@ -201,14 +274,15 @@ You can also set other Spark properties which are not 
listed in the table. For a
     <td></td>
     <td>
       Overrides Spark UI default URL. Value should be a full URL (ex: 
http://{hostName}/{uniquePath}.
-      In Kubernetes mode, value can be Jinja template string with 3 template 
variables 'PORT', 'SERVICE_NAME' and 'SERVICE_DOMAIN'.
-      (ex: http://{{PORT}}-{{SERVICE_NAME}}.{{SERVICE_DOMAIN}})
+      In Kubernetes mode, value can be Jinja template string with 3 template 
variables PORT, {% raw %} SERVICE_NAME {% endraw %}  and  {% raw %} 
SERVICE_DOMAIN {% endraw %}.
+      (e.g.: {% raw %}http://{{PORT}}-{{SERVICE_NAME}}.{{SERVICE_DOMAIN}} {% 
endraw %}). In yarn mode, value could be a knox url with {% raw %} 
{{applicationId}} {% endraw %} as placeholder,
+      (e.g.: {% raw 
%}https://knox-server:8443/gateway/yarnui/yarn/proxy/{{applicationId}}/{% 
endraw %})
      </td>
   </tr>
   <tr>
     <td>spark.webui.yarn.useProxy</td>
     <td>false</td>
-    <td>whether use yarn proxy url as spark weburl, e.g. 
http://localhost:8088/proxy/application_1583396598068_0004</td>
+    <td>Whether to use the yarn proxy url as the Spark web UI url, e.g. http://localhost:8088/proxy/application_1583396598068_0004</td>
   </tr>
   <tr>
     <td>spark.repl.target</td>
@@ -224,17 +298,21 @@ You can also set other Spark properties which are not 
listed in the table. For a
 
 Without any configuration, Spark interpreter works out of box in local mode. 
But if you want to connect to your Spark cluster, you'll need to follow below 
two simple steps.
 
-### Export SPARK_HOME
+* Set SPARK_HOME
+* Set master
+
+
+### Set SPARK_HOME
 
 There are several options for setting `SPARK_HOME`.
 
 * Set `SPARK_HOME` in `zeppelin-env.sh`
-* Set `SPARK_HOME` in Interpreter setting page
+* Set `SPARK_HOME` in interpreter setting page
 * Set `SPARK_HOME` via [inline generic 
configuration](../usage/interpreter/overview.html#inline-generic-confinterpreter)
 
 
-#### 1. Set `SPARK_HOME` in `zeppelin-env.sh`
+#### Set `SPARK_HOME` in `zeppelin-env.sh`
 
-If you work with only one version of spark, then you can set `SPARK_HOME` in 
`zeppelin-env.sh` because any setting in `zeppelin-env.sh` is globally applied.
+If you work with only one version of Spark, then you can set `SPARK_HOME` in 
`zeppelin-env.sh` because any setting in `zeppelin-env.sh` is globally applied.
 
 e.g. 
 
@@ -251,21 +329,21 @@ export HADOOP_CONF_DIR=/usr/lib/hadoop
 ```
 
 
-#### 2. Set `SPARK_HOME` in Interpreter setting page
+#### Set `SPARK_HOME` in interpreter setting page
 
-If you want to use multiple versions of spark, then you need create multiple 
spark interpreters and set `SPARK_HOME` for each of them. e.g.
-Create a new spark interpreter `spark24` for spark 2.4 and set `SPARK_HOME` in 
interpreter setting page
+If you want to use multiple versions of Spark, then you need to create multiple Spark interpreters and set `SPARK_HOME` for each of them separately. e.g.
+Create a new Spark interpreter `spark24` for Spark 2.4 and set its `SPARK_HOME` in the interpreter setting page as follows:
 <center>
 <img 
src="{{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/spark_SPARK_HOME24.png" 
width="80%">
 </center>
 
-Create a new spark interpreter `spark16` for spark 1.6 and set `SPARK_HOME` in 
interpreter setting page
+Create a new Spark interpreter `spark16` for Spark 1.6 and set its `SPARK_HOME` in the interpreter setting page as follows:
 <center>
 <img 
src="{{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/spark_SPARK_HOME16.png" 
width="80%">
 </center>
 
 
-#### 3. Set `SPARK_HOME` via [inline generic 
configuration](../usage/interpreter/overview.html#inline-generic-confinterpreter)
 
+#### Set `SPARK_HOME` via [inline generic 
configuration](../usage/interpreter/overview.html#inline-generic-confinterpreter)
 
 
 Besides setting `SPARK_HOME` in interpreter setting page, you can also use 
inline generic configuration to put the 
 configuration with code together for more flexibility. e.g.
@@ -273,23 +351,26 @@ configuration with code together for more flexibility. 
e.g.
 <img 
src="{{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/spark_inline_configuration.png"
 width="80%">
 </center>
 
-### Set master in Interpreter menu
-After starting Zeppelin, go to **Interpreter** menu and edit **spark.master** 
property in your Spark interpreter setting. The value may vary depending on 
your Spark cluster deployment type.
+### Set master
+
+After setting `SPARK_HOME`, you need to set the **spark.master** property in either the interpreter setting page or inline configuration. The value may vary depending on your Spark cluster deployment type.
 
 For example,
 
  * **local[*]** in local mode
  * **spark://master:7077** in standalone cluster
- * **yarn-client** in Yarn client mode  (Not supported in spark 3.x, refer 
below for how to configure yarn-client in Spark 3.x)
- * **yarn-cluster** in Yarn cluster mode  (Not supported in spark 3.x, refer 
below for how to configure yarn-client in Spark 3.x)
+ * **yarn-client** in Yarn client mode (not supported in Spark 3.x; see below for how to configure yarn-client in Spark 3.x)
+ * **yarn-cluster** in Yarn cluster mode (not supported in Spark 3.x; see below for how to configure yarn-cluster in Spark 3.x)
  * **mesos://host:5050** in Mesos cluster
 
 That's it. Zeppelin will work with any version of Spark and any deployment 
type without rebuilding Zeppelin in this way.
 For the further information about Spark & Zeppelin version compatibility, 
please refer to "Available Interpreters" section in [Zeppelin download 
page](https://zeppelin.apache.org/download.html).
 
-> Note that without exporting `SPARK_HOME`, it's running in local mode with 
included version of Spark. The included version may vary depending on the build 
profile.
+Note that without setting `SPARK_HOME`, Zeppelin runs in local mode with the included version of Spark. The included version may vary depending on the build profile,
+and it has limited functionality, so it is always recommended to set `SPARK_HOME`.
 
-> Yarn client mode and local mode will run driver in the same machine with 
zeppelin server, this would be dangerous for production. Because it may run out 
of memory when there's many spark interpreters running at the same time. So we 
suggest you only allow yarn-cluster mode via setting 
`zeppelin.spark.only_yarn_cluster` in `zeppelin-site.xml`.
+Yarn client mode and local mode run the driver on the same machine as the Zeppelin server, which is dangerous for production because it may run out of memory when many Spark interpreters are running at the same time.
+So we suggest you only allow yarn-cluster mode by setting `zeppelin.spark.only_yarn_cluster` in `zeppelin-site.xml`.
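+
+For example, the two steps above can also be done together via inline configuration in the first paragraph of a note; a minimal sketch (the Spark path below is hypothetical):
+
+```
+%spark.conf
+
+# point Zeppelin at a Spark distribution and connect to a yarn cluster
+SPARK_HOME /mnt/disk1/spark-3.1.2
+spark.master yarn
+spark.submit.deployMode cluster
+```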
 
 #### Configure yarn mode for Spark 3.x
 
@@ -314,77 +395,55 @@ Specifying `yarn-client` & `yarn-cluster` in 
`spark.master` is not supported in
 </table>
 
 
-## SparkContext, SQLContext, SparkSession, ZeppelinContext
+## Interpreter binding mode
 
-SparkContext, SQLContext, SparkSession (for spark 2.x) and ZeppelinContext are 
automatically created and exposed as variable names `sc`, `sqlContext`, `spark` 
and `z`, respectively, in Scala, Kotlin, Python and R environments.
+The default [interpreter binding 
mode](../usage/interpreter/interpreter_binding_mode.html) is `globally shared`. 
That means all notes share the same Spark interpreter.
 
+So we recommend you use `isolated per note`, which means each note has its own Spark interpreter without affecting the others. But this may exhaust your machine's resources if too many
+Spark interpreters are created, so we recommend always using yarn-cluster mode in production if you run Spark in a hadoop cluster. And you can use [inline configuration](../usage/interpreter/overview.html#inline-generic-configuration) via `%spark.conf` in the first paragraph to customize your Spark configuration.
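+
+For example, a per-note `%spark.conf` paragraph (the values below are only illustrative) could look like the following:
+
+```
+%spark.conf
+
+# per-note Spark settings, applied when this note's interpreter starts
+spark.driver.memory 4g
+spark.executor.memory 4g
+```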
 
-> Note that Scala/Python/R environment shares the same SparkContext, 
SQLContext, SparkSession and ZeppelinContext instance.
+You can also choose `scoped` mode. In `scoped` per-note mode, Zeppelin creates a separate Scala compiler/Python shell for each note but shares a single `SparkContext/SqlContext/SparkSession`.
 
-## YARN Mode
-Zeppelin support both yarn client and yarn cluster mode (yarn cluster mode is 
supported from 0.8.0). For yarn mode, you must specify `SPARK_HOME` & 
`HADOOP_CONF_DIR`. 
-Usually you only have one hadoop cluster, so you can set `HADOOP_CONF_DIR` in 
`zeppelin-env.sh` which is applied to all spark interpreters. If you want to 
use spark against multiple hadoop cluster, then you need to define
-`HADOOP_CONF_DIR` in interpreter setting or via inline generic configuration.
 
-## Dependency Management
+## SparkContext, SQLContext, SparkSession, ZeppelinContext
 
-For spark interpreter, it is not recommended to use Zeppelin's [Dependency 
Management](../usage/interpreter/dependency_management.html) for managing 
-third party dependencies (`%spark.dep` is removed from Zeppelin 0.9 as well). 
Instead you should set the standard Spark properties.
+SparkContext, SQLContext, SparkSession (for Spark 2.x and 3.x) and ZeppelinContext are automatically created and exposed as variables `sc`, `sqlContext`, `spark` and `z`, respectively, in the Scala, Kotlin, Python and R environments.
 
-<table class="table-configuration">
-  <tr>
-    <th>Spark Property</th>
-    <th>Spark Submit Argument</th>
-    <th>Description</th>
-  </tr>
-  <tr>
-    <td>spark.files</td>
-    <td>--files</td>
-    <td>Comma-separated list of files to be placed in the working directory of 
each executor. Globs are allowed.</td>
-  </tr>
-  <tr>
-    <td>spark.jars</td>
-    <td>--jars</td>
-    <td>Comma-separated list of jars to include on the driver and executor 
classpaths. Globs are allowed.</td>
-  </tr>
-  <tr>
-    <td>spark.jars.packages</td>
-    <td>--packages</td>
-    <td>Comma-separated list of Maven coordinates of jars to include on the 
driver and executor classpaths. The coordinates should be 
groupId:artifactId:version. If spark.jars.ivySettings is given artifacts will 
be resolved according to the configuration in the file, otherwise artifacts 
will be searched for in the local maven repo, then maven central and finally 
any additional remote repositories given by the command-line option 
--repositories.</td>
-  </tr>
-</table>
 
-You can either set Spark properties in interpreter setting page or set Spark 
submit arguments in `zeppelin-env.sh` via environment variable 
`SPARK_SUBMIT_OPTIONS`. 
-For examples:
+> Note that Scala/Python/R environment shares the same SparkContext, 
SQLContext, SparkSession and ZeppelinContext instance.
 
-```bash
-export SPARK_SUBMIT_OPTIONS="--files <my_file> --jars <my_jar> --packages 
<my_package>"
-```
+## Yarn Mode
+
+Zeppelin supports both yarn client and yarn cluster mode (yarn cluster mode is supported from 0.8.0). For yarn mode, you must specify `SPARK_HOME` & `HADOOP_CONF_DIR`.
+Usually you only have one hadoop cluster, so you can set `HADOOP_CONF_DIR` in `zeppelin-env.sh`, which is applied to all Spark interpreters. If you want to use Spark against multiple hadoop clusters, then you need to define
+`HADOOP_CONF_DIR` in the interpreter setting or via inline generic configuration.
+
+## K8s Mode
 
-But it is not recommended to set them in `SPARK_SUBMIT_OPTIONS`. Because it 
will be shared by all spark interpreters, which means you can not set different 
dependencies for different users.
+Regarding how to run Spark on K8s in Zeppelin, please check [this 
doc](../quickstart/kubernetes.html).
 
 
 ## PySpark
 
-There're 2 ways to use PySpark in Zeppelin:
+There are 2 ways to use PySpark in Zeppelin:
 
 * Vanilla PySpark
 * IPySpark
 
 ### Vanilla PySpark (Not Recommended)
-Vanilla PySpark interpreter is almost the same as vanilla Python interpreter 
except Zeppelin inject SparkContext, SQLContext, SparkSession via variables 
`sc`, `sqlContext`, `spark`.
 
-By default, Zeppelin would use IPython in `%spark.pyspark` when IPython is 
available, Otherwise it would fall back to the original PySpark implementation.
-If you don't want to use IPython, then you can set 
`zeppelin.pyspark.useIPython` as `false` in interpreter setting. For the 
IPython features, you can refer doc
-[Python Interpreter](python.html)
+The vanilla PySpark interpreter is almost the same as the vanilla Python interpreter, except that the Spark interpreter injects SparkContext, SQLContext and SparkSession via the variables `sc`, `sqlContext` and `spark`.
+
+By default, Zeppelin uses IPython in `%spark.pyspark` when IPython is available (Zeppelin checks whether IPython's prerequisites are met); otherwise it falls back to the vanilla PySpark implementation.
 
 ### IPySpark (Recommended)
-You can use `IPySpark` explicitly via `%spark.ipyspark`. IPySpark interpreter 
is almost the same as IPython interpreter except Zeppelin inject SparkContext, 
SQLContext, SparkSession via variables `sc`, `sqlContext`, `spark`.
-For the IPython features, you can refer doc [Python Interpreter](python.html)
+
+You can use IPySpark explicitly via `%spark.ipyspark`. The IPySpark interpreter is almost the same as the IPython interpreter, except that the Spark interpreter injects SparkContext, SQLContext and SparkSession via the variables `sc`, `sqlContext` and `spark`.
+For the IPython features, you can refer to the doc [Python Interpreter](python.html#ipython-interpreter-pythonipython-recommended).
 
 ## SparkR
 
-Zeppelin support SparkR via `%spark.r`. Here's configuration for SparkR 
Interpreter.
+Zeppelin supports SparkR via `%spark.r`, `%spark.ir` and `%spark.shiny`. Here's the configuration for the SparkR interpreter.
 
 <table class="table-configuration">
   <tr>
@@ -412,12 +471,28 @@ Zeppelin support SparkR via `%spark.r`. Here's 
configuration for SparkR Interpre
     <td>out.format = 'html', comment = NA, echo = FALSE, results = 'asis', 
message = F, warning = F, fig.retina = 2</td>
     <td>R plotting options.</td>
   </tr>
+  <tr>
+    <td>zeppelin.R.shiny.iframe_width</td>
+    <td>100%</td>
+    <td>IFrame width of Shiny App</td>
+  </tr>
+  <tr>
+    <td>zeppelin.R.shiny.iframe_height</td>
+    <td>500px</td>
+    <td>IFrame height of Shiny App</td>
+  </tr>
+  <tr>
+    <td>zeppelin.R.shiny.portRange</td>
+    <td>:</td>
+    <td>A Shiny app launches a web app at some port; this property specifies the port range via the format '&lt;start&gt;:&lt;end&gt;', e.g. '5000:5001'. By default it is ':', which means any port</td>
+  </tr>
 </table>
 
+Refer to the [R doc](r.html) for how to use R in Zeppelin.
 
 ## SparkSql
 
-Spark Sql Interpreter share the same SparkContext/SparkSession with other 
Spark interpreter. That means any table registered in scala, python or r code 
can be accessed by Spark Sql.
+The Spark SQL interpreter shares the same SparkContext/SparkSession with the other Spark interpreters. That means any table registered in Scala, Python or R code can be accessed by Spark SQL.
 For examples:
 
 ```scala
@@ -435,11 +510,13 @@ df.createOrReplaceTempView("people")
 select * from people
 ```
 
-By default, each sql statement would run sequentially in `%spark.sql`. But you 
can run them concurrently by following setup.
+You can write multiple SQL statements in one paragraph, separated by semicolons.
+SQL statements in one paragraph run sequentially (see the example below).
+But SQL statements in different paragraphs can run concurrently with the following configuration.
 
-1. Set `zeppelin.spark.concurrentSQL` to true to enable the sql concurrent 
feature, underneath zeppelin will change to use fairscheduler for spark. And 
also set `zeppelin.spark.concurrentSQL.max` to control the max number of sql 
statements running concurrently.
+1. Set `zeppelin.spark.concurrentSQL` to true to enable the concurrent SQL feature; underneath, Zeppelin will switch to the fair scheduler for Spark. Also set `zeppelin.spark.concurrentSQL.max` to control the max number of SQL statements running concurrently.
 2. Configure pools by creating `fairscheduler.xml` under your 
`SPARK_CONF_DIR`, check the official spark doc [Configuring Pool 
Properties](http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties)
-3. Set pool property via setting paragraph property. e.g.
+3. Set the pool via a paragraph local property, e.g.
 
  ```
  %spark(pool=pool1)
@@ -448,25 +525,61 @@ By default, each sql statement would run sequentially in 
`%spark.sql`. But you c
  ```
 
 This pool feature is also available for all versions of scala Spark, PySpark. 
For SparkR, it is only available starting from 2.3.0.
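+
+For example, a single `%spark.sql` paragraph with two semicolon-separated statements (using the `people` view registered in the Scala example above) could look like:
+
+```
+%spark.sql
+
+show tables;
+select * from people
+```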
- 
-## Interpreter Setting Option
 
-You can choose one of `shared`, `scoped` and `isolated` options when you 
configure Spark interpreter.
-e.g. 
+## Dependency Management
+
+For the Spark interpreter, it is not recommended to use Zeppelin's [Dependency Management](../usage/interpreter/dependency_management.html) for managing
+third-party dependencies (`%spark.dep` is removed from Zeppelin 0.9 as well). Instead, you should set the standard Spark properties as follows:
+
+<table class="table-configuration">
+  <tr>
+    <th>Spark Property</th>
+    <th>Spark Submit Argument</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>spark.files</td>
+    <td>--files</td>
+    <td>Comma-separated list of files to be placed in the working directory of 
each executor. Globs are allowed.</td>
+  </tr>
+  <tr>
+    <td>spark.jars</td>
+    <td>--jars</td>
+    <td>Comma-separated list of jars to include on the driver and executor 
classpaths. Globs are allowed.</td>
+  </tr>
+  <tr>
+    <td>spark.jars.packages</td>
+    <td>--packages</td>
+    <td>Comma-separated list of Maven coordinates of jars to include on the 
driver and executor classpaths. The coordinates should be 
groupId:artifactId:version. If spark.jars.ivySettings is given artifacts will 
be resolved according to the configuration in the file, otherwise artifacts 
will be searched for in the local maven repo, then maven central and finally 
any additional remote repositories given by the command-line option 
--repositories.</td>
+  </tr>
+</table>
+
+As these are general Spark properties, you can set them via inline configuration, the interpreter setting page, or in `zeppelin-env.sh` via the environment variable `SPARK_SUBMIT_OPTIONS`.
+For example:
+
+```bash
+export SPARK_SUBMIT_OPTIONS="--files <my_file> --jars <my_jar> --packages 
<my_package>"
+```
+
+Note that `SPARK_SUBMIT_OPTIONS` is deprecated and will be removed in a future release.
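+
+For example, a minimal inline-configuration sketch (the Maven coordinate below is just an illustration):
+
+```
+%spark.conf
+
+# resolve the artifact from Maven and add it to driver and executor classpaths
+spark.jars.packages com.databricks:spark-csv_2.11:1.2.0
+```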
 
-* In `scoped` per user mode, Zeppelin creates separated Scala compiler for 
each user but share a single SparkContext.
-* In `isolated` per user mode, Zeppelin creates separated SparkContext for 
each user.
 
 ## ZeppelinContext
+
 Zeppelin automatically injects `ZeppelinContext` as variable `z` in your 
Scala/Python environment. `ZeppelinContext` provides some additional functions 
and utilities.
-See [Zeppelin-Context](../usage/other_features/zeppelin_context.html) for more 
details.
+See [Zeppelin-Context](../usage/other_features/zeppelin_context.html) for more details. For the Spark interpreter, you can use `z` to display a Spark `Dataset/DataFrame`.
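+
+For example, a minimal Scala sketch (the DataFrame below is just an illustration):
+
+```scala
+%spark
+
+// build a small DataFrame and render it as an interactive table/chart
+val df = spark.range(0, 10).toDF("id")
+z.show(df)
+```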
+
+
+<img src="{{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/spark_zshow.png">
+
 
 ## Setting up Zeppelin with Kerberos
+
 Logical setup with Zeppelin, Kerberos Key Distribution Center (KDC), and Spark 
on YARN:
 
 <img src="{{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/kdc_zeppelin.png">
 
-There're several ways to make spark work with kerberos enabled hadoop cluster 
in Zeppelin. 
+There are several ways to make Spark work with a kerberos-enabled hadoop cluster in Zeppelin.
 
 1. Share one single hadoop cluster.
 In this case you just need to specify `zeppelin.server.kerberos.keytab` and 
`zeppelin.server.kerberos.principal` in zeppelin-site.xml, Spark interpreter 
will use these setting by default.
@@ -474,11 +587,26 @@ In this case you just need to specify 
`zeppelin.server.kerberos.keytab` and `zep
 2. Work with multiple hadoop clusters.
 In this case you can specify `spark.yarn.keytab` and `spark.yarn.principal` to 
override `zeppelin.server.kerberos.keytab` and 
`zeppelin.server.kerberos.principal`.
 
+### Configuration Setup
+
+1. On the server where Zeppelin is installed, install the Kerberos client modules and configuration (krb5.conf).
+   This is to make the server communicate with the KDC.
+
+2. Add the two properties below to Spark configuration 
(`[SPARK_HOME]/conf/spark-defaults.conf`):
+
+    ```
+    spark.yarn.principal
+    spark.yarn.keytab
+    ```
+
+> **NOTE:** If you do not have permission to access the above spark-defaults.conf file, you can optionally add the above lines to the Spark interpreter setting through the Interpreter tab in the Zeppelin UI.
+
+3. That's it. Play with Zeppelin!
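+
+For illustration, the two properties from step 2 might look like the following in `spark-defaults.conf` (the principal and keytab path are hypothetical):
+
+```
+spark.yarn.principal  zeppelin@EXAMPLE.COM
+spark.yarn.keytab     /etc/security/keytabs/zeppelin.service.keytab
+```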
 
 ## User Impersonation
 
-In yarn mode, the user who launch the zeppelin server will be used to launch 
the spark yarn application. This is not a good practise.
-Most of time, you will enable shiro in Zeppelin and would like to use the 
login user to submit the spark yarn app. For this purpose,
+In yarn mode, the user who launches the Zeppelin server will be used to launch the Spark yarn application. This is not a good practice.
+Most of the time, you will enable shiro in Zeppelin and would like to use the login user to submit the Spark yarn app. For this purpose,
 you need to enable user impersonation for more security control. In order the 
enable user impersonation, you need to do the following steps
 
 **Step 1** Enable user impersonation setting hadoop's `core-site.xml`. E.g. if 
you are using user `zeppelin` to launch Zeppelin, then add the following to 
`core-site.xml`, then restart both hdfs and yarn. 
@@ -508,19 +636,7 @@ You can get rid of this message by setting 
`zeppelin.spark.deprecatedMsg.show` t
 
 <img 
src="{{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/spark_deprecate.png">
 
-### Configuration Setup
-
-1. On the server that Zeppelin is installed, install Kerberos client modules 
and configuration, krb5.conf.
-This is to make the server communicate with KDC.
-
-2. Add the two properties below to Spark configuration 
(`[SPARK_HOME]/conf/spark-defaults.conf`):
 
-    ```
-    spark.yarn.principal
-    spark.yarn.keytab
-    ```
-
-  > **NOTE:** If you do not have permission to access for the above 
spark-defaults.conf file, optionally, you can add the above lines to the Spark 
Interpreter setting through the Interpreter tab in the Zeppelin UI.
-
-3. That's it. Play with Zeppelin!
+## Community
 
+[Join our community](http://zeppelin.apache.org/community.html) to discuss 
with others.
diff --git a/docs/quickstart/flink_with_zeppelin.md 
b/docs/quickstart/flink_with_zeppelin.md
new file mode 100644
index 0000000..70f7970
--- /dev/null
+++ b/docs/quickstart/flink_with_zeppelin.md
@@ -0,0 +1,42 @@
+---
+layout: page
+title: "Flink with Zeppelin"
+description: ""
+group: quickstart
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+{% include JB/setup %}
+
+# Flink support in Zeppelin 
+
+<div id="toc"></div>
+
+<br/>
+
+For a brief overview of Apache Flink fundamentals with Apache Zeppelin, see 
the following guide:
+
+- **built-in** Apache Flink integration.
+- With [Flink Scala Shell](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/repls/scala_shell/), [PyFlink Shell](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/repls/python_shell/), [Flink SQL](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/table/sql/overview/)
+- Inject ExecutionEnvironment, StreamExecutionEnvironment, 
BatchTableEnvironment, StreamTableEnvironment.
+- Canceling jobs and displaying their progress
+- Supports different modes: local, remote, yarn, yarn-application
+- Dependency management
+- Streaming Visualization
+
+<br/>
+
+For further information about Flink support in Zeppelin, please check
+
+- [Flink Interpreter](../interpreter/flink.html)
diff --git a/docs/quickstart/install.md b/docs/quickstart/install.md
index aa14d9f..4606e0f 100644
--- a/docs/quickstart/install.md
+++ b/docs/quickstart/install.md
@@ -50,7 +50,7 @@ Two binary packages are available on the [download 
page](http://zeppelin.apache.
 
 - **all interpreter package**: unpack it in a directory of your choice and 
you're ready to go.
 - **net-install interpreter package**: only spark, python, markdown and shell 
interpreter included. Unpack and follow [install additional 
interpreters](../usage/interpreter/installation.html) to install other 
interpreters. If you're unsure, just run `./bin/install-interpreter.sh --all` 
and install all interpreters.
-  
+
 ### Building Zeppelin from source
 
 Follow the instructions [How to Build](../setup/basics/how_to_build.html), If 
you want to build from source instead of using binary package.
@@ -67,9 +67,11 @@ bin/zeppelin-daemon.sh start
 
 After Zeppelin has started successfully, go to 
[http://localhost:8080](http://localhost:8080) with your web browser.
 
-By default Zeppelin is listening at `127.0.0.1:8080`, so you can't access it 
when it is deployed in another remote machine.
+By default Zeppelin is listening at `127.0.0.1:8080`, so you can't access it 
when it is deployed on another remote machine.
 To access a remote Zeppelin, you need to change `zeppelin.server.addr` to 
`0.0.0.0` in `conf/zeppelin-site.xml`.
 
+Check the log file at `ZEPPELIN_HOME/logs/zeppelin-server-*.log` if you cannot open Zeppelin.
+
 #### Stopping Zeppelin
 
 ```
@@ -84,15 +86,27 @@ Make sure that 
[docker](https://www.docker.com/community-edition) is installed i
 Use this command to launch Apache Zeppelin in a container.
 
 ```bash
-docker run -p 8080:8080 --rm --name zeppelin apache/zeppelin:0.9.0
+docker run -p 8080:8080 --rm --name zeppelin apache/zeppelin:0.10.0
 
 ```
+
 To persist `logs` and `notebook` directories, use the 
[volume](https://docs.docker.com/engine/reference/commandline/run/#mount-volume--v-read-only)
 option for docker container.
 
 ```bash
-docker run -p 8080:8080 --rm -v $PWD/logs:/logs -v $PWD/notebook:/notebook \
+docker run -u $(id -u) -p 8080:8080 --rm -v $PWD/logs:/logs -v 
$PWD/notebook:/notebook \
            -e ZEPPELIN_LOG_DIR='/logs' -e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
-           --name zeppelin apache/zeppelin:0.9.0
+           --name zeppelin apache/zeppelin:0.10.0
+```
+
+`-u $(id -u)` makes sure you have permission to write logs and notebooks.
+
+Many interpreters require other dependencies, e.g. the Spark interpreter requires a Spark binary distribution
+and the Flink interpreter requires a Flink binary distribution. You can also mount them via a docker volume, e.g.
+
+```bash
+docker run -u $(id -u) -p 8080:8080 --rm -v /mnt/disk1/notebook:/notebook \
+-v /usr/lib/spark-current:/opt/spark -v /mnt/disk1/flink-1.12.2:/opt/flink -e 
FLINK_HOME=/opt/flink  \
+-e SPARK_HOME=/opt/spark  -e ZEPPELIN_NOTEBOOK_DIR='/notebook' --name zeppelin 
apache/zeppelin:0.10.0
 ```
 
 If you have trouble accessing `localhost:8080` in the browser, Please clear 
browser cache.
@@ -146,13 +160,15 @@ Congratulations, you have successfully installed Apache 
Zeppelin! Here are a few
 
 #### New to Apache Zeppelin...
  * For an in-depth overview, head to [Explore Zeppelin 
UI](../quickstart/explore_ui.html).
- * And then, try run [Tutorial 
Notebook](http://localhost:8080/#/notebook/2A94M5J1Z) in your Zeppelin.
+ * And then, try running the Tutorial Notebooks shipped with your Zeppelin distribution.
  * And see how to change 
[configurations](../setup/operation/configuration.html) like port number, etc.
 
-#### Spark, Python, SQL, and more 
+#### Spark, Flink, SQL, Python, R and more 
  * [Spark support in Zeppelin](./spark_with_zeppelin.html), to know more about 
deep integration with [Apache Spark](http://spark.apache.org/). 
+ * [Flink support in Zeppelin](./flink_with_zeppelin.html), to know more about 
deep integration with [Apache Flink](http://flink.apache.org/).
  * [SQL support in Zeppelin](./sql_with_zeppelin.html) for SQL support
  * [Python support in Zeppelin](./python_with_zeppelin.html), for Matplotlib, 
Pandas, Conda/Docker integration.
+ * [R support in Zeppelin](./r_with_zeppelin.html)
  * [All Available Interpreters](../#available-interpreters)
 
 #### Multi-user support ...
diff --git a/docs/quickstart/python_with_zeppelin.md 
b/docs/quickstart/python_with_zeppelin.md
index 80237f8..76b3d58 100644
--- a/docs/quickstart/python_with_zeppelin.md
+++ b/docs/quickstart/python_with_zeppelin.md
@@ -27,16 +27,17 @@ limitations under the License.
 
 The following guides explain how to use Apache Zeppelin that enables you to 
write in Python:
 
+- supports [vanilla 
python](../interpreter/python.html#vanilla-python-interpreter-python) and 
[ipython](../interpreter/python.html#ipython-interpreter-pythonipython-recommended)
 - supports flexible python environments using 
[conda](../interpreter/python.html#conda), 
[docker](../interpreter/python.html#docker)  
 - can query using 
[PandasSQL](../interpreter/python.html#sql-over-pandas-dataframes)
 - also, provides [PySpark](../interpreter/spark.html)
+- [run python interpreter in yarn 
cluster](../interpreter/python.html#run-python-in-yarn-cluster) with customized 
conda python environment.
 - with [matplotlib 
integration](../interpreter/python.html#matplotlib-integration)
-- support 
[ipython](../interpreter/python.html#ipython-interpreter-pythonipython-recommended)
 
 - can create results including **UI widgets** using [Dynamic 
Form](../interpreter/python.html#using-zeppelin-dynamic-forms)
 
 <br/>
 
-For the further information about Spark support in Zeppelin, please check 
+For further information about Python support in Zeppelin, please check
 
 - [Python Interpreter](../interpreter/python.html)
 
diff --git a/docs/quickstart/r_with_zeppelin.md 
b/docs/quickstart/r_with_zeppelin.md
new file mode 100644
index 0000000..f9b9feb
--- /dev/null
+++ b/docs/quickstart/r_with_zeppelin.md
@@ -0,0 +1,42 @@
+---
+layout: page
+title: "R with Zeppelin"
+description: ""
+group: quickstart
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+{% include JB/setup %}
+
+# R support in Zeppelin
+
+<div id="toc"></div>
+
+<br/>
+
+The following guides explain how to use Apache Zeppelin that enables you to 
write in R:
+
+- Supports [vanilla R](../interpreter/r.html#how-to-use-r-interpreter) and 
[IRkernel](../interpreter/r.html#how-to-use-r-interpreter)
+- Visualize R dataframe via [ZeppelinContext](../interpreter/r.html#zshow)
+- [Run R interpreter in yarn 
cluster](../interpreter/r.html#run-r-in-yarn-cluster) with customized conda R 
environment.
+- [Make R Shiny App](../interpreter/r.html#make-shiny-app-in-zeppelin)
+
+<br/>
+
+For further information about R support in Zeppelin, please check
+
+- [R Interpreter](../interpreter/r.html)
+
+
+
diff --git a/docs/quickstart/spark_with_zeppelin.md 
b/docs/quickstart/spark_with_zeppelin.md
index 6b35beb..7250f00 100644
--- a/docs/quickstart/spark_with_zeppelin.md
+++ b/docs/quickstart/spark_with_zeppelin.md
@@ -28,12 +28,13 @@ limitations under the License.
 For a brief overview of Apache Spark fundamentals with Apache Zeppelin, see 
the following guide:
 
 - **built-in** Apache Spark integration.
-- with [SparkSQL](http://spark.apache.org/sql/), 
[PySpark](https://spark.apache.org/docs/latest/api/python/pyspark.html), 
[SparkR](https://spark.apache.org/docs/latest/sparkr.html)
-- inject 
[SparkContext](https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html),
 [SQLContext](https://spark.apache.org/docs/latest/sql-programming-guide.html) 
and 
[SparkSession](https://spark.apache.org/docs/latest/sql-programming-guide.html) 
automatically
-- canceling job and displaying its progress 
-- supporting [Spark Cluster 
Mode](../setup/deployment/spark_cluster_mode.html#apache-zeppelin-on-spark-cluster-mode)
 for external spark clusters
-- supports [different context per user / 
note](../usage/interpreter/interpreter_binding_mode.html) 
-- sharing variables among PySpark, SparkR and Spark through 
[ZeppelinContext](../interpreter/spark.html#zeppelincontext)
+- With [Spark Scala](https://spark.apache.org/docs/latest/quick-start.html), [SparkSQL](http://spark.apache.org/sql/), [PySpark](https://spark.apache.org/docs/latest/api/python/pyspark.html), [SparkR](https://spark.apache.org/docs/latest/sparkr.html)
+- Inject 
[SparkContext](https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html),
 [SQLContext](https://spark.apache.org/docs/latest/sql-programming-guide.html) 
and 
[SparkSession](https://spark.apache.org/docs/latest/sql-programming-guide.html) 
automatically
+- Canceling jobs and displaying their progress
+- Supports different modes: local, standalone, yarn(client & cluster), k8s
+- Dependency management
+- Supports [different context per user / 
note](../usage/interpreter/interpreter_binding_mode.html) 
+- Sharing variables among PySpark, SparkR and Spark through 
[ZeppelinContext](../interpreter/spark.html#zeppelincontext)
 - [Livy Interpreter](../interpreter/livy.html)
 
 <br/>
diff --git a/docs/quickstart/sql_with_zeppelin.md 
b/docs/quickstart/sql_with_zeppelin.md
index df63ccd..e007f20 100644
--- a/docs/quickstart/sql_with_zeppelin.md
+++ b/docs/quickstart/sql_with_zeppelin.md
@@ -38,6 +38,7 @@ The following guides explain how to use Apache Zeppelin that 
enables you to writ
   * [Apache Tajo](../interpreter/jdbc.html#apache-tajo)
   * and so on 
 - [Spark Interpreter](../interpreter/spark.html) supports 
[SparkSQL](http://spark.apache.org/sql/)
+- [Flink Interpreter](../interpreter/flink.html) supports [Flink 
SQL](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/table/sql/overview/)
 - [Python Interpreter](../interpreter/python.html) supports 
[pandasSQL](../interpreter/python.html#sql-over-pandas-dataframes) 
 - can create query result including **UI widgets** using [Dynamic 
Form](../usage/dynamic_form/intro.html)
 
@@ -56,6 +57,7 @@ For the further information about SQL support in Zeppelin, 
please check
 
 - [JDBC Interpreter](../interpreter/jdbc.html)
 - [Spark Interpreter](../interpreter/spark.html)
+- [Flink Interpreter](../interpreter/flink.html)
 - [Python Interpreter](../interpreter/python.html)
 - [IgniteSQL Interpreter](../interpreter/ignite.html#ignite-sql-interpreter) 
for [Apache Ignite](https://ignite.apache.org/)
 - [Kylin Interpreter](../interpreter/kylin.html) for [Apache 
Kylin](http://kylin.apache.org/)
diff --git a/docs/setup/deployment/flink_and_spark_cluster.md 
b/docs/setup/deployment/flink_and_spark_cluster.md
index c793651..8aaa495 100644
--- a/docs/setup/deployment/flink_and_spark_cluster.md
+++ b/docs/setup/deployment/flink_and_spark_cluster.md
@@ -20,6 +20,8 @@ limitations under the License.
 
 {% include JB/setup %}
 
+<font color=red>This document is outdated; it has not been verified with the latest Zeppelin.</font>
+
 # Install with Flink and Spark cluster
 
 <div id="toc"></div>
diff --git a/docs/usage/interpreter/dependency_management.md 
b/docs/usage/interpreter/dependency_management.md
index f4aeb44..d616d4e 100644
--- a/docs/usage/interpreter/dependency_management.md
+++ b/docs/usage/interpreter/dependency_management.md
@@ -24,13 +24,14 @@ limitations under the License.
 
 You can include external libraries to interpreter by setting dependencies in 
interpreter menu.
 
+Note that this approach doesn't work for the Spark and Flink interpreters. They have their own dependency management; please refer to their docs for details.
+
 When your code requires external library, instead of doing 
download/copy/restart Zeppelin, you can easily do following jobs in this menu.
 
  * Load libraries recursively from Maven repository
  * Load libraries from local filesystem
  * Add additional maven repository
- * Automatically add libraries to SparkCluster
-
+ 
 <hr>
 <div class="row">
   <div class="col-md-6">
diff --git a/notebook/Spark Tutorial/8. PySpark Conda Env in Yarn 
Mode_2GE79Y5FV.zpln b/notebook/Spark Tutorial/8. PySpark Conda Env in Yarn 
Mode_2GE79Y5FV.zpln
new file mode 100644
index 0000000..7532541
--- /dev/null
+++ b/notebook/Spark Tutorial/8. PySpark Conda Env in Yarn Mode_2GE79Y5FV.zpln  
@@ -0,0 +1,1397 @@
+{
+  "paragraphs": [
+    {
+      "text": "%md\n\nThis tutorial is for how to customize pyspark runtime 
environment via conda in yarn-cluster mode.\nIn this approach, the spark 
interpreter (driver) and spark executor all run in yarn containers. \nAnd 
remember this approach only works when ipython is enabled, so make sure you 
include the following python packages in your conda env which are required for 
ipython.\n\n* jupyter\n* grpcio\n* protobuf\n\nThis tutorial is only verified 
with spark 3.1.2, other versions of  [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:25:07.164",
+      "progress": 0,
+      "config": {
+        "tableHide": false,
+        "editorSetting": {
+          "language": "markdown",
+          "editOnDblClick": true,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/markdown",
+        "fontSize": 9.0,
+        "editorHide": true,
+        "title": false,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "HTML",
+            "data": "\u003cdiv 
class\u003d\"markdown-body\"\u003e\n\u003cp\u003eThis tutorial is for how to 
customize pyspark runtime environment via conda in yarn-cluster mode.\u003cbr 
/\u003e\nIn this approach, the spark interpreter (driver) and spark executor 
all run in yarn containers.\u003cbr /\u003e\nAnd remember this approach only 
works when ipython is enabled, so make sure you include the following python 
packages in your conda env which are required for ipython.\u003c/p\u003e\n\ 
[...]
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499052501_412639221",
+      "id": "paragraph_1616510705826_532544979",
+      "dateCreated": "2021-08-09 16:50:52.501",
+      "dateStarted": "2021-08-09 20:25:07.190",
+      "dateFinished": "2021-08-09 20:25:09.774",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Create Python conda env",
+      "text": "%sh\n\n# make sure you have conda and momba installed.\n# 
install miniconda: https://docs.conda.io/en/latest/miniconda.html\n# install 
mamba: https://github.com/mamba-org/mamba\n\necho \"name: 
pyspark_env\nchannels:\n  - conda-forge\n  - defaults\ndependencies:\n  - 
python\u003d3.8 \n  - jupyter\n  - grpcio\n  - protobuf\n  - pandasql\n  - 
pycodestyle\n  # use numpy \u003c 1.20, otherwise the following pandas udf 
example will fail, see https://github.com/Azure/MachineLearn [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:25:09.790",
+      "progress": 0,
+      "config": {
+        "editorSetting": {
+          "language": "sh",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/sh",
+        "fontSize": 9.0,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "TEXT",
+            "data": "\nRemove all packages in environment 
/mnt/disk1/jzhang/miniconda3/envs/pyspark_env:\n\npkgs/r/linux-64           
\npkgs/main/noarch          \npkgs/main/linux-64        \npkgs/r/noarch         
    \nconda-forge/noarch        \nconda-forge/linux-64      \nTransaction\n\n  
Prefix: /mnt/disk1/jzhang/miniconda3/envs/pyspark_env\n\n  Updating specs:\n\n  
 - python\u003d3.8\n   - jupyter\n   - grpcio\n   - protobuf\n   - pandasql\n   
- pycodestyle\n   - numpy\u003d\u003d1. [...]
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499052502_43002557",
+      "id": "paragraph_1617163651950_276096757",
+      "dateCreated": "2021-08-09 16:50:52.502",
+      "dateStarted": "2021-08-09 20:25:09.796",
+      "dateFinished": "2021-08-09 20:26:01.854",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Create Python conda tar",
+      "text": "%sh\n\nrm -rf pyspark_env.tar.gz\nconda pack -n pyspark_env\n",
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:26:01.935",
+      "progress": 0,
+      "config": {
+        "editorSetting": {
+          "language": "sh",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/sh",
+        "fontSize": 9.0,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "TEXT",
+            "data": "Collecting packages...\nPacking environment at 
\u0027/mnt/disk1/jzhang/miniconda3/envs/pyspark_env\u0027 to 
\u0027pyspark_env.tar.gz\u0027\n\r[                                        ] | 
0% Completed |  0.0s\r[                                        ] | 0% Completed 
|  0.1s\r[                                        ] | 0% Completed |  0.2s\r[   
                                     ] | 0% Completed |  0.3s\r[                
                        ] | 1% Completed |   [...]
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499052502_1290046721",
+      "id": "paragraph_1617170106834_1523620028",
+      "dateCreated": "2021-08-09 16:50:52.502",
+      "dateStarted": "2021-08-09 20:26:01.944",
+      "dateFinished": "2021-08-09 20:27:03.580",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Upload Python conda tar to hdfs",
+      "text": "%sh\n\nhadoop fs -rmr /tmp/pyspark_env.tar.gz\nhadoop fs -put 
pyspark_env.tar.gz /tmp\n# The python conda tar should be public accessible, so 
need to change permission here.\nhadoop fs -chmod 644 
/tmp/pyspark_env.tar.gz\n",
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:27:03.588",
+      "progress": 0,
+      "config": {
+        "editorSetting": {
+          "language": "sh",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/sh",
+        "fontSize": 9.0,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "TEXT",
+            "data": "rmr: DEPRECATED: Please use \u0027-rm -r\u0027 
instead.\n21/08/09 20:27:05 INFO fs.TrashPolicyDefault: Moved: 
\u0027hdfs://emr-header-1.cluster-46718:9000/tmp/pyspark_env.tar.gz\u0027 to 
trash at: 
hdfs://emr-header-1.cluster-46718:9000/user/hadoop/.Trash/Current/tmp/pyspark_env.tar.gz\n"
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499052503_165730412",
+      "id": "paragraph_1617163700271_1335210825",
+      "dateCreated": "2021-08-09 16:50:52.503",
+      "dateStarted": "2021-08-09 20:27:03.591",
+      "dateFinished": "2021-08-09 20:27:10.407",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Configure Spark Interpreter",
+      "text": "%spark.conf\n\n# set the following 2 properties to run spark in 
yarn-cluster mode\nspark.master yarn\nspark.submit.deployMode 
cluster\n\nspark.driver.memory 4g\nspark.executor.memory 4g\n\n# 
spark.yarn.dist.archives can be either local file or hdfs 
file\nspark.yarn.dist.archives hdfs:///tmp/pyspark_env.tar.gz#environment\n# 
spark.yarn.dist.archives 
pyspark_env.tar.gz#environment\n\nzeppelin.interpreter.conda.env.name 
environment\n\nspark.sql.execution.arrow.pyspark.enabled [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:27:10.499",
+      "progress": 0,
+      "config": {
+        "editorSetting": {
+          "language": "text",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/text",
+        "fontSize": 9.0,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": []
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499052503_1438301861",
+      "id": "paragraph_1616750271530_2029224504",
+      "dateCreated": "2021-08-09 16:50:52.503",
+      "dateStarted": "2021-08-09 20:27:10.506",
+      "dateFinished": "2021-08-09 20:27:10.517",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Use Matplotlib",
+      "text": "%md\n\nThe following example use matplotlib in pyspark. Here 
the matplotlib is only used in spark driver.\n",
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:27:10.603",
+      "progress": 0,
+      "config": {
+        "tableHide": false,
+        "editorSetting": {
+          "language": "markdown",
+          "editOnDblClick": true,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/markdown",
+        "fontSize": 9.0,
+        "editorHide": true,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "HTML",
+            "data": "\u003cdiv 
class\u003d\"markdown-body\"\u003e\n\u003cp\u003eThe following example use 
matplotlib in pyspark. Here the matplotlib is only used in spark 
driver.\u003c/p\u003e\n\n\u003c/div\u003e"
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628502898787_1101584010",
+      "id": "paragraph_1628502898787_1101584010",
+      "dateCreated": "2021-08-09 17:54:58.787",
+      "dateStarted": "2021-08-09 20:27:10.607",
+      "dateFinished": "2021-08-09 20:27:10.614",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Use Matplotlib",
+      "text": "%spark.pyspark\n\n%matplotlib inline\n\nimport 
matplotlib.pyplot as plt\n\nplt.plot([1,2,3,4])\nplt.ylabel(\u0027some 
numbers\u0027)\nplt.show()\n",
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:27:10.707",
+      "progress": 0,
+      "config": {
+        "editorSetting": {
+          "language": "python",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": true
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/python",
+        "fontSize": 9.0,
+        "title": false,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "IMG",
+            "data": 
"iVBORw0KGgoAAAANSUhEUgAAAYIAAAD4CAYAAADhNOGaAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8rg+JYAAAACXBIWXMAAAsTAAALEwEAmpwYAAAmEUlEQVR4nO3dd3xV9f3H8dcHCHsbRhhhb4KIYTjqHoAo4mitra1aRa3+OhUQtahYd4etVcSqldbaWsKS4d5boJLBDEv2lIQVsj6/P+7194sxkBvIzcnNfT8fjzy499zvvfdzPJg355zv+Rxzd0REJH7VCroAEREJloJARCTOKQhEROKcgkBEJM4pCERE4lydoAuoqMTERO/cuXPQZYiIxJRFixbtdPdWZb0Wc0HQuXNnFi5cGHQZIiIxxczWH+41HRoSEYlzCgIRkTinIBARiXMKAhGROKcgEBGJc1EPAjOrbWb
 [...]
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499052504_67784564",
+      "id": "paragraph_1623916874799_812799753",
+      "dateCreated": "2021-08-09 16:50:52.504",
+      "dateStarted": "2021-08-09 20:27:10.711",
+      "dateFinished": "2021-08-09 20:28:22.417",
+      "status": "FINISHED"
+    },
+    {
+      "title": "PySpark UDF using Pandas and PyArrow",
+      "text": "%md\n\nFollowing are examples of using pandas and pyarrow in 
udf. Here we use python packages in both spark driver and executors. All the 
examples are from [apache spark official 
document](https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html#recommended-pandas-and-pyarrow-versions)",
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:22.458",
+      "progress": 0,
+      "config": {
+        "tableHide": false,
+        "editorSetting": {
+          "language": "markdown",
+          "editOnDblClick": true,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/markdown",
+        "fontSize": 9.0,
+        "editorHide": true,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "HTML",
+            "data": "\u003cdiv 
class\u003d\"markdown-body\"\u003e\n\u003cp\u003eFollowing are examples of 
using pandas and pyarrow in udf. Here we use python packages in both spark 
driver and executors. All the examples are from \u003ca 
href\u003d\"https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html#recommended-pandas-and-pyarrow-versions\"\u003eapache
 spark official document\u003c/a\u003e\u003c/p\u003e\n\n\u003c/div\u003e"
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628502428567_60098788",
+      "id": "paragraph_1628502428567_60098788",
+      "dateCreated": "2021-08-09 17:47:08.568",
+      "dateStarted": "2021-08-09 20:28:22.461",
+      "dateFinished": "2021-08-09 20:28:22.478",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Enabling for Conversion to/from Pandas",
+      "text": "%md\n\nArrow is available as an optimization when converting a 
Spark DataFrame to a Pandas DataFrame using the call `DataFrame.toPandas()` and 
when creating a Spark DataFrame from a Pandas DataFrame with 
`SparkSession.createDataFrame()`. To use Arrow when executing these calls, 
users need to first set the Spark configuration 
`spark.sql.execution.arrow.pyspark.enabled` to true. This is disabled by 
default.\n\nIn addition, optimizations enabled by `spark.sql.execution.arrow. 
[...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:22.561",
+      "progress": 0,
+      "config": {
+        "tableHide": false,
+        "editorSetting": {
+          "language": "markdown",
+          "editOnDblClick": true,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/markdown",
+        "fontSize": 9.0,
+        "editorHide": true,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "HTML",
+            "data": "\u003cdiv 
class\u003d\"markdown-body\"\u003e\n\u003cp\u003eArrow is available as an 
optimization when converting a Spark DataFrame to a Pandas DataFrame using the 
call \u003ccode\u003eDataFrame.toPandas()\u003c/code\u003e and when creating a 
Spark DataFrame from a Pandas DataFrame with 
\u003ccode\u003eSparkSession.createDataFrame()\u003c/code\u003e. To use Arrow 
when executing these calls, users need to first set the Spark configuration 
\u003ccode\u003espark.sql.exec [...]
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628503042999_590218180",
+      "id": "paragraph_1628503042999_590218180",
+      "dateCreated": "2021-08-09 17:57:22.999",
+      "dateStarted": "2021-08-09 20:28:22.565",
+      "dateFinished": "2021-08-09 20:28:22.574",
+      "status": "FINISHED"
+    },
+    {
+      "text": "%spark.pyspark\n\nimport pandas as pd\nimport numpy as np\n\n# 
Generate a Pandas DataFrame\npdf \u003d pd.DataFrame(np.random.rand(100, 
3))\n\n# Create a Spark DataFrame from a Pandas DataFrame using Arrow\ndf 
\u003d spark.createDataFrame(pdf)\n\n# Convert the Spark DataFrame back to a 
Pandas DataFrame using Arrow\nresult_pdf \u003d df.select(\"*\").toPandas()\n",
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:22.664",
+      "progress": 0,
+      "config": {
+        "editorSetting": {
+          "language": "python",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": true
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/python",
+        "fontSize": 9.0,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": []
+      },
+      "apps": [],
+      "runtimeInfos": {
+        "jobUrl": {
+          "propertyName": "jobUrl",
+          "label": "SPARK JOB",
+          "tooltip": "View in Spark web UI",
+          "group": "spark",
+          "values": [
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d0";
+            }
+          ],
+          "interpreterSettingId": "spark"
+        }
+      },
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499052504_328504071",
+      "id": "paragraph_1628487947468_761461400",
+      "dateCreated": "2021-08-09 16:50:52.504",
+      "dateStarted": "2021-08-09 20:28:22.668",
+      "dateFinished": "2021-08-09 20:28:27.045",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Pandas UDFs (a.k.a. Vectorized UDFs)",
+      "text": "%md\n\nPandas UDFs are user defined functions that are executed 
by Spark using Arrow to transfer data and Pandas to work with the data, which 
allows vectorized operations. A Pandas UDF is defined using the `pandas_udf()` 
as a decorator or to wrap the function, and no additional configuration is 
required. A Pandas UDF behaves as a regular PySpark function API in 
general.\n\nBefore Spark 3.0, Pandas UDFs used to be defined with 
`pyspark.sql.functions.PandasUDFType`. From Spa [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:27.071",
+      "progress": 0,
+      "config": {
+        "tableHide": false,
+        "editorSetting": {
+          "language": "markdown",
+          "editOnDblClick": true,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/markdown",
+        "fontSize": 9.0,
+        "editorHide": true,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "HTML",
+            "data": "\u003cdiv 
class\u003d\"markdown-body\"\u003e\n\u003cp\u003ePandas UDFs are user defined 
functions that are executed by Spark using Arrow to transfer data and Pandas to 
work with the data, which allows vectorized operations. A Pandas UDF is defined 
using the \u003ccode\u003epandas_udf()\u003c/code\u003e as a decorator or to 
wrap the function, and no additional configuration is required. A Pandas UDF 
behaves as a regular PySpark function API in general.\u003c/p\u003e\n [...]
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628503123247_32503996",
+      "id": "paragraph_1628503123247_32503996",
+      "dateCreated": "2021-08-09 17:58:43.248",
+      "dateStarted": "2021-08-09 20:28:27.075",
+      "dateFinished": "2021-08-09 20:28:27.083",
+      "status": "FINISHED"
+    },
+    {
+      "text": "%spark.pyspark\n\nimport pandas as pd\n\nfrom 
pyspark.sql.functions import pandas_udf\n\n@pandas_udf(\"col1 string, col2 
long\")\ndef func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -\u003e 
pd.DataFrame:\n    s3[\u0027col2\u0027] \u003d s1 + s2.str.len()\n    return 
s3\n\n# Create a Spark DataFrame that has three columns including a struct 
column.\ndf \u003d spark.createDataFrame(\n    [[1, \"a string\", (\"a nested 
string\",)]],\n    \"long_col long, string_col strin [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:27.174",
+      "progress": 0,
+      "config": {
+        "editorSetting": {
+          "language": "python",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": true
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/python",
+        "fontSize": 9.0,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "TEXT",
+            "data": "root\n |-- long_col: long (nullable \u003d true)\n |-- 
string_col: string (nullable \u003d true)\n |-- struct_col: struct (nullable 
\u003d true)\n |    |-- col1: string (nullable \u003d true)\n\nroot\n |-- 
func(long_col, string_col, struct_col): struct (nullable \u003d true)\n |    
|-- col1: string (nullable \u003d true)\n |    |-- col2: long (nullable \u003d 
true)\n\n"
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499507315_836384477",
+      "id": "paragraph_1628499507315_836384477",
+      "dateCreated": "2021-08-09 16:58:27.315",
+      "dateStarted": "2021-08-09 20:28:27.177",
+      "dateFinished": "2021-08-09 20:28:27.797",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Series to Series",
+      "text": "%md\n\nThe type hint can be expressed as `pandas.Series`, … 
-\u003e `pandas.Series`.\n\nBy using `pandas_udf()` with the function having 
such type hints above, it creates a Pandas UDF where the given function takes 
one or more `pandas.Series` and outputs one `pandas.Series`. The output of the 
function should always be of the same length as the input. Internally, PySpark 
will execute a Pandas UDF by splitting columns into batches and calling the 
function for each batch as a [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:27.878",
+      "progress": 0,
+      "config": {
+        "tableHide": false,
+        "editorSetting": {
+          "language": "markdown",
+          "editOnDblClick": true,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/markdown",
+        "fontSize": 9.0,
+        "editorHide": true,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "HTML",
+            "data": "\u003cdiv 
class\u003d\"markdown-body\"\u003e\n\u003cp\u003eThe type hint can be expressed 
as \u003ccode\u003epandas.Series\u003c/code\u003e, … -\u0026gt; 
\u003ccode\u003epandas.Series\u003c/code\u003e.\u003c/p\u003e\n\u003cp\u003eBy 
using \u003ccode\u003epandas_udf()\u003c/code\u003e with the function having 
such type hints above, it creates a Pandas UDF where the given function takes 
one or more \u003ccode\u003epandas.Series\u003c/code\u003e and outputs one 
\u003cco [...]
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628503203208_1371139053",
+      "id": "paragraph_1628503203208_1371139053",
+      "dateCreated": "2021-08-09 18:00:03.208",
+      "dateStarted": "2021-08-09 20:28:27.881",
+      "dateFinished": "2021-08-09 20:28:27.889",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Series to Series",
+      "text": "%spark.pyspark\n\nimport pandas as pd\n\nfrom 
pyspark.sql.functions import col, pandas_udf\nfrom pyspark.sql.types import 
LongType\n\n# Declare the function and create the UDF\ndef multiply_func(a: 
pd.Series, b: pd.Series) -\u003e pd.Series:\n    return a * b\n\nmultiply 
\u003d pandas_udf(multiply_func, returnType\u003dLongType())\n\n# The function 
for a pandas_udf should be able to execute with local Pandas data\nx \u003d 
pd.Series([1, 2, 3])\nprint(multiply_func(x, x))\n [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:27.981",
+      "progress": 0,
+      "config": {
+        "editorSetting": {
+          "language": "python",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": true
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/python",
+        "fontSize": 9.0,
+        "title": false,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "TEXT",
+            "data": "0    1\n1    4\n2    9\ndtype: 
int64\n+-------------------+\n|multiply_func(x, x)|\n+-------------------+\n|   
               1|\n|                  4|\n|                  
9|\n+-------------------+\n\n"
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {
+        "jobUrl": {
+          "propertyName": "jobUrl",
+          "label": "SPARK JOB",
+          "tooltip": "View in Spark web UI",
+          "group": "spark",
+          "values": [
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d1";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d2";
+            }
+          ],
+          "interpreterSettingId": "spark"
+        }
+      },
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499530530_1328752796",
+      "id": "paragraph_1628499530530_1328752796",
+      "dateCreated": "2021-08-09 16:58:50.530",
+      "dateStarted": "2021-08-09 20:28:27.984",
+      "dateFinished": "2021-08-09 20:28:29.754",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Iterator of Series to Iterator of Series",
+      "text": "%md\n\nThe type hint can be expressed as 
`Iterator[pandas.Series]` -\u003e `Iterator[pandas.Series]`.\n\nBy using 
`pandas_udf()` with the function having such type hints above, it creates a 
Pandas UDF where the given function takes an iterator of pandas.Series and 
outputs an iterator of `pandas.Series`. The length of the entire output from 
the function should be the same length of the entire input; therefore, it can 
prefetch the data from the input iterator as long as the  [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:29.785",
+      "progress": 0,
+      "config": {
+        "tableHide": false,
+        "editorSetting": {
+          "language": "markdown",
+          "editOnDblClick": true,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/markdown",
+        "fontSize": 9.0,
+        "editorHide": true,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "HTML",
+            "data": "\u003cdiv 
class\u003d\"markdown-body\"\u003e\n\u003cp\u003eThe type hint can be expressed 
as \u003ccode\u003eIterator[pandas.Series]\u003c/code\u003e -\u0026gt; 
\u003ccode\u003eIterator[pandas.Series]\u003c/code\u003e.\u003c/p\u003e\n\u003cp\u003eBy
 using \u003ccode\u003epandas_udf()\u003c/code\u003e with the function having 
such type hints above, it creates a Pandas UDF where the given function takes 
an iterator of pandas.Series and outputs an iterator of \u003ccode [...]
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628503263767_1381148085",
+      "id": "paragraph_1628503263767_1381148085",
+      "dateCreated": "2021-08-09 18:01:03.767",
+      "dateStarted": "2021-08-09 20:28:29.788",
+      "dateFinished": "2021-08-09 20:28:29.798",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Iterator of Series to Iterator of Series",
+      "text": "%spark.pyspark\n\nfrom typing import Iterator\n\nimport pandas 
as pd\n\nfrom pyspark.sql.functions import pandas_udf\n\npdf \u003d 
pd.DataFrame([1, 2, 3], columns\u003d[\"x\"])\ndf \u003d 
spark.createDataFrame(pdf)\n\n# Declare the function and create the 
UDF\n@pandas_udf(\"long\")\ndef plus_one(iterator: Iterator[pd.Series]) -\u003e 
Iterator[pd.Series]:\n    for x in iterator:\n        yield x + 
1\n\ndf.select(plus_one(\"x\")).show()",
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:29.888",
+      "progress": 0,
+      "config": {
+        "editorSetting": {
+          "language": "python",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": true
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/python",
+        "fontSize": 9.0,
+        "title": false,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "TEXT",
+            "data": "+-----------+\n|plus_one(x)|\n+-----------+\n|          
2|\n|          3|\n|          4|\n+-----------+\n\n"
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {
+        "jobUrl": {
+          "propertyName": "jobUrl",
+          "label": "SPARK JOB",
+          "tooltip": "View in Spark web UI",
+          "group": "spark",
+          "values": [
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d3";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d4";
+            }
+          ],
+          "interpreterSettingId": "spark"
+        }
+      },
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499052505_1336286916",
+      "id": "paragraph_1624351615156_2079208031",
+      "dateCreated": "2021-08-09 16:50:52.505",
+      "dateStarted": "2021-08-09 20:28:29.891",
+      "dateFinished": "2021-08-09 20:28:30.361",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Iterator of Multiple Series to Iterator of Series",
+      "text": "%md\n\nThe type hint can be expressed as 
`Iterator[Tuple[pandas.Series, ...]]` -\u003e `Iterator[pandas.Series]`.\n\nBy 
using `pandas_udf()` with the function having such type hints above, it creates 
a Pandas UDF where the given function takes an iterator of a tuple of multiple 
pandas.Series and outputs an iterator of `pandas.Series`. In this case, the 
created pandas UDF requires multiple input columns as many as the series in the 
tuple when the Pandas UDF is called. Other [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:30.391",
+      "progress": 0,
+      "config": {
+        "tableHide": false,
+        "editorSetting": {
+          "language": "markdown",
+          "editOnDblClick": true,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/markdown",
+        "fontSize": 9.0,
+        "editorHide": true,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "HTML",
+            "data": "\u003cdiv 
class\u003d\"markdown-body\"\u003e\n\u003cp\u003eThe type hint can be expressed 
as \u003ccode\u003eIterator[Tuple[pandas.Series, ...]]\u003c/code\u003e 
-\u0026gt; 
\u003ccode\u003eIterator[pandas.Series]\u003c/code\u003e.\u003c/p\u003e\n\u003cp\u003eBy
 using \u003ccode\u003epandas_udf()\u003c/code\u003e with the function having 
such type hints above, it creates a Pandas UDF where the given function takes 
an iterator of a tuple of multiple pandas.Series and o [...]
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628503345832_1007259088",
+      "id": "paragraph_1628503345832_1007259088",
+      "dateCreated": "2021-08-09 18:02:25.832",
+      "dateStarted": "2021-08-09 20:28:30.395",
+      "dateFinished": "2021-08-09 20:28:30.402",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Iterator of Multiple Series to Iterator of Series",
+      "text": "%spark.pyspark\n\nfrom typing import Iterator, Tuple\n\nimport 
pandas as pd\n\nfrom pyspark.sql.functions import pandas_udf\n\npdf \u003d 
pd.DataFrame([1, 2, 3], columns\u003d[\"x\"])\ndf \u003d 
spark.createDataFrame(pdf)\n\n# Declare the function and create the 
UDF\n@pandas_udf(\"long\")\ndef multiply_two_cols(\n        iterator: 
Iterator[Tuple[pd.Series, pd.Series]]) -\u003e Iterator[pd.Series]:\n    for a, 
b in iterator:\n        yield a * b\n\ndf.select(multiply_two_co [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:30.495",
+      "progress": 0,
+      "config": {
+        "editorSetting": {
+          "language": "python",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": true
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/python",
+        "fontSize": 9.0,
+        "title": false,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "TEXT",
+            "data": "+-----------------------+\n|multiply_two_cols(x, 
x)|\n+-----------------------+\n|                      1|\n|                    
  4|\n|                      9|\n+-----------------------+\n\n"
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {
+        "jobUrl": {
+          "propertyName": "jobUrl",
+          "label": "SPARK JOB",
+          "tooltip": "View in Spark web UI",
+          "group": "spark",
+          "values": [
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d5";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d6";
+            }
+          ],
+          "interpreterSettingId": "spark"
+        }
+      },
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499631390_1986337647",
+      "id": "paragraph_1628499631390_1986337647",
+      "dateCreated": "2021-08-09 17:00:31.390",
+      "dateStarted": "2021-08-09 20:28:30.498",
+      "dateFinished": "2021-08-09 20:28:32.071",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Series to Scalar",
+      "text": "%md\n\nThe type hint can be expressed as `pandas.Series`, … 
-\u003e `Any`.\n\nBy using `pandas_udf()` with the function having such type 
hints above, it creates a Pandas UDF similar to PySpark’s aggregate functions. 
The given function takes pandas.Series and returns a scalar value. The return 
type should be a primitive data type, and the returned scalar can be either a 
python primitive type, e.g., int or float or a numpy data type, e.g., 
numpy.int64 or numpy.float64. Any s [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:32.098",
+      "progress": 0,
+      "config": {
+        "tableHide": false,
+        "editorSetting": {
+          "language": "markdown",
+          "editOnDblClick": true,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/markdown",
+        "fontSize": 9.0,
+        "editorHide": true,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "HTML",
+            "data": "\u003cdiv 
class\u003d\"markdown-body\"\u003e\n\u003cp\u003eThe type hint can be expressed 
as \u003ccode\u003epandas.Series\u003c/code\u003e, … -\u0026gt; 
\u003ccode\u003eAny\u003c/code\u003e.\u003c/p\u003e\n\u003cp\u003eBy using 
\u003ccode\u003epandas_udf()\u003c/code\u003e with the function having such 
type hints above, it creates a Pandas UDF similar to PySpark’s aggregate 
functions. The given function takes pandas.Series and returns a scalar value. 
The return type [...]
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628503394877_382217858",
+      "id": "paragraph_1628503394877_382217858",
+      "dateCreated": "2021-08-09 18:03:14.877",
+      "dateStarted": "2021-08-09 20:28:32.101",
+      "dateFinished": "2021-08-09 20:28:32.109",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Series to Scalar",
+      "text": "%spark.pyspark\n\nimport pandas as pd\n\nfrom 
pyspark.sql.functions import pandas_udf\nfrom pyspark.sql import Window\n\ndf 
\u003d spark.createDataFrame(\n    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 
10.0)],\n    (\"id\", \"v\"))\n\n# Declare the function and create the 
UDF\n@pandas_udf(\"double\")\ndef mean_udf(v: pd.Series) -\u003e float:\n    
return 
v.mean()\n\ndf.select(mean_udf(df[\u0027v\u0027])).show()\n\n\ndf.groupby(\"id\").agg(mean_udf(df[\u0027v\u0027])).sho
 [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:32.201",
+      "progress": 88,
+      "config": {
+        "editorSetting": {
+          "language": "python",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": true
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/python",
+        "fontSize": 9.0,
+        "title": false,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "TEXT",
+            "data": "+-----------+\n|mean_udf(v)|\n+-----------+\n|        
4.2|\n+-----------+\n\n+---+-----------+\n| 
id|mean_udf(v)|\n+---+-----------+\n|  1|        1.5|\n|  2|        
6.0|\n+---+-----------+\n\n+---+----+------+\n| id|   
v|mean_v|\n+---+----+------+\n|  1| 1.0|   1.5|\n|  1| 2.0|   1.5|\n|  2| 3.0|  
 6.0|\n|  2| 5.0|   6.0|\n|  2|10.0|   6.0|\n+---+----+------+\n\n"
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {
+        "jobUrl": {
+          "propertyName": "jobUrl",
+          "label": "SPARK JOB",
+          "tooltip": "View in Spark web UI",
+          "group": "spark",
+          "values": [
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d7";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d8";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d9";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d10";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d11";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d12";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d13";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d14";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d15";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d16";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d17";
+            }
+          ],
+          "interpreterSettingId": "spark"
+        }
+      },
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499645163_1973652977",
+      "id": "paragraph_1628499645163_1973652977",
+      "dateCreated": "2021-08-09 17:00:45.163",
+      "dateStarted": "2021-08-09 20:28:32.204",
+      "dateFinished": "2021-08-09 20:28:42.919",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Pandas Function APIs",
+      "text": "%md\n\nPandas Function APIs can directly apply a Python native 
function against the whole DataFrame by using Pandas instances. Internally it 
works similarly with Pandas UDFs by using Arrow to transfer data and Pandas to 
work with the data, which allows vectorized operations. However, a Pandas 
Function API behaves as a regular API under PySpark DataFrame instead of 
Column, and Python type hints in Pandas Functions APIs are optional and do not 
affect how it works internally  [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:43.011",
+      "progress": 0,
+      "config": {
+        "tableHide": false,
+        "editorSetting": {
+          "language": "markdown",
+          "editOnDblClick": true,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/markdown",
+        "fontSize": 9.0,
+        "editorHide": true,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "HTML",
+            "data": "\u003cdiv 
class\u003d\"markdown-body\"\u003e\n\u003cp\u003ePandas Function APIs can 
directly apply a Python native function against the whole DataFrame by using 
Pandas instances. Internally it works similarly with Pandas UDFs by using Arrow 
to transfer data and Pandas to work with the data, which allows vectorized 
operations. However, a Pandas Function API behaves as a regular API under 
PySpark DataFrame instead of Column, and Python type hints in Pandas Functions 
AP [...]
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628503449747_446542293",
+      "id": "paragraph_1628503449747_446542293",
+      "dateCreated": "2021-08-09 18:04:09.747",
+      "dateStarted": "2021-08-09 20:28:43.014",
+      "dateFinished": "2021-08-09 20:28:43.025",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Grouped Map",
+      "text": "%md\n\nGrouped map operations with Pandas instances are 
supported by `DataFrame.groupby().applyInPandas()` which requires a Python 
function that takes a `pandas.DataFrame` and returns another `pandas.DataFrame`. 
It maps each group to each pandas.DataFrame in the Python function.\n\nThis API 
implements the “split-apply-combine” pattern which consists of three 
steps:\n\n* Split the data into groups by using `DataFrame.groupBy()`.\n* Apply 
a function on each group. The input a [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:43.114",
+      "progress": 0,
+      "config": {
+        "tableHide": false,
+        "editorSetting": {
+          "language": "markdown",
+          "editOnDblClick": true,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/markdown",
+        "fontSize": 9.0,
+        "editorHide": true,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "HTML",
+            "data": "\u003cdiv 
class\u003d\"markdown-body\"\u003e\n\u003cp\u003eGrouped map operations with 
Pandas instances are supported by 
\u003ccode\u003eDataFrame.groupby().applyInPandas()\u003c/code\u003e which 
requires a Python function that takes a 
\u003ccode\u003epandas.DataFrame\u003c/code\u003e and returns another 
\u003ccode\u003epandas.DataFrame\u003c/code\u003e. It maps each group to each 
pandas.DataFrame in the Python function.\u003c/p\u003e\n\u003cp\u003eThis API 
implements [...]
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628503542685_1593420516",
+      "id": "paragraph_1628503542685_1593420516",
+      "dateCreated": "2021-08-09 18:05:42.685",
+      "dateStarted": "2021-08-09 20:28:43.117",
+      "dateFinished": "2021-08-09 20:28:43.127",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Grouped Map",
+      "text": "%spark.pyspark\n\ndf \u003d spark.createDataFrame(\n    [(1, 
1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],\n    (\"id\", \"v\"))\n\ndef 
subtract_mean(pdf):\n    # pdf is a pandas.DataFrame\n    v \u003d pdf.v\n    
return pdf.assign(v\u003dv - 
v.mean())\n\ndf.groupby(\"id\").applyInPandas(subtract_mean, schema\u003d\"id 
long, v double\").show()",
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:43.217",
+      "progress": 75,
+      "config": {
+        "editorSetting": {
+          "language": "python",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": true
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/python",
+        "fontSize": 9.0,
+        "title": false,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "TEXT",
+            "data": "+---+----+\n| id|   v|\n+---+----+\n|  1|-0.5|\n|  1| 
0.5|\n|  2|-3.0|\n|  2|-1.0|\n|  2| 4.0|\n+---+----+\n\n"
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {
+        "jobUrl": {
+          "propertyName": "jobUrl",
+          "label": "SPARK JOB",
+          "tooltip": "View in Spark web UI",
+          "group": "spark",
+          "values": [
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d18";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d19";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d20";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d21";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d22";
+            }
+          ],
+          "interpreterSettingId": "spark"
+        }
+      },
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499671399_1794474062",
+      "id": "paragraph_1628499671399_1794474062",
+      "dateCreated": "2021-08-09 17:01:11.399",
+      "dateStarted": "2021-08-09 20:28:43.220",
+      "dateFinished": "2021-08-09 20:28:44.605",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Map",
+      "text": "%md\n\nMap operations with Pandas instances are supported by 
`DataFrame.mapInPandas()` which maps an iterator of pandas.DataFrames to 
another iterator of `pandas.DataFrames` that represents the current PySpark 
DataFrame and returns the result as a PySpark DataFrame. The function takes and 
outputs an iterator of `pandas.DataFrame`. It can return the output of 
arbitrary length in contrast to some Pandas UDFs although internally it works 
similarly with Series to Series Pandas [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:44.621",
+      "progress": 0,
+      "config": {
+        "tableHide": false,
+        "editorSetting": {
+          "language": "markdown",
+          "editOnDblClick": true,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/markdown",
+        "fontSize": 9.0,
+        "editorHide": true,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "HTML",
+            "data": "\u003cdiv 
class\u003d\"markdown-body\"\u003e\n\u003cp\u003eMap operations with Pandas 
instances are supported by 
\u003ccode\u003eDataFrame.mapInPandas()\u003c/code\u003e which maps an iterator 
of pandas.DataFrames to another iterator of 
\u003ccode\u003epandas.DataFrames\u003c/code\u003e that represents the current 
PySpark DataFrame and returns the result as a PySpark DataFrame. The function 
takes and outputs an iterator of \u003ccode\u003epandas.DataFrame\u003c/code\ 
[...]
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628502659243_294355457",
+      "id": "paragraph_1628502659243_294355457",
+      "dateCreated": "2021-08-09 17:50:59.243",
+      "dateStarted": "2021-08-09 20:28:44.624",
+      "dateFinished": "2021-08-09 20:28:44.630",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Map",
+      "text": "%spark.pyspark\n\ndf \u003d spark.createDataFrame([(1, 21), (2, 
30)], (\"id\", \"age\"))\n\ndef filter_func(iterator):\n    for pdf in 
iterator:\n        yield pdf[pdf.id \u003d\u003d 
1]\n\ndf.mapInPandas(filter_func, schema\u003ddf.schema).show()",
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:44.724",
+      "progress": 0,
+      "config": {
+        "editorSetting": {
+          "language": "python",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": true
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/python",
+        "fontSize": 9.0,
+        "title": false,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "TEXT",
+            "data": "+---+---+\n| id|age|\n+---+---+\n|  1| 21|\n+---+---+\n\n"
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {
+        "jobUrl": {
+          "propertyName": "jobUrl",
+          "label": "SPARK JOB",
+          "tooltip": "View in Spark web UI",
+          "group": "spark",
+          "values": [
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d23";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d24";
+            }
+          ],
+          "interpreterSettingId": "spark"
+        }
+      },
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499682627_2106140471",
+      "id": "paragraph_1628499682627_2106140471",
+      "dateCreated": "2021-08-09 17:01:22.627",
+      "dateStarted": "2021-08-09 20:28:44.729",
+      "dateFinished": "2021-08-09 20:28:46.155",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Co-grouped Map",
+      "text": "%md\n\n\u003cbr/\u003e\n\nCo-grouped map operations with Pandas 
instances are supported by `DataFrame.groupby().cogroup().applyInPandas()` 
which allows two PySpark DataFrames to be cogrouped by a common key and then a 
Python function applied to each cogroup. It consists of the following 
steps:\n\n* Shuffle the data such that the groups of each dataframe which share 
a key are cogrouped together.\n* Apply a function to each cogroup. The input of 
the function is two `pandas.D [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:46.229",
+      "progress": 0,
+      "config": {
+        "tableHide": false,
+        "editorSetting": {
+          "language": "markdown",
+          "editOnDblClick": true,
+          "completionKey": "TAB",
+          "completionSupport": false
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/markdown",
+        "fontSize": 9.0,
+        "editorHide": true,
+        "title": true,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "HTML",
+            "data": "\u003cdiv 
class\u003d\"markdown-body\"\u003e\n\u003cbr/\u003e\n\u003cp\u003eCo-grouped 
map operations with Pandas instances are supported by 
\u003ccode\u003eDataFrame.groupby().cogroup().applyInPandas()\u003c/code\u003e 
which allows two PySpark DataFrames to be cogrouped by a common key and then a 
Python function applied to each cogroup. It consists of the following 
steps:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eShuffle the data such that 
the groups of each data [...]
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628502751727_153024564",
+      "id": "paragraph_1628502751727_153024564",
+      "dateCreated": "2021-08-09 17:52:31.727",
+      "dateStarted": "2021-08-09 20:28:46.233",
+      "dateFinished": "2021-08-09 20:28:46.242",
+      "status": "FINISHED"
+    },
+    {
+      "title": "Co-grouped Map",
+      "text": "%spark.pyspark\n\nimport pandas as pd\n\ndf1 \u003d 
spark.createDataFrame(\n    [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 
1, 3.0), (20000102, 2, 4.0)],\n    (\"time\", \"id\", \"v1\"))\n\ndf2 \u003d 
spark.createDataFrame(\n    [(20000101, 1, \"x\"), (20000101, 2, \"y\")],\n    
(\"time\", \"id\", \"v2\"))\n\ndef asof_join(l, r):\n    return 
pd.merge_asof(l, r, on\u003d\"time\", 
by\u003d\"id\")\n\ndf1.groupby(\"id\").cogroup(df2.groupby(\"id\")).applyInPandas(\n
    [...]
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:46.332",
+      "progress": 22,
+      "config": {
+        "editorSetting": {
+          "language": "python",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": true
+        },
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/python",
+        "fontSize": 9.0,
+        "title": false,
+        "results": {},
+        "enabled": true
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": [
+          {
+            "type": "TEXT",
+            "data": "+--------+---+---+---+\n|    time| id| v1| 
v2|\n+--------+---+---+---+\n|20000101|  1|1.0|  x|\n|20000102|  1|3.0|  
x|\n|20000101|  2|2.0|  y|\n|20000102|  2|4.0|  y|\n+--------+---+---+---+\n\n"
+          }
+        ]
+      },
+      "apps": [],
+      "runtimeInfos": {
+        "jobUrl": {
+          "propertyName": "jobUrl",
+          "label": "SPARK JOB",
+          "tooltip": "View in Spark web UI",
+          "group": "spark",
+          "values": [
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d25";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d26";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d27";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d28";
+            },
+            {
+              "jobUrl": 
"http://emr-worker-2.cluster-46718:37989/jobs/job?id\u003d29";
+            }
+          ],
+          "interpreterSettingId": "spark"
+        }
+      },
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628499694411_1813984093",
+      "id": "paragraph_1628499694411_1813984093",
+      "dateCreated": "2021-08-09 17:01:34.411",
+      "dateStarted": "2021-08-09 20:28:46.335",
+      "dateFinished": "2021-08-09 20:28:48.012",
+      "status": "FINISHED"
+    },
+    {
+      "text": "%spark.pyspark\n",
+      "user": "anonymous",
+      "dateUpdated": "2021-08-09 20:28:48.036",
+      "progress": 0,
+      "config": {
+        "colWidth": 12.0,
+        "editorMode": "ace/mode/python",
+        "fontSize": 9.0,
+        "results": {},
+        "enabled": true,
+        "editorSetting": {
+          "language": "python",
+          "editOnDblClick": false,
+          "completionKey": "TAB",
+          "completionSupport": true
+        }
+      },
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "results": {
+        "code": "SUCCESS",
+        "msg": []
+      },
+      "apps": [],
+      "runtimeInfos": {},
+      "progressUpdateIntervalMs": 500,
+      "jobName": "paragraph_1628502158993_661405207",
+      "id": "paragraph_1628502158993_661405207",
+      "dateCreated": "2021-08-09 17:42:38.993",
+      "dateStarted": "2021-08-09 20:28:48.040",
+      "dateFinished": "2021-08-09 20:28:48.261",
+      "status": "FINISHED"
+    }
+  ],
+  "name": "8. PySpark Conda Env in Yarn Mode",
+  "id": "2GE79Y5FV",
+  "defaultInterpreterGroup": "spark",
+  "version": "0.10.0-SNAPSHOT",
+  "noteParams": {},
+  "noteForms": {},
+  "angularObjects": {},
+  "config": {
+    "personalizedMode": "false",
+    "looknfeel": "default",
+    "isZeppelinNotebookCronEnable": false
+  },
+  "info": {
+    "isRunning": true
+  }
+}
\ No newline at end of file
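
The "Upload Python conda tar to hdfs" paragraph in the note above uses `hadoop fs -rmr`,
which the recorded output itself flags as deprecated. Below is a minimal sketch of the
same upload step with the current `-rm -r` syntax, assuming the same local archive name
and HDFS path as in the note; the `-f` flag is an addition here that only suppresses the
error when the target does not exist yet.

```sh
# Remove any previous copy of the archive; -f avoids an error on the first run.
hadoop fs -rm -r -f /tmp/pyspark_env.tar.gz

# Upload the conda archive that spark.yarn.dist.archives points to.
hadoop fs -put pyspark_env.tar.gz /tmp

# The archive must be readable by the YARN containers.
hadoop fs -chmod 644 /tmp/pyspark_env.tar.gz
```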
