This is an automated email from the ASF dual-hosted git repository.
shaofengshi pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git
The following commit(s) were added to refs/heads/document by this push:
new 1553900 KYLIN-3383 add document for Spark JDBC
1553900 is described below
commit 155390017f298e2201f8c94812b0b9bca676bffe
Author: shaofengshi <[email protected]>
AuthorDate: Thu May 17 18:55:15 2018 +0800
KYLIN-3383 add document for Spark JDBC
---
website/_data/docs23.yml | 1 +
website/_docs23/index.md | 5 +-
website/_docs23/tutorial/spark.md | 191 ++++++++++++++++++++++++++++++++++++++
3 files changed, 195 insertions(+), 2 deletions(-)
diff --git a/website/_data/docs23.yml b/website/_data/docs23.yml
index 66b3121..97a0a6e 100644
--- a/website/_data/docs23.yml
+++ b/website/_data/docs23.yml
@@ -56,6 +56,7 @@
- tutorial/microstrategy
- tutorial/squirrel
- tutorial/flink
+ - tutorial/spark
- tutorial/hue
- tutorial/Qlik
diff --git a/website/_docs23/index.md b/website/_docs23/index.md
index 0ad0af7..6d99ee1 100644
--- a/website/_docs23/index.md
+++ b/website/_docs23/index.md
@@ -56,8 +56,9 @@ Connectivity and APIs
8. [Connect from MicroStrategy](tutorial/microstrategy.html)
9. [Connect from SQuirreL](tutorial/squirrel.html)
10. [Connect from Apache Flink](tutorial/flink.html)
-11. [Connect from Hue](tutorial/hue.html)
-12. [Connect from Qlik Sense](tutorial/Qlik.html)
+11. [Connect from Apache Spark](tutorial/spark.html)
+12. [Connect from Hue](tutorial/hue.html)
+13. [Connect from Qlik Sense](tutorial/Qlik.html)
Operations
diff --git a/website/_docs23/tutorial/spark.md b/website/_docs23/tutorial/spark.md
new file mode 100644
index 0000000..6d599ca
--- /dev/null
+++ b/website/_docs23/tutorial/spark.md
@@ -0,0 +1,191 @@
+---
+layout: docs23
+title: Apache Spark
+categories: tutorial
+permalink: /docs23/tutorial/spark.html
+---
+
+
+### Introduction
+
+Kylin provides a JDBC driver for querying the Cube data, and Spark can query SQL databases through JDBC. Combining the two, you can query Kylin's Cubes from Spark and then run your analysis there.
+
+However, Kylin is an OLAP system, not a general-purpose database: it stores only aggregated data, not the raw data. If you simply load the source table into Spark as a data frame, operations like "count" may return unexpected results when you actually expect a count of the raw rows.
+
+Besides, the Cube data can be very large, which is different from a normal database.
+
+This document describes how to use Kylin as a data source in Apache Spark. As a prerequisite, you need to install Kylin and build a Cube.
+
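+In short, Spark connects to Kylin through the Kylin JDBC driver, using a connection URL of the form "jdbc:kylin://<host>:<port>/<project>", the driver class "org.apache.kylin.jdbc.Driver", and Kylin's credentials (ADMIN/KYLIN by default). Below is a minimal sketch, assuming a SQLContext named "sqlContext" already exists; the placeholders must be replaced with your own environment, and the sections below explain why you usually want a sub-query rather than a bare table name as "dbtable". Complete, working examples follow.
+
+{% highlight Groff markup %}
+# minimal sketch only: <host>, <port> and <project> are placeholders for your
+# own Kylin instance, and "some_table" stands for a table (or sub-query) name
+df = sqlContext.read.format('jdbc').options(
+    url='jdbc:kylin://<host>:<port>/<project>',
+    user='ADMIN', password='KYLIN',
+    driver='org.apache.kylin.jdbc.Driver',
+    dbtable='some_table').load()
+{% endhighlight %}
+
+Note that the Kylin JDBC driver jar (and the jersey jars it depends on) must be on Spark's classpath; the examples below take care of this by setting "spark.jars" and "spark.driver.extraClassPath".
+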
+### The wrong application
+
+The Python application below tries to load Kylin's table as a data frame and then expects to get the total row count with "df.count()", but the result is incorrect.
+
+{% highlight Groff markup %}
+#!/usr/bin/env python
+
+import os
+import sys
+import traceback
+import time
+import subprocess
+import json
+import re
+
+os.environ["SPARK_HOME"] = "/usr/local/spark/"
+sys.path.append(os.environ["SPARK_HOME"]+"/python")
+
+from pyspark import SparkConf, SparkContext
+from pyspark.sql import SQLContext
+
+from pyspark.sql.functions import *
+from pyspark.sql.types import *
+
+jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
+
+class Kap(object):
+    def __init__(self):
+        print 'initializing Spark context ...'
+        sys.stdout.flush()
+
+        conf = SparkConf()
+        conf.setMaster('yarn')
+        conf.setAppName('kap test')
+
+        wdir = os.path.dirname(os.path.realpath(__file__))
+        jars_with_path = ','.join([wdir + '/' + x for x in jars])
+
+        conf.set("spark.jars", jars_with_path)
+        conf.set("spark.yarn.archive", "hdfs://sandbox.hortonworks.com:8020/kylin/spark/spark-libs.jar")
+        conf.set("spark.driver.extraClassPath", jars_with_path.replace(",",":"))
+
+        self.sc = SparkContext(conf=conf)
+        self.sqlContext = SQLContext(self.sc)
+        print 'Spark context is initialized'
+
+        self.df = self.sqlContext.read.format('jdbc').options(
+            url='jdbc:kylin://sandbox:7070/default',
+            user='ADMIN', password='KYLIN',
+            dbtable='test_kylin_fact', driver='org.apache.kylin.jdbc.Driver').load()
+
+        self.df.registerTempTable("loltab")
+        print self.df.count()
+
+    def sql(self, cmd, result_tab_name='tmptable'):
+        df = self.sqlContext.sql(cmd)
+        if df is not None:
+            df.registerTempTable(result_tab_name)
+        return df
+
+    def stop(self):
+        self.sc.stop()
+
+kap = Kap()
+try:
+    df = kap.sql(r"select count(*) from loltab")
+    df.show(truncate=False)
+except:
+    pass
+finally:
+    kap.stop()
+
+
+{% endhighlight %}
+
+The output is:
+{% highlight Groff markup %}
+Spark context is initialized
+132
++--------+
+|count(1)|
++--------+
+|132 |
++--------+
+
+{% endhighlight %}
+
+
+The result "132" here is not the total row count of the original table. The reason is that Spark sends a "select * from ..." query to Kylin; Kylin doesn't have the raw data, so it answers the query with the aggregated data in the base Cuboid. The "132" is therefore the row count of the base Cuboid, not of the source data.
+
+
+### The right code
+
+The right approach is to push the aggregation down to Kylin, so that the Cube can be leveraged. Below is the correct code:
+
+{% highlight Groff markup %}
+#!/usr/bin/env python
+
+import os
+import sys
+import json
+
+os.environ["SPARK_HOME"] = "/usr/local/spark/"
+sys.path.append(os.environ["SPARK_HOME"]+"/python")
+
+from pyspark import SparkConf, SparkContext
+from pyspark.sql import SQLContext
+
+from pyspark.sql.functions import *
+from pyspark.sql.types import *
+
+jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
+
+
+def demo():
+    # step 1: initialize the Spark context
+    print 'initializing ...',
+    conf = SparkConf()
+    conf.setMaster('yarn')
+    conf.setAppName('jdbc example')
+
+    wdir = os.path.dirname(os.path.realpath(__file__))
+    jars_with_path = ','.join([wdir + '/' + x for x in jars])
+
+    conf.set("spark.jars", jars_with_path)
+    conf.set("spark.yarn.archive", "hdfs://sandbox.hortonworks.com:8020/kylin/spark/spark-libs.jar")
+
+    conf.set("spark.driver.extraClassPath", jars_with_path.replace(",",":"))
+
+    sc = SparkContext(conf=conf)
+    sql_ctx = SQLContext(sc)
+    print 'done'
+
+    url='jdbc:kylin://sandbox:7070/default'
+    # "dbtable" can be a parenthesized sub-query with an alias, so the
+    # aggregation is executed by Kylin (against the Cube), not by Spark
+    tab_name = '(select count(*) as total from test_kylin_fact) the_alias'
+
+    # step 2: issue the query via JDBC
+    df = sql_ctx.read.format('jdbc').options(
+        url=url, user='ADMIN', password='KYLIN',
+        driver='org.apache.kylin.jdbc.Driver',
+        dbtable=tab_name).load()
+
+    # many ways to obtain the results
+    df.show()
+
+    print "df.count()", df.count()  # must be 1, as there is only one row
+
+    for record in df.toJSON().collect():
+        # this loop has only one iteration
+        # each record is a string; it needs to be decoded to JSON
+        print 'the total column: ', json.loads(record)['TOTAL']
+
+    sc.stop()
+
+demo()
+
+{% endhighlight %}
+
+Here is the output, which is as expected:
+
+{% highlight Groff markup %}
+initializing ... done
++-----+
+|TOTAL|
++-----+
+| 2000|
++-----+
+
+df.count() 1
+the total column: 2000
+{% endhighlight %}
+
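+The same pattern extends to other aggregations: put the whole query, including the GROUP BY, into the "dbtable" sub-query so that Kylin answers it from the Cube and Spark only receives the already-aggregated rows. Below is a minimal sketch reusing the "sql_ctx" and "url" variables from the code above (before "sc.stop()" is called); it assumes the Cube contains a dimension column named "lstg_format_name", which you should replace with a dimension from your own model.
+
+{% highlight Groff markup %}
+group_query = '(select lstg_format_name, count(*) as cnt ' \
+              'from test_kylin_fact group by lstg_format_name) the_alias'
+
+df = sql_ctx.read.format('jdbc').options(
+    url=url, user='ADMIN', password='KYLIN',
+    driver='org.apache.kylin.jdbc.Driver',
+    dbtable=group_query).load()
+
+# one row per group; the aggregation itself is computed by Kylin from the Cube
+df.show()
+{% endhighlight %}
+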
+Thanks to Shuxin Yang ([email protected]) for the input and the sample code.
--
To stop receiving notification emails like this one, please contact
[email protected].