This is an automated email from the ASF dual-hosted git repository.

shaofengshi pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git


The following commit(s) were added to refs/heads/document by this push:
     new abda369  update spark doc
abda369 is described below

commit abda3691b0210b30dd2a03079669bf3be7d932e9
Author: shaofengshi <shaofeng...@apache.org>
AuthorDate: Fri May 18 11:09:30 2018 +0800

    update spark doc
---
 website/_docs23/tutorial/spark.md | 152 ++++++--------------------------------
 1 file changed, 23 insertions(+), 129 deletions(-)

diff --git a/website/_docs23/tutorial/spark.md 
b/website/_docs23/tutorial/spark.md
index 6d599ca..3553cf0 100644
--- a/website/_docs23/tutorial/spark.md
+++ b/website/_docs23/tutorial/spark.md
@@ -8,184 +8,78 @@ permalink: /docs23/tutorial/spark.html
 
 ### Introduction
 
-Kylin provides JDBC driver to query the Cube data. Spark can query SQL 
databases using JDBC driver. With this, you can query Kylin's Cube from Spark 
and then do the analysis.
+Kylin provides a JDBC driver to query the Cube data, and Spark can query SQL databases through JDBC. With this, you can query Kylin's Cubes from Spark and then do the analysis over a very large data set.
 
-But, Kylin is an OLAP system, it is not a real database: Kylin only has 
aggregated data, no raw data. If you simply load the source table into Spark as 
a data frame, some operations like "count" might be wrong if you expect to 
count the raw data. 
+But Kylin is an OLAP system, not a general-purpose database: it only has aggregated data, no raw data. If you simply load the source table into Spark as a data frame, it may not work as expected; the Cube data can be huge, and operations like "count" might return wrong results.
 
-Besides, the Cube data can be very huge which is different with normal 
database. 
+This document describes how to use Kylin as a data source in Apache Spark. You 
need to install Kylin, build a Cube, and then put Kylin's JDBC driver onto your 
Spark application's classpath. 
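+
+One way to get the driver onto the classpath is to set it in the Spark configuration before the SparkContext is created. Below is a minimal sketch; the jar names follow the Kylin 2.3.1 release and the local path is an assumption, so adjust both to your own Kylin version and environment.
+
+{% highlight Groff markup %}
+    from pyspark import SparkConf
+
+    # Assumed jar names and location; use the JDBC driver shipped with your Kylin release.
+    jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
+    jars_with_path = ','.join(['/path/to/' + x for x in jars])
+
+    conf = SparkConf()
+    conf.set("spark.jars", jars_with_path)
+    conf.set("spark.driver.extraClassPath", jars_with_path.replace(",", ":"))
+{% endhighlight %}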
 
-This document describes how to use Kylin as a data source in Apache Spark. You 
need install Kylin and build a Cube as the prerequisite. 
+### The wrong way
 
-### The wrong application
-
-The below Python application tries to load Kylin's table as a data frame, and 
then expect to get the total row count with "df.count()", but the result is 
incorrect.
+The Python application below tries to load Kylin's table directly as a data frame and then expects to get the total row count with "df.count()", but the result is incorrect.
 
 {% highlight Groff markup %}
-#!/usr/bin/env python
-
-import os
-import sys
-import traceback
-import time
-import subprocess
-import json
-import re
-
-os.environ["SPARK_HOME"] = "/usr/local/spark/"
-sys.path.append(os.environ["SPARK_HOME"]+"/python")
-
-from pyspark import SparkConf, SparkContext
-from pyspark.sql import SQLContext
-
-from pyspark.sql.functions import *
-from pyspark.sql.types import *
-
-jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
-
-class Kap(object):
-    def __init__(self):
-        print 'initializing Spark context ...'
-        sys.stdout.flush()
-
-        conf = SparkConf() 
-        conf.setMaster('yarn')
-        conf.setAppName('kap test')
-
-        wdir = os.path.dirname(os.path.realpath(__file__))
-        jars_with_path = ','.join([wdir + '/' + x for x in jars])
-
-        conf.set("spark.jars", jars_with_path)
-        conf.set("spark.yarn.archive", 
"hdfs://sandbox.hortonworks.com:8020/kylin/spark/spark-libs.jar")
-        conf.set("spark.driver.extraClassPath", 
jars_with_path.replace(",",":"))
-
-        self.sc = SparkContext(conf=conf)
-        self.sqlContext = SQLContext(self.sc)
-        print 'Spark context is initialized'
-
-        self.df = self.sqlContext.read.format('jdbc').options(
-            url='jdbc:kylin://sandbox:7070/default',
-            user='ADMIN', password='KYLIN',
-            dbtable='test_kylin_fact', 
driver='org.apache.kylin.jdbc.Driver').load()
-
-        self.df.registerTempTable("loltab")
-        print self.df.count()
-
-    def sql(self, cmd, result_tab_name='tmptable'):
-        df = self.sqlContext.sql(cmd) 
-        if df is not None:
-            df.registerTempTable(result_tab_name)
-        return df
+    from pyspark import SparkConf, SparkContext
+    from pyspark.sql import SQLContext
+
+    conf = SparkConf()
+    conf.setMaster('yarn')
+    conf.setAppName('Kylin jdbc example')
 
-    def stop(self):
-        self.sc.stop()
+    sc = SparkContext(conf=conf)
+    sql_ctx = SQLContext(sc)
 
-kap = Kap()
-try:
-    df = kap.sql(r"select count(*) from loltab")
-    df.show(truncate=False)
-except:
-    pass
-finally:
-    kap.stop()
+    # Load the Kylin table directly as a data frame (this is the wrong way).
+    df = sql_ctx.read.format('jdbc').options(
+        url='jdbc:kylin://sandbox:7070/default',
+        user='ADMIN', password='KYLIN',
+        dbtable='kylin_sales', driver='org.apache.kylin.jdbc.Driver').load()
 
+    print(df.count())
 
 {% endhighlight %}
 
 The output is:
 {% highlight Groff markup %}
-Spark context is initialized
 132
-+--------+
-|count(1)|
-+--------+
-|132     |
-+--------+
 
 {% endhighlight %}
 
 
-The result "132" here is not the total count of the origin table. The reason 
is that, Spark sends "select * from " query to Kylin, Kylin doesn't have the 
raw data, but will answer the query with aggregated data in the base Cuboid. 
The "132" is the row number of the base Cuboid, not source data. 
+The result "132" here is not the total count of the origin table. The reason 
is that, Spark sends "select * " or "select 1 " query to Kylin, Kylin doesn't 
have the raw data, but will answer the query with aggregated data in the base 
Cuboid. The "132" is the row number of the base Cuboid, not original data. 
 
 
-### The right code
+### The right way
 
-The right behavior is to push down the aggregation to Kylin, so that the Cube 
can be leveraged. Below is the correct code:
+The right behavior is to push down all possible aggregations to Kylin, so that the Cube can be leveraged and the performance will be much better than querying the source data. Below is the correct code:
 
 {% highlight Groff markup %}
-#!/usr/bin/env python
-
-import os
-import sys
-import json
-
-os.environ["SPARK_HOME"] = "/usr/local/spark/"
-sys.path.append(os.environ["SPARK_HOME"]+"/python")
-
-from pyspark import SparkConf, SparkContext
-from pyspark.sql import SQLContext
-
-from pyspark.sql.functions import *
-from pyspark.sql.types import *
-
-jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
-
 
-def demo():
-    # step 1: init
-    print 'initializing ...',
     conf = SparkConf() 
     conf.setMaster('yarn')
-    conf.setAppName('jdbc example')
-
-    wdir = os.path.dirname(os.path.realpath(__file__))
-    jars_with_path = ','.join([wdir + '/' + x for x in jars])
-
-    conf.set("spark.jars", jars_with_path)
-    conf.set("spark.yarn.archive", 
"hdfs://sandbox.hortonworks.com:8020/kylin/spark/spark-libs.jar")
-        
-    conf.set("spark.driver.extraClassPath", jars_with_path.replace(",",":"))
+    conf.setAppName('Kylin jdbc example')
 
     sc = SparkContext(conf=conf)
     sql_ctx = SQLContext(sc)
-    print 'done'
-
+  
     url='jdbc:kylin://sandbox:7070/default'
-    tab_name = '(select count(*) as total from test_kylin_fact) the_alias'
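+    # Wrap the aggregation in a subquery passed as "dbtable", so Kylin computes the
+    # count and Spark only reads the single-row result.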
+    tab_name = '(select count(*) as total from kylin_sales) the_alias'
 
-    # step 2: initiate the sql
     df = sql_ctx.read.format('jdbc').options(
             url=url, user='ADMIN', password='KYLIN',
             driver='org.apache.kylin.jdbc.Driver',
             dbtable=tab_name).load()
 
-    # many ways to obtain the results
     df.show()
 
-    print "df.count()", df.count()  # must be 1, as there is only one row
-
-    for record in df.toJSON().collect():
-        # this loop has only one iteration
-        # reach record is a string; need to be decoded to JSON
-        print 'the total column: ', json.loads(record)['TOTAL']
-
-    sc.stop()
-
-demo()
-
 {% endhighlight %}
 
-Here is the output, which is expected:
+Here is the output; the result is correct because the aggregation has been pushed down to Kylin:
 
 {% highlight Groff markup %}
-initializing ... done
 +-----+
 |TOTAL|
 +-----+
 | 2000|
 +-----+
 
-df.count() 1
-the total column:  2000
 {% endhighlight %}
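+
+If you need the aggregated value in your program rather than on the console, you can read it from the data frame. Below is a minimal sketch, assuming the same "df" as above:
+
+{% highlight Groff markup %}
+    # Collect the single result row and read the TOTAL column by name.
+    total = df.first()['TOTAL']
+    print(total)   # 2000
+{% endhighlight %}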
 
 Thanks for the input and sample code from Shuxin Yang 
(shuxinyang....@gmail.com).
