Repository: spark
Updated Branches:
  refs/heads/master 2550533a2 -> b52603b03
[SPARK-2013] Documentation for saveAsPickleFile and pickleFile in Python

Author: Kan Zhang <[email protected]>

Closes #983 from kanzhang/SPARK-2013 and squashes the following commits:

0e128bb [Kan Zhang] [SPARK-2013] minor update
e728516 [Kan Zhang] [SPARK-2013] Documentation for saveAsPickleFile and pickleFile in Python

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b52603b0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b52603b0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b52603b0

Branch: refs/heads/master
Commit: b52603b039cdfa0f8e58ef3c6229d79e732ffc58
Parents: 2550533
Author: Kan Zhang <[email protected]>
Authored: Sat Jun 14 13:22:30 2014 -0700
Committer: Reynold Xin <[email protected]>
Committed: Sat Jun 14 13:22:30 2014 -0700

----------------------------------------------------------------------
 docs/programming-guide.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/b52603b0/docs/programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 7978468..ef0c0e3 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -377,13 +377,15 @@ Some notes on reading files with Spark:
 
 * The `textFile` method also takes an optional second argument for controlling the number of slices of the file. By default, Spark creates one slice for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of slices by passing a larger value. Note that you cannot have fewer slices than blocks.
 
-Apart from reading files as a collection of lines,
-`SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
+Apart from text files, Spark's Python API also supports several other data formats:
 
-### SequenceFile and Hadoop InputFormats
+* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
+
+* `RDD.saveAsPickleFile` and `SparkContext.pickleFile` support saving an RDD in a simple format consisting of pickled Python objects. Batching is used on pickle serialization, with default batch size 10.
 
-In addition to reading text files, PySpark supports reading ```SequenceFile```
-and any arbitrary ```InputFormat```.
+* Details on reading `SequenceFile` and arbitrary Hadoop `InputFormat` are given below.
+
+### SequenceFile and Hadoop InputFormats
 
 **Note** this feature is currently marked ```Experimental``` and is intended for advanced users. It may be replaced in future with read/write support based on SparkSQL, in which case SparkSQL is the preferred approach.
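The batching behavior the new `saveAsPickleFile` bullet describes (pickling records in groups, default batch size 10) can be sketched in plain Python, outside Spark. The helpers `pickle_batches` and `unpickle_batches` below are illustrative names, not PySpark API, and the real `saveAsPickleFile` writes the pickled batches into a SequenceFile on disk rather than returning them as a list:

```python
import pickle

def pickle_batches(records, batch_size=10):
    """Pickle records in groups of `batch_size`, mirroring the batched
    pickle serialization described for saveAsPickleFile."""
    return [pickle.dumps(records[i:i + batch_size])
            for i in range(0, len(records), batch_size)]

def unpickle_batches(blobs):
    """Flatten pickled batches back into the original record stream,
    as pickleFile would when reading."""
    records = []
    for blob in blobs:
        records.extend(pickle.loads(blob))
    return records

data = list(range(25))
blobs = pickle_batches(data)   # 25 records -> 3 pickled batches of <= 10
print(len(blobs))              # 3
assert unpickle_batches(blobs) == data
```

Batching amortizes pickle overhead across records instead of paying it once per element, which is why a batch size is exposed at all.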

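The `wholeTextFiles` vs. `textFile` contrast in the reworded bullet can likewise be illustrated with a local sketch. The functions below are hypothetical stand-ins for the PySpark methods, operating on an ordinary directory instead of an RDD:

```python
import os
import tempfile

def whole_text_files(directory):
    """Like SparkContext.wholeTextFiles: one (filename, content) pair
    per file in the directory."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path) as f:
            pairs.append((path, f.read()))
    return pairs

def text_file(directory):
    """Like SparkContext.textFile on a directory: one record per line
    across all files, with filenames discarded."""
    lines = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as f:
            lines.extend(f.read().splitlines())
    return lines

# Two small files, three lines total.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a.txt"), "w") as f:
        f.write("line1\nline2\n")
    with open(os.path.join(d, "b.txt"), "w") as f:
        f.write("line3\n")
    print(len(whole_text_files(d)))  # 2 -- one record per file
    print(len(text_file(d)))         # 3 -- one record per line
```

This is why `wholeTextFiles` suits many small files whose identity matters, while `textFile` suits line-oriented data.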