Repository: spark
Updated Branches:
  refs/heads/master 024482bf5 -> e298ac91e
[SPARK-12632][PYSPARK][DOC] PySpark fpm and als parameter desc to consistent format Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the fpm and recommendation modules. Closes #10602 Closes #10897 Author: Bryan Cutler <[email protected]> Author: somideshmukh <[email protected]> Closes #11186 from BryanCutler/param-desc-consistent-fpmrecc-SPARK-12632. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e298ac91 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e298ac91 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e298ac91 Branch: refs/heads/master Commit: e298ac91e3f6177c6da83e2d8ee994d9037466da Parents: 024482b Author: Bryan Cutler <[email protected]> Authored: Mon Feb 22 12:48:37 2016 +0200 Committer: Nick Pentreath <[email protected]> Committed: Mon Feb 22 12:48:37 2016 +0200 ---------------------------------------------------------------------- docs/mllib-collaborative-filtering.md | 3 +- .../org/apache/spark/mllib/fpm/FPGrowth.scala | 2 +- .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 6 +- .../apache/spark/mllib/recommendation/ALS.scala | 114 +++++++++---------- .../MatrixFactorizationModel.scala | 4 +- python/pyspark/mllib/fpm.py | 47 +++++--- python/pyspark/mllib/recommendation.py | 89 ++++++++++++--- 7 files changed, 164 insertions(+), 101 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/spark/blob/e298ac91/docs/mllib-collaborative-filtering.md ---------------------------------------------------------------------- diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md index b8f0566..5c33292 100644 --- a/docs/mllib-collaborative-filtering.md +++ b/docs/mllib-collaborative-filtering.md @@ -21,7 +21,8 @@ following parameters: * *numBlocks* is the number of blocks used to parallelize computation (set to -1 to auto-configure). * *rank* is the number of latent factors in the model. -* *iterations* is the number of iterations to run. +* *iterations* is the number of iterations of ALS to run. ALS typically converges to a reasonable + solution in 20 iterations or less. * *lambda* specifies the regularization parameter in ALS. * *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for *implicit feedback* data. http://git-wip-us.apache.org/repos/asf/spark/blob/e298ac91/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala ---------------------------------------------------------------------- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala index 1250bc1..85d6093 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala @@ -152,7 +152,7 @@ object FPGrowthModel extends Loader[FPGrowthModel[_]] { * [[http://dx.doi.org/10.1145/335191.335372 Han et al., Mining frequent patterns without candidate * generation]]. 
* - * @param minSupport the minimal support level of the frequent pattern, any pattern appears + * @param minSupport the minimal support level of the frequent pattern, any pattern that appears * more than (minSupport * size-of-the-dataset) times will be output * @param numPartitions number of partitions used by parallel FP-growth * http://git-wip-us.apache.org/repos/asf/spark/blob/e298ac91/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala ---------------------------------------------------------------------- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala index ed49c94..94a24b5 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala @@ -38,9 +38,9 @@ import org.apache.spark.storage.StorageLevel * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns * Efficiently by Prefix-Projected Pattern Growth ([[http://doi.org/10.1109/ICDE.2001.914830]]). * - * @param minSupport the minimal support level of the sequential pattern, any pattern appears - * more than (minSupport * size-of-the-dataset) times will be output - * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears + * @param minSupport the minimal support level of the sequential pattern, any pattern that appears + * more than (minSupport * size-of-the-dataset) times will be output + * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears * less than maxPatternLength will be output * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the internal * storage format) allowed in a projected database before local http://git-wip-us.apache.org/repos/asf/spark/blob/e298ac91/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala ---------------------------------------------------------------------- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala index 33aaf85..3e619c4 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala @@ -218,7 +218,7 @@ class ALS private ( } /** - * Run ALS with the configured parameters on an input RDD of (user, product, rating) triples. + * Run ALS with the configured parameters on an input RDD of [[Rating]] objects. * Returns a MatrixFactorizationModel with feature vectors for each user and product. */ @Since("0.8.0") @@ -279,18 +279,17 @@ class ALS private ( @Since("0.8.0") object ALS { /** - * Train a matrix factorization model given an RDD of ratings given by users to some products, - * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the - * product of two lower-rank matrices of a given rank (number of features). To solve for these - * features, we run a given number of iterations of ALS. This is done using a level of - * parallelism given by `blocks`. + * Train a matrix factorization model given an RDD of ratings by users for a subset of products. + * The ratings matrix is approximated as the product of two lower-rank matrices of a given rank + * (number of features). To solve for these features, ALS is run iteratively with a configurable + * level of parallelism. 
* - * @param ratings RDD of (userID, productID, rating) pairs + * @param ratings RDD of [[Rating]] objects with userID, productID, and rating * @param rank number of features to use - * @param iterations number of iterations of ALS (recommended: 10-20) - * @param lambda regularization factor (recommended: 0.01) + * @param iterations number of iterations of ALS + * @param lambda regularization parameter * @param blocks level of parallelism to split computation into - * @param seed random seed + * @param seed random seed for initial matrix factorization model */ @Since("0.9.1") def train( @@ -305,16 +304,15 @@ object ALS { } /** - * Train a matrix factorization model given an RDD of ratings given by users to some products, - * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the - * product of two lower-rank matrices of a given rank (number of features). To solve for these - * features, we run a given number of iterations of ALS. This is done using a level of - * parallelism given by `blocks`. + * Train a matrix factorization model given an RDD of ratings by users for a subset of products. + * The ratings matrix is approximated as the product of two lower-rank matrices of a given rank + * (number of features). To solve for these features, ALS is run iteratively with a configurable + * level of parallelism. * - * @param ratings RDD of (userID, productID, rating) pairs + * @param ratings RDD of [[Rating]] objects with userID, productID, and rating * @param rank number of features to use - * @param iterations number of iterations of ALS (recommended: 10-20) - * @param lambda regularization factor (recommended: 0.01) + * @param iterations number of iterations of ALS + * @param lambda regularization parameter * @param blocks level of parallelism to split computation into */ @Since("0.8.0") @@ -329,16 +327,15 @@ object ALS { } /** - * Train a matrix factorization model given an RDD of ratings given by users to some products, - * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the - * product of two lower-rank matrices of a given rank (number of features). To solve for these - * features, we run a given number of iterations of ALS. The level of parallelism is determined - * automatically based on the number of partitions in `ratings`. + * Train a matrix factorization model given an RDD of ratings by users for a subset of products. + * The ratings matrix is approximated as the product of two lower-rank matrices of a given rank + * (number of features). To solve for these features, ALS is run iteratively with a level of + * parallelism automatically based on the number of partitions in `ratings`. * - * @param ratings RDD of (userID, productID, rating) pairs + * @param ratings RDD of [[Rating]] objects with userID, productID, and rating * @param rank number of features to use - * @param iterations number of iterations of ALS (recommended: 10-20) - * @param lambda regularization factor (recommended: 0.01) + * @param iterations number of iterations of ALS + * @param lambda regularization parameter */ @Since("0.8.0") def train(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double) @@ -347,15 +344,14 @@ object ALS { } /** - * Train a matrix factorization model given an RDD of ratings given by users to some products, - * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the - * product of two lower-rank matrices of a given rank (number of features). 
To solve for these - * features, we run a given number of iterations of ALS. The level of parallelism is determined - * automatically based on the number of partitions in `ratings`. + * Train a matrix factorization model given an RDD of ratings by users for a subset of products. + * The ratings matrix is approximated as the product of two lower-rank matrices of a given rank + * (number of features). To solve for these features, ALS is run iteratively with a level of + * parallelism automatically based on the number of partitions in `ratings`. * - * @param ratings RDD of (userID, productID, rating) pairs + * @param ratings RDD of [[Rating]] objects with userID, productID, and rating * @param rank number of features to use - * @param iterations number of iterations of ALS (recommended: 10-20) + * @param iterations number of iterations of ALS */ @Since("0.8.0") def train(ratings: RDD[Rating], rank: Int, iterations: Int) @@ -372,11 +368,11 @@ object ALS { * * @param ratings RDD of (userID, productID, rating) pairs * @param rank number of features to use - * @param iterations number of iterations of ALS (recommended: 10-20) - * @param lambda regularization factor (recommended: 0.01) + * @param iterations number of iterations of ALS + * @param lambda regularization parameter * @param blocks level of parallelism to split computation into * @param alpha confidence parameter - * @param seed random seed + * @param seed random seed for initial matrix factorization model */ @Since("0.8.1") def trainImplicit( @@ -392,16 +388,15 @@ object ALS { } /** - * Train a matrix factorization model given an RDD of 'implicit preferences' given by users - * to some products, in the form of (userID, productID, preference) pairs. We approximate the - * ratings matrix as the product of two lower-rank matrices of a given rank (number of features). - * To solve for these features, we run a given number of iterations of ALS. This is done using - * a level of parallelism given by `blocks`. + * Train a matrix factorization model given an RDD of 'implicit preferences' of users for a + * subset of products. The ratings matrix is approximated as the product of two lower-rank + * matrices of a given rank (number of features). To solve for these features, ALS is run + * iteratively with a configurable level of parallelism. * - * @param ratings RDD of (userID, productID, rating) pairs + * @param ratings RDD of [[Rating]] objects with userID, productID, and rating * @param rank number of features to use - * @param iterations number of iterations of ALS (recommended: 10-20) - * @param lambda regularization factor (recommended: 0.01) + * @param iterations number of iterations of ALS + * @param lambda regularization parameter * @param blocks level of parallelism to split computation into * @param alpha confidence parameter */ @@ -418,16 +413,16 @@ object ALS { } /** - * Train a matrix factorization model given an RDD of 'implicit preferences' given by users to - * some products, in the form of (userID, productID, preference) pairs. We approximate the - * ratings matrix as the product of two lower-rank matrices of a given rank (number of features). - * To solve for these features, we run a given number of iterations of ALS. The level of - * parallelism is determined automatically based on the number of partitions in `ratings`. + * Train a matrix factorization model given an RDD of 'implicit preferences' of users for a + * subset of products. 
The ratings matrix is approximated as the product of two lower-rank + * matrices of a given rank (number of features). To solve for these features, ALS is run + * iteratively with a level of parallelism determined automatically based on the number of + * partitions in `ratings`. * - * @param ratings RDD of (userID, productID, rating) pairs + * @param ratings RDD of [[Rating]] objects with userID, productID, and rating * @param rank number of features to use - * @param iterations number of iterations of ALS (recommended: 10-20) - * @param lambda regularization factor (recommended: 0.01) + * @param iterations number of iterations of ALS + * @param lambda regularization parameter * @param alpha confidence parameter */ @Since("0.8.1") @@ -437,16 +432,15 @@ object ALS { } /** - * Train a matrix factorization model given an RDD of 'implicit preferences' ratings given by - * users to some products, in the form of (userID, productID, rating) pairs. We approximate the - * ratings matrix as the product of two lower-rank matrices of a given rank (number of features). - * To solve for these features, we run a given number of iterations of ALS. The level of - * parallelism is determined automatically based on the number of partitions in `ratings`. - * Model parameters `alpha` and `lambda` are set to reasonable default values + * Train a matrix factorization model given an RDD of 'implicit preferences' of users for a + * subset of products. The ratings matrix is approximated as the product of two lower-rank + * matrices of a given rank (number of features). To solve for these features, ALS is run + * iteratively with a level of parallelism determined automatically based on the number of + * partitions in `ratings`. * - * @param ratings RDD of (userID, productID, rating) pairs + * @param ratings RDD of [[Rating]] objects with userID, productID, and rating * @param rank number of features to use - * @param iterations number of iterations of ALS (recommended: 10-20) + * @param iterations number of iterations of ALS */ @Since("0.8.1") def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int) http://git-wip-us.apache.org/repos/asf/spark/blob/e298ac91/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---------------------------------------------------------------------- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala index 0dc4048..628cf1d 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala @@ -206,7 +206,7 @@ class MatrixFactorizationModel @Since("0.8.0") ( } /** - * Recommends topK products for all users. + * Recommends top products for all users. * * @param num how many products to return for every user. * @return [(Int, Array[Rating])] objects, where every tuple contains a userID and an array of @@ -224,7 +224,7 @@ class MatrixFactorizationModel @Since("0.8.0") ( /** - * Recommends topK users for all products. + * Recommends top users for all products. * * @param num how many users to return for every product. 
* @return [(Int, Array[Rating])] objects, where every tuple contains a productID and an array http://git-wip-us.apache.org/repos/asf/spark/blob/e298ac91/python/pyspark/mllib/fpm.py ---------------------------------------------------------------------- diff --git a/python/pyspark/mllib/fpm.py b/python/pyspark/mllib/fpm.py index 2039dec..7a2d77a 100644 --- a/python/pyspark/mllib/fpm.py +++ b/python/pyspark/mllib/fpm.py @@ -29,7 +29,6 @@ __all__ = ['FPGrowth', 'FPGrowthModel', 'PrefixSpan', 'PrefixSpanModel'] @inherit_doc @ignore_unicode_prefix class FPGrowthModel(JavaModelWrapper): - """ .. note:: Experimental @@ -68,11 +67,15 @@ class FPGrowth(object): """ Computes an FP-Growth model that contains frequent itemsets. - :param data: The input data set, each element contains a - transaction. - :param minSupport: The minimal support level (default: `0.3`). - :param numPartitions: The number of partitions used by - parallel FP-growth (default: same as input data). + :param data: + The input data set, each element contains a transaction. + :param minSupport: + The minimal support level. + (default: 0.3) + :param numPartitions: + The number of partitions used by parallel FP-growth. A value + of -1 will use the same number as input data. + (default: -1) """ model = callMLlibFunc("trainFPGrowthModel", data, float(minSupport), int(numPartitions)) return FPGrowthModel(model) @@ -128,17 +131,27 @@ class PrefixSpan(object): @since("1.6.0") def train(cls, data, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000): """ - Finds the complete set of frequent sequential patterns in the input sequences of itemsets. - - :param data: The input data set, each element contains a sequnce of itemsets. - :param minSupport: the minimal support level of the sequential pattern, any pattern appears - more than (minSupport * size-of-the-dataset) times will be output (default: `0.1`) - :param maxPatternLength: the maximal length of the sequential pattern, any pattern appears - less than maxPatternLength will be output. (default: `10`) - :param maxLocalProjDBSize: The maximum number of items (including delimiters used in - the internal storage format) allowed in a projected database before local - processing. If a projected database exceeds this size, another - iteration of distributed prefix growth is run. (default: `32000000`) + Finds the complete set of frequent sequential patterns in the + input sequences of itemsets. + + :param data: + The input data set, each element contains a sequence of + itemsets. + :param minSupport: + The minimal support level of the sequential pattern, any + pattern that appears more than (minSupport * + size-of-the-dataset) times will be output. + (default: 0.1) + :param maxPatternLength: + The maximal length of the sequential pattern, any pattern + that appears less than maxPatternLength will be output. + (default: 10) + :param maxLocalProjDBSize: + The maximum number of items (including delimiters used in the + internal storage format) allowed in a projected database before + local processing. If a projected database exceeds this size, + another iteration of distributed prefix growth is run. 
+ (default: 32000000) """ model = callMLlibFunc("trainPrefixSpanModel", data, minSupport, maxPatternLength, maxLocalProjDBSize) http://git-wip-us.apache.org/repos/asf/spark/blob/e298ac91/python/pyspark/mllib/recommendation.py ---------------------------------------------------------------------- diff --git a/python/pyspark/mllib/recommendation.py b/python/pyspark/mllib/recommendation.py index 93e47a7..7e60255 100644 --- a/python/pyspark/mllib/recommendation.py +++ b/python/pyspark/mllib/recommendation.py @@ -138,7 +138,8 @@ class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader): @since("0.9.0") def predictAll(self, user_product): """ - Returns a list of predicted ratings for input user and product pairs. + Returns a list of predicted ratings for input user and product + pairs. """ assert isinstance(user_product, RDD), "user_product should be RDD of (user, product)" first = user_product.first() @@ -165,28 +166,33 @@ class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader): @since("1.4.0") def recommendUsers(self, product, num): """ - Recommends the top "num" number of users for a given product and returns a list - of Rating objects sorted by the predicted rating in descending order. + Recommends the top "num" number of users for a given product and + returns a list of Rating objects sorted by the predicted rating in + descending order. """ return list(self.call("recommendUsers", product, num)) @since("1.4.0") def recommendProducts(self, user, num): """ - Recommends the top "num" number of products for a given user and returns a list - of Rating objects sorted by the predicted rating in descending order. + Recommends the top "num" number of products for a given user and + returns a list of Rating objects sorted by the predicted rating in + descending order. """ return list(self.call("recommendProducts", user, num)) def recommendProductsForUsers(self, num): """ - Recommends top "num" products for all users. The number returned may be less than this. + Recommends the top "num" number of products for all users. The + number of recommendations returned per user may be less than "num". """ return self.call("wrappedRecommendProductsForUsers", num) def recommendUsersForProducts(self, num): """ - Recommends top "num" users for all products. The number returned may be less than this. + Recommends the top "num" number of users for all products. The + number of recommendations returned per product may be less than + "num". """ return self.call("wrappedRecommendUsersForProducts", num) @@ -234,11 +240,34 @@ class ALS(object): def train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, nonnegative=False, seed=None): """ - Train a matrix factorization model given an RDD of ratings given by users to some products, - in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the - product of two lower-rank matrices of a given rank (number of features). To solve for these - features, we run a given number of iterations of ALS. This is done using a level of - parallelism given by `blocks`. + Train a matrix factorization model given an RDD of ratings by users + for a subset of products. The ratings matrix is approximated as the + product of two lower-rank matrices of a given rank (number of + features). To solve for these features, ALS is run iteratively with + a configurable level of parallelism. + + :param ratings: + RDD of `Rating` or (userID, productID, rating) tuple. 
+ :param rank: + Rank of the feature matrices computed (number of features). + :param iterations: + Number of iterations of ALS. + (default: 5) + :param lambda_: + Regularization parameter. + (default: 0.01) + :param blocks: + Number of blocks used to parallelize the computation. A value + of -1 will use an auto-configured number of blocks. + (default: -1) + :param nonnegative: + A value of True will solve least-squares with nonnegativity + constraints. + (default: False) + :param seed: + Random seed for initial matrix factorization model. A value + of None will use system time as the seed. + (default: None) """ model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations, lambda_, blocks, nonnegative, seed) @@ -249,11 +278,37 @@ class ALS(object): def trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, alpha=0.01, nonnegative=False, seed=None): """ - Train a matrix factorization model given an RDD of 'implicit preferences' given by users - to some products, in the form of (userID, productID, preference) pairs. We approximate the - ratings matrix as the product of two lower-rank matrices of a given rank (number of - features). To solve for these features, we run a given number of iterations of ALS. - This is done using a level of parallelism given by `blocks`. + Train a matrix factorization model given an RDD of 'implicit + preferences' of users for a subset of products. The ratings matrix + is approximated as the product of two lower-rank matrices of a + given rank (number of features). To solve for these features, ALS + is run iteratively with a configurable level of parallelism. + + :param ratings: + RDD of `Rating` or (userID, productID, rating) tuple. + :param rank: + Rank of the feature matrices computed (number of features). + :param iterations: + Number of iterations of ALS. + (default: 5) + :param lambda_: + Regularization parameter. + (default: 0.01) + :param blocks: + Number of blocks used to parallelize the computation. A value + of -1 will use an auto-configured number of blocks. + (default: -1) + :param alpha: + A constant used in computing confidence. + (default: 0.01) + :param nonnegative: + A value of True will solve least-squares with nonnegativity + constraints. + (default: False) + :param seed: + Random seed for initial matrix factorization model. A value + of None will use system time as the seed. + (default: None) """ model = callMLlibFunc("trainImplicitALSModel", cls._prepare(ratings), rank, iterations, lambda_, blocks, alpha, nonnegative, seed) --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
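
As a quick illustration of the fpm APIs whose parameter descriptions are reworked in this patch, here is a minimal PySpark sketch of FPGrowth.train and PrefixSpan.train. The transactions, sequences, and the SparkContext are made up for illustration; only the method names, parameters, and defaults follow the docstrings in the diff above.

from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth, PrefixSpan

sc = SparkContext(appName="FPMExample")

# FPGrowth: each RDD element is one transaction (a list of items).
transactions = sc.parallelize([
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
])
# minSupport defaults to 0.3; numPartitions=-1 uses the input data's partitioning.
fp_model = FPGrowth.train(transactions, minSupport=0.5, numPartitions=-1)
for itemset in fp_model.freqItemsets().collect():
    print(itemset)  # FreqItemset(items=[...], freq=...)

# PrefixSpan: each RDD element is a sequence of itemsets.
sequences = sc.parallelize([
    [["a"], ["a", "b", "c"], ["a", "c"]],
    [["a"], ["c"], ["b", "c"]],
])
# Defaults: minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000.
ps_model = PrefixSpan.train(sequences, minSupport=0.5, maxPatternLength=5)
for seq in ps_model.freqSequences().collect():
    print(seq)  # FreqSequence(sequence=[...], freq=...)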
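
A similar sketch for the recommendation APIs touched by this patch (ALS.train, ALS.trainImplicit, and the MatrixFactorizationModel recommend methods), continuing with the SparkContext sc from the sketch above. The ratings data and parameter values are illustrative, not taken from the patch.

from pyspark.mllib.recommendation import ALS, Rating

ratings = sc.parallelize([
    Rating(1, 1, 5.0), Rating(1, 2, 1.0),
    Rating(2, 1, 4.0), Rating(2, 2, 2.0),
])

# Explicit-feedback ALS; defaults are iterations=5, lambda_=0.01, blocks=-1.
model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.01,
                  blocks=-1, nonnegative=False, seed=42)

print(model.predict(1, 2))            # predicted rating for (user 1, product 2)
print(model.recommendProducts(1, 2))  # top 2 products for user 1
top_per_user = model.recommendProductsForUsers(2).collect()

# Implicit-feedback variant; alpha is the confidence constant (default 0.01).
implicit_model = ALS.trainImplicit(ratings, rank=10, iterations=10, alpha=0.01)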
