This is an automated email from the ASF dual-hosted git repository.
srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 513f76a [SPARK-30934][ML][DOCS] Update ml-guide and ml-migration-guide for 3.0 release
513f76a is described below
commit 513f76ac38f174ca1adf6faff087b95f4d8f9750
Author: Huaxin Gao <[email protected]>
AuthorDate: Sat Mar 7 18:09:00 2020 -0600
[SPARK-30934][ML][DOCS] Update ml-guide and ml-migration-guide for 3.0 release
### What changes were proposed in this pull request?
Update ml-guide and ml-migration-guide for 3.0.
### Why are the changes needed?
This is required for each release.
### Does this PR introduce any user-facing change?
Yes.




### How was this patch tested?
Manually build and check.
Closes #27785 from huaxingao/spark-30934.
Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
---
docs/ml-guide.md | 46 ++++++++++++++++++++---------------------
docs/ml-migration-guide.md | 51 +++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 72 insertions(+), 25 deletions(-)
diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index 7b4fa4f..2037285 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -41,8 +41,6 @@ The primary Machine Learning API for Spark is now the [DataFrame](sql-programmin
* MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
* MLlib will not add new features to the RDD-based API.
* In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
-* After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
-* The RDD-based API is expected to be removed in Spark 3.0.
*Why is MLlib switching to the DataFrame-based API?*
@@ -87,31 +85,31 @@ To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4
[^1]: To learn more about the benefits and background of system optimised natives, you may wish to watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
-# Highlights in 2.3
+# Highlights in 3.0
-The list below highlights some of the new features and enhancements added to MLlib in the `2.3`
+The list below highlights some of the new features and enhancements added to MLlib in the `3.0`
release of Spark:
-* Built-in support for reading images into a `DataFrame` was added
-([SPARK-21866](https://issues.apache.org/jira/browse/SPARK-21866)).
-* [`OneHotEncoderEstimator`](ml-features.html#onehotencoderestimator) was added, and should be
-used instead of the existing `OneHotEncoder` transformer. The new estimator supports
-transforming multiple columns.
-* Multiple column support was also added to `QuantileDiscretizer` and `Bucketizer`
-([SPARK-22397](https://issues.apache.org/jira/browse/SPARK-22397) and
-[SPARK-20542](https://issues.apache.org/jira/browse/SPARK-20542))
-* A new [`FeatureHasher`](ml-features.html#featurehasher) transformer was added
- ([SPARK-13969](https://issues.apache.org/jira/browse/SPARK-13969)).
-* Added support for evaluating multiple models in parallel when performing cross-validation using
-[`TrainValidationSplit` or `CrossValidator`](ml-tuning.html)
-([SPARK-19357](https://issues.apache.org/jira/browse/SPARK-19357)).
-* Improved support for custom pipeline components in Python (see
-[SPARK-21633](https://issues.apache.org/jira/browse/SPARK-21633) and
-[SPARK-21542](https://issues.apache.org/jira/browse/SPARK-21542)).
-* `DataFrame` functions for descriptive summary statistics over vector columns
-([SPARK-19634](https://issues.apache.org/jira/browse/SPARK-19634)).
-* Robust linear regression with Huber loss
-([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)).
+* Multiple column support was added to `Binarizer` ([SPARK-23578](https://issues.apache.org/jira/browse/SPARK-23578)), `StringIndexer` ([SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215)), `StopWordsRemover` ([SPARK-29808](https://issues.apache.org/jira/browse/SPARK-29808)) and PySpark `QuantileDiscretizer` ([SPARK-22796](https://issues.apache.org/jira/browse/SPARK-22796)).
+* Support for Tree-Based Feature Transformation was added
+([SPARK-13677](https://issues.apache.org/jira/browse/SPARK-13677)).
+* Two new evaluators `MultilabelClassificationEvaluator` ([SPARK-16692](https://issues.apache.org/jira/browse/SPARK-16692)) and `RankingEvaluator` ([SPARK-28045](https://issues.apache.org/jira/browse/SPARK-28045)) were added.
+* Sample weights support was added in `DecisionTreeClassifier/Regressor` ([SPARK-19591](https://issues.apache.org/jira/browse/SPARK-19591)), `RandomForestClassifier/Regressor` ([SPARK-9478](https://issues.apache.org/jira/browse/SPARK-9478)), `GBTClassifier/Regressor` ([SPARK-9612](https://issues.apache.org/jira/browse/SPARK-9612)), `RegressionEvaluator` ([SPARK-24102](https://issues.apache.org/jira/browse/SPARK-24102)), `BinaryClassificationEvaluator` ([SPARK-24103](https://issues.apach [...]
+* R API for `PowerIterationClustering` was added
+([SPARK-19827](https://issues.apache.org/jira/browse/SPARK-19827)).
+* Added Spark ML listener for tracking ML pipeline status
+([SPARK-23674](https://issues.apache.org/jira/browse/SPARK-23674)).
+* Fit with validation set was added to Gradient Boosted Trees in Python
+([SPARK-24333](https://issues.apache.org/jira/browse/SPARK-24333)).
+* [`RobustScaler`](ml-features.html#robustscaler) transformer was added
+([SPARK-28399](https://issues.apache.org/jira/browse/SPARK-28399)).
+* [`Factorization Machines`](ml-classification-regression.html#factorization-machines) classifier and regressor were added
+([SPARK-29224](https://issues.apache.org/jira/browse/SPARK-29224)).
+* Gaussian Naive Bayes ([SPARK-16872](https://issues.apache.org/jira/browse/SPARK-16872)) and Complement Naive Bayes ([SPARK-29942](https://issues.apache.org/jira/browse/SPARK-29942)) were added.
+* ML function parity between Scala and Python
+([SPARK-28958](https://issues.apache.org/jira/browse/SPARK-28958)).
+* `predictRaw` is made public in all the Classification models. `predictProbability` is made public in all the Classification models except `LinearSVCModel`.
+([SPARK-30358](https://issues.apache.org/jira/browse/SPARK-30358)).
# Migration Guide
diff --git a/docs/ml-migration-guide.md b/docs/ml-migration-guide.md
index f3cd762..4e6d68f 100644
--- a/docs/ml-migration-guide.md
+++ b/docs/ml-migration-guide.md
@@ -33,16 +33,65 @@ Please refer [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.
* `OneHotEncoder` which is deprecated in 2.3, is removed in 3.0 and `OneHotEncoderEstimator` is now renamed to `OneHotEncoder`.
* `org.apache.spark.ml.image.ImageSchema.readImages` which is deprecated in 2.3, is removed in 3.0, use `spark.read.format('image')` instead.
+* `org.apache.spark.mllib.clustering.KMeans.train` with param Int `runs` which is deprecated in 2.1, is removed in 3.0. Use `train` method without `runs` instead.
+* `org.apache.spark.mllib.classification.LogisticRegressionWithSGD` which is deprecated in 2.0, is removed in 3.0, use `org.apache.spark.ml.classification.LogisticRegression` or `spark.mllib.classification.LogisticRegressionWithLBFGS` instead.
+* `org.apache.spark.mllib.feature.ChiSqSelectorModel.isSorted` which is deprecated in 2.1, is removed in 3.0, is not intended for subclasses to use.
+* `org.apache.spark.mllib.regression.RidgeRegressionWithSGD` which is deprecated in 2.0, is removed in 3.0, use `org.apache.spark.ml.regression.LinearRegression` with `elasticNetParam` = 0.0. Note the default `regParam` is 0.01 for `RidgeRegressionWithSGD`, but is 0.0 for `LinearRegression`.
+* `org.apache.spark.mllib.regression.LassoWithSGD` which is deprecated in 2.0, is removed in 3.0, use `org.apache.spark.ml.regression.LinearRegression` with `elasticNetParam` = 1.0. Note the default `regParam` is 0.01 for `LassoWithSGD`, but is 0.0 for `LinearRegression`.
+* `org.apache.spark.mllib.regression.LinearRegressionWithSGD` which is deprecated in 2.0, is removed in 3.0, use `org.apache.spark.ml.regression.LinearRegression` or `LBFGS` instead.
+* `org.apache.spark.mllib.clustering.KMeans.getRuns` and `setRuns` which are deprecated in 2.1, are removed in 3.0, have no effect since Spark 2.0.0.
+* `org.apache.spark.ml.LinearSVCModel.setWeightCol` which is deprecated in 2.4, is removed in 3.0, is not intended for users.
+* From 3.0, `org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel` extends `MultilayerPerceptronParams` to expose the training params. As a result, `layers` in `MultilayerPerceptronClassificationModel` has been changed from `Array[Int]` to `IntArrayParam`. Users should use `MultilayerPerceptronClassificationModel.getLayers` instead of `MultilayerPerceptronClassificationModel.layers` to retrieve the size of layers.
+* `org.apache.spark.ml.classification.GBTClassifier.numTrees` which is deprecated in 2.4.5, is removed in 3.0, use `getNumTrees` instead.
+* `org.apache.spark.ml.clustering.KMeansModel.computeCost` which is deprecated in 2.4, is removed in 3.0, use `ClusteringEvaluator` instead.
+* The member variable `precision` in `org.apache.spark.mllib.evaluation.MulticlassMetrics` which is deprecated in 2.0, is removed in 3.0. Use `accuracy` instead.
+* The member variable `recall` in `org.apache.spark.mllib.evaluation.MulticlassMetrics` which is deprecated in 2.0, is removed in 3.0. Use `accuracy` instead.
+* The member variable `fMeasure` in `org.apache.spark.mllib.evaluation.MulticlassMetrics` which is deprecated in 2.0, is removed in 3.0. Use `accuracy` instead.
+* `org.apache.spark.ml.util.GeneralMLWriter.context` which is deprecated in 2.0, is removed in 3.0, use `session` instead.
+* `org.apache.spark.ml.util.MLWriter.context` which is deprecated in 2.0, is removed in 3.0, use `session` instead.
+* `org.apache.spark.ml.util.MLReader.context` which is deprecated in 2.0, is removed in 3.0, use `session` instead.
+* `abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT, T]]` is changed to `abstract class UnaryTransformer[IN: TypeTag, OUT: TypeTag, T <: UnaryTransformer[IN, OUT, T]]` in 3.0.
-### Changes of behavior
+### Deprecations and changes of behavior
{:.no_toc}
+**Deprecations**
+
+* [SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215):
+`labels` in `StringIndexerModel` is deprecated and will be removed in 3.1.0. Use `labelsArray` instead.
+* [SPARK-25758](https://issues.apache.org/jira/browse/SPARK-25758):
+`computeCost` in `BisectingKMeansModel` is deprecated and will be removed in future versions. Use `ClusteringEvaluator` instead.
+
+**Changes of behavior**
+
* [SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215):
In Spark 2.4 and previous versions, when specifying `frequencyDesc` or `frequencyAsc` as
`stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of
strings is undefined. Since Spark 3.0, the strings with equal frequency are further
sorted by alphabet. And since Spark 3.0, `StringIndexer` supports encoding multiple
columns.
 * [SPARK-20604](https://issues.apache.org/jira/browse/SPARK-20604):
 In releases prior to 3.0, `Imputer` required the input column to be Double or Float. In 3.0, this
 restriction is lifted so `Imputer` can handle all numeric types.
+* [SPARK-23469](https://issues.apache.org/jira/browse/SPARK-23469):
+In Spark 3.0, the `HashingTF` Transformer uses a corrected implementation of the murmur3 hash
+function to hash elements to vectors. `HashingTF` in Spark 3.0 will map elements to
+different positions in vectors than in Spark 2. However, `HashingTF` created with Spark 2.x
+and loaded with Spark 3.0 will still use the previous hash function and will not change behavior.
+* [SPARK-28969](https://issues.apache.org/jira/browse/SPARK-28969):
+The `setClassifier` method in PySpark's `OneVsRestModel` has been removed in 3.0 for parity with
+the Scala implementation. Callers should not need to set the classifier in the model after creation.
+* [SPARK-25790](https://issues.apache.org/jira/browse/SPARK-25790):
+ PCA adds support for matrices with more than 65535 columns in Spark 3.0.
+* [SPARK-28927](https://issues.apache.org/jira/browse/SPARK-28927):
+ When fitting an ALS model on nondeterministic input data, a rerun could previously fail with an
+ ArrayIndexOutOfBoundsException caused by a mismatch between in/out user/item blocks.
+ From 3.0, a SparkException with a clearer message is thrown instead, wrapping the original
+ ArrayIndexOutOfBoundsException.
+* [SPARK-29232](https://issues.apache.org/jira/browse/SPARK-29232):
+ In releases prior to 3.0, `RandomForestRegressionModel` did not update the parameter maps
+ of the underlying DecisionTreeRegressionModels. This is fixed in 3.0.
## Upgrading from MLlib 2.2 to 2.3
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]