This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.1 by this push:
new 2d4a515 [SPARK-34080][ML][PYTHON][FOLLOW-UP] Update score function in
UnivariateFeatureSelector document
2d4a515 is described below
commit 2d4a51568dde4baac6d660f378223a94b074fcec
Author: Liang-Chi Hsieh <[email protected]>
AuthorDate: Wed Feb 10 09:24:25 2021 +0900
[SPARK-34080][ML][PYTHON][FOLLOW-UP] Update score function in
UnivariateFeatureSelector document
### What changes were proposed in this pull request?
This follows up #31160 to update the score function names in the documentation.
### Why are the changes needed?
Currently the docs use `f_classif`, `chi2`, and `f_regression`, which are
sklearn's names. They are worth keeping, but it is better to give the formal
score function name alongside each sklearn name.
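As an illustration of the naming described here, the pairing of each formal score function name with its sklearn name can be sketched as a plain lookup table. This is not Spark's implementation: the names `SCORE_FUNCTIONS` and `describe_score_function` are hypothetical, and the three entries are copied from the documentation table in the diff.

```python
# Illustrative lookup only; this is NOT how Spark selects the score function.
# Keys are (featureType, labelType); values are (formal name, sklearn name),
# taken from the documentation table touched by this commit.
SCORE_FUNCTIONS = {
    ("categorical", "categorical"): ("chi-squared", "chi2"),
    ("continuous", "categorical"): ("ANOVATest", "f_classif"),
    ("continuous", "continuous"): ("F-value", "f_regression"),
}


def describe_score_function(feature_type, label_type):
    """Return 'formal name (sklearn name)' for a documented combination."""
    key = (feature_type, label_type)
    if key not in SCORE_FUNCTIONS:
        # Combinations not listed in the documentation table are rejected here.
        raise ValueError(
            f"undocumented combination: featureType={feature_type!r}, "
            f"labelType={label_type!r}"
        )
    formal_name, sklearn_name = SCORE_FUNCTIONS[key]
    return f"{formal_name} ({sklearn_name})"
```

For example, `describe_score_function("continuous", "categorical")` yields the same string as the corresponding table cell, `ANOVATest (f_classif)`.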
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No, only doc change.
Closes #31531 from viirya/SPARK-34080-minor.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 1fbd5764105e2c09caf4ab57a7095dd794307b02)
Signed-off-by: HyukjinKwon <[email protected]>
---
docs/ml-features.md | 6 +++---
.../main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala | 2 +-
.../org/apache/spark/ml/feature/UnivariateFeatureSelector.scala | 9 ++++++---
python/pyspark/ml/feature.py | 9 ++++++---
4 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 2bb8873..b36b076 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1802,9 +1802,9 @@ User can set `featureType` and `labelType`, and Spark will pick the score functi
~~~
featureType | labelType |score function
------------|------------|--------------
-categorical |categorical | chi2
-continuous |categorical | f_classif
-continuous |continuous | f_regression
+categorical |categorical | chi-squared (chi2)
+continuous |categorical | ANOVATest (f_classif)
+continuous |continuous | F-value (f_regression)
~~~
It supports five selection modes: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
index 198a886..fc6c615 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
@@ -44,7 +44,7 @@ import org.apache.spark.sql.types.StructType
* By default, the selection method is `numTopFeatures`, with the default number of top features
* set to 50.
*/
-@deprecated("use UnivariateFeatureSelector instead", "3.1.0")
+@deprecated("use UnivariateFeatureSelector instead", "3.1.1")
@Since("1.6.0")
final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: String)
extends Selector[ChiSqSelectorModel] {
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/UnivariateFeatureSelector.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/UnivariateFeatureSelector.scala
index bfe1d5f..7fff159 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/UnivariateFeatureSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/UnivariateFeatureSelector.scala
@@ -100,9 +100,12 @@ private[feature] trait UnivariateFeatureSelectorParams extends Params
* The user can set `featureType` and labelType`, and Spark will pick the score function based on
* the specified `featureType` and labelType`.
* The following combination of `featureType` and `labelType` are supported:
- * - `featureType` `categorical` and `labelType` `categorical`: Spark uses chi2.
- * - `featureType` `continuous` and `labelType` `categorical`: Spark uses f_classif.
- * - `featureType` `continuous` and `labelType` `continuous`: Spark uses f_regression.
+ * - `featureType` `categorical` and `labelType` `categorical`: Spark uses chi-squared,
+ * i.e. chi2 in sklearn.
+ * - `featureType` `continuous` and `labelType` `categorical`: Spark uses ANOVATest,
+ * i.e. f_classif in sklearn.
+ * - `featureType` `continuous` and `labelType` `continuous`: Spark uses F-value,
+ * i.e. f_regression in sklearn.
*
* The `UnivariateFeatureSelector` supports different selection modes: `numTopFeatures`,
* `percentile`, `fpr`, `fdr`, `fwe`.
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index f9d22ba..4e8b8b4 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -5821,9 +5821,12 @@ class UnivariateFeatureSelector(JavaEstimator, _UnivariateFeatureSelectorParams,
The following combination of `featureType` and `labelType` are supported:
- - `featureType` `categorical` and `labelType` `categorical`, Spark uses chi2.
- - `featureType` `continuous` and `labelType` `categorical`, Spark uses f_classif.
- - `featureType` `continuous` and `labelType` `continuous`, Spark uses f_regression.
+ - `featureType` `categorical` and `labelType` `categorical`, Spark uses chi-squared,
+ i.e. chi2 in sklearn.
+ - `featureType` `continuous` and `labelType` `categorical`, Spark uses ANOVATest,
+ i.e. f_classif in sklearn.
+ - `featureType` `continuous` and `labelType` `continuous`, Spark uses F-value,
+ i.e. f_regression in sklearn.
The `UnivariateFeatureSelector` supports different selection modes: `numTopFeatures`,
`percentile`, `fpr`, `fdr`, `fwe`.