This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new a5316a425282 [SPARK-55655][MLLIB] Make `CountVectorizer` vocabulary
deterministic when counts are equal
a5316a425282 is described below
commit a5316a42528252b34f6ea3321d826ffd3bbb9b89
Author: yangjie01 <[email protected]>
AuthorDate: Tue Feb 24 07:10:21 2026 -0800
[SPARK-55655][MLLIB] Make `CountVectorizer` vocabulary deterministic when
counts are equal
### What changes were proposed in this pull request?
This PR fixes `CountVectorizer` so that it uses a deterministic ordering when
selecting the top vocabulary terms. Specifically, when two terms have the same
frequency (count), they are now ordered by the term itself (ascending
lexicographic order) as a tie-breaker.
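
As a rough illustration of the tie-breaker, the sketch below applies the same
`Ordering` expression that appears in the diff to a plain Scala `Seq`. The
object name `TieBreakSketch`, the helper `top`, and the sample counts are
hypothetical; the helper only imitates the "largest first" contract of
`RDD.top` locally and is not Spark's implementation.

```scala
// Plain Scala collections stand in for the RDD; no Spark dependency is needed.
object TieBreakSketch {
  // Same expression as in the patch: compare by count first, then by the
  // term under reversed string order, so that "a" ranks above "b" on a tie.
  val ordering: Ordering[(String, Long)] =
    Ordering.Tuple2(Ordering.Long, Ordering.String.reverse)
      .on[(String, Long)] { case (word, count) => (count, word) }

  // Hypothetical local stand-in for top(n)(ord): the n largest elements
  // under `ordering`, returned in descending order.
  def top(items: Seq[(String, Long)], n: Int): Seq[(String, Long)] =
    items.sorted(ordering.reverse).take(n)

  def main(args: Array[String]): Unit = {
    // Made-up counts: three terms tied at 2, one term at 5.
    val counts = Seq("c" -> 2L, "a" -> 2L, "z" -> 5L, "b" -> 2L)
    println(top(counts, 3).map(_._1).mkString(", "))
  }
}
```

With these sample counts the program prints `z, a, b`: the highest count comes
first, and the tied terms follow in ascending lexicographic order because the
string component of the ordering is reversed before the largest elements are
taken.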
### Why are the changes needed?
Currently, `CountVectorizer` uses `wordCounts.top(...)(Ordering.by(_._2))`
to select the vocabulary. This comparison only considers term counts. When
multiple terms have the same count, the resulting order in the vocabulary is
non-deterministic and depends on the RDD partition processing order or the
iteration order of the internal hash maps.
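
The effect of a count-only comparison can be sketched without Spark. In the
example below, `CountOnlyOrderingSketch`, its `top` helper, and the sample data
are hypothetical; a stable local sort is used as a stand-in for the distributed
`top`, so it only models the dependence on arrival order and does not reproduce
Spark's internals.

```scala
object CountOnlyOrderingSketch {
  // The pre-patch comparison: only the count is inspected.
  val byCount: Ordering[(String, Long)] =
    Ordering.by[(String, Long), Long](_._2)

  // Hypothetical stand-in for top(n): stable descending sort, then take(n).
  def top(items: Seq[(String, Long)], n: Int): Seq[String] =
    items.sorted(byCount.reverse).take(n).map(_._1)

  def main(args: Array[String]): Unit = {
    // Two arrival orders of the same tied word counts.
    val oneArrival = Seq("a" -> 2L, "b" -> 2L, "c" -> 2L)
    val otherArrival = oneArrival.reverse

    // Tied terms compare as equal, so nothing pins down their relative order.
    println(byCount.compare("a" -> 2L, "b" -> 2L))
    println(top(oneArrival, 2).mkString(", "))
    println(top(otherArrival, 2).mkString(", "))
  }
}
```

This prints `0`, then `a, b`, then `c, b`: the same multiset of counts yields a
different selection depending on the order in which the elements arrive, which
is the situation the tie-breaker is meant to rule out.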
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass GitHub Actions
- Added a new test case in `CountVectorizerSuite` that intentionally
creates a dataset with tied term counts and asserts a specific, deterministic
vocabulary order.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #54446 from LuciferYang/SPARK-55655.
Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
---
.../org/apache/spark/ml/feature/CountVectorizer.scala | 5 ++++-
.../org/apache/spark/ml/feature/CountVectorizerSuite.scala | 14 ++++++++++++++
2 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala
index 060e445e0254..c7fc4ce6898b 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala
@@ -235,8 +235,11 @@ class CountVectorizer @Since("1.5.0") (@Since("1.5.0") override val uid: String)
val fullVocabSize = wordCounts.count()
+ val ordering = Ordering.Tuple2(Ordering.Long, Ordering.String.reverse)
+ .on[(String, Long)] { case (word, count) => (count, word) }
+
val vocab = wordCounts
- .top(math.min(fullVocabSize, vocSize).toInt)(Ordering.by(_._2))
+ .top(math.min(fullVocabSize, vocSize).toInt)(ordering)
.map(_._1)
if (input.getStorageLevel != StorageLevel.NONE) {
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala
index 431772006c82..ecde4e92a34f 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala
@@ -350,4 +350,18 @@ class CountVectorizerSuite extends MLTest with DefaultReadWriteTest {
assert(features === Vectors.sparse(0, Seq()))
}
}
+
+ test("SPARK-55655: CountVectorizer vocabulary ordering is deterministic for tied counts") {
+ val df = Seq(
+ (0, split("a b c d e")),
+ (1, split("e d c b a"))
+ ).toDF("id", "words")
+
+ val cvModel = new CountVectorizer()
+ .setInputCol("words")
+ .setOutputCol("features")
+ .fit(df)
+
+ assert(cvModel.vocabulary === Array("a", "b", "c", "d", "e"))
+ }
}