(spark) branch master updated: [SPARK-55655][MLLIB] Make `CountVectorizer` vocabulary deterministic when counts are equal

dongjoon Tue, 24 Feb 2026 07:12:30 -0800

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new a5316a425282 [SPARK-55655][MLLIB] Make `CountVectorizer` vocabulary 
deterministic when counts are equal
a5316a425282 is described below

commit a5316a42528252b34f6ea3321d826ffd3bbb9b89
Author: yangjie01 <[email protected]>
AuthorDate: Tue Feb 24 07:10:21 2026 -0800

    [SPARK-55655][MLLIB] Make `CountVectorizer` vocabulary deterministic when 
counts are equal
    
    ### What changes were proposed in this pull request?
    This PR fixes `CountVectorizer` so that it uses a deterministic ordering when 
selecting the top vocabulary terms. Specifically, when two terms have the same 
frequency (count), the tie is now broken by the term itself, in lexicographic 
order.
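
As a rough illustration of the tie-breaker described above, the following standalone sketch applies the same `Ordering` expression as the patch to a plain local `Seq` instead of an RDD. The object name `TieBreakDemo`, the helper `top`, and the sample data are hypothetical and exist only for this example; `top` merely imitates a "largest n, in descending order" selection and is not Spark's implementation.

```scala
// Standalone sketch, no Spark required. The names TieBreakDemo and top,
// and the sample word counts, are assumptions made for illustration only.
object TieBreakDemo {
  // Same expression as in the patch: compare by count first, then by the
  // reversed string ordering, so that among tied counts the
  // lexicographically smaller word ranks as "larger".
  val ordering: Ordering[(String, Long)] =
    Ordering.Tuple2(Ordering.Long, Ordering.String.reverse)
      .on[(String, Long)] { case (word, count) => (count, word) }

  // Local stand-in for a top-n selection: the n largest elements under
  // `ordering`, returned in descending order.
  def top(items: Seq[(String, Long)], n: Int): Seq[(String, Long)] =
    items.sorted(ordering.reverse).take(n)

  def main(args: Array[String]): Unit = {
    val wordCounts = Seq(("e", 2L), ("c", 2L), ("a", 2L), ("z", 3L), ("b", 2L))
    // "z" has the highest count; the tied terms follow in alphabetical order.
    println(top(wordCounts, 4).map(_._1).mkString(", ")) // prints "z, a, b, c"
  }
}
```

Reversing only the string component is what keeps counts descending while tied terms come out in ascending alphabetical order.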
    
    ### Why are the changes needed?
    Currently, `CountVectorizer` uses `wordCounts.top(...)(Ordering.by(_._2))` 
to select the vocabulary. This comparison only considers term counts. When 
multiple terms have the same count, the resulting order in the vocabulary is 
non-deterministic and depends on the RDD partition processing order or the 
iteration order of the internal hash maps.
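
The effect of a count-only comparison can be sketched locally, without Spark. In the example below, a stable sort of a plain `Seq` stands in for the distributed top-N selection, and the two input sequences are a hypothetical model of the same terms arriving in different orders. The names `CountOnlyDemo`, `vocab`, `arrival1`, and `arrival2` are assumptions made for this sketch; it is a simplified analogy, not a reproduction of Spark's internals.

```scala
// Standalone sketch, no Spark required. All names and data here are
// hypothetical and only model the behaviour described above.
object CountOnlyDemo {
  // The pre-patch comparison: only the count is considered, so terms with
  // equal counts compare as equal.
  val byCountOnly: Ordering[(String, Long)] =
    Ordering.by[(String, Long), Long](_._2)

  // Descending sort by count. Seq.sorted is stable, so tied terms keep
  // whatever relative order they had in the input.
  def vocab(items: Seq[(String, Long)]): Seq[String] =
    items.sorted(byCountOnly.reverse).map(_._1)

  def main(args: Array[String]): Unit = {
    val arrival1 = Seq(("a", 2L), ("b", 2L), ("c", 2L))
    val arrival2 = Seq(("c", 2L), ("a", 2L), ("b", 2L))
    println(vocab(arrival1).mkString(", ")) // prints "a, b, c"
    println(vocab(arrival2).mkString(", ")) // prints "c, a, b"
  }
}
```

Both inputs contain the same terms and counts, yet the resulting order differs, because the comparison gives no basis for ordering tied terms.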
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    - Pass GitHub Actions
    - Added a new test case in `CountVectorizerSuite` that intentionally 
creates a dataset with tied term counts and asserts a specific, deterministic 
vocabulary order.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes #54446 from LuciferYang/SPARK-55655.
    
    Authored-by: yangjie01 <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
---
 .../org/apache/spark/ml/feature/CountVectorizer.scala      |  5 ++++-
 .../org/apache/spark/ml/feature/CountVectorizerSuite.scala | 14 ++++++++++++++
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala
index 060e445e0254..c7fc4ce6898b 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala
@@ -235,8 +235,11 @@ class CountVectorizer @Since("1.5.0") (@Since("1.5.0") override val uid: String)
 
     val fullVocabSize = wordCounts.count()
 
+    val ordering = Ordering.Tuple2(Ordering.Long, Ordering.String.reverse)
+      .on[(String, Long)] { case (word, count) => (count, word) }
+
     val vocab = wordCounts
-      .top(math.min(fullVocabSize, vocSize).toInt)(Ordering.by(_._2))
+      .top(math.min(fullVocabSize, vocSize).toInt)(ordering)
       .map(_._1)
 
     if (input.getStorageLevel != StorageLevel.NONE) {
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala
index 431772006c82..ecde4e92a34f 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala
@@ -350,4 +350,18 @@ class CountVectorizerSuite extends MLTest with DefaultReadWriteTest {
         assert(features === Vectors.sparse(0, Seq()))
     }
   }
+
+  test("SPARK-55655: CountVectorizer vocabulary ordering is deterministic for tied counts") {
+    val df = Seq(
+      (0, split("a b c d e")),
+      (1, split("e d c b a"))
+    ).toDF("id", "words")
+
+    val cvModel = new CountVectorizer()
+      .setInputCol("words")
+      .setOutputCol("features")
+      .fit(df)
+
+    assert(cvModel.vocabulary === Array("a", "b", "c", "d", "e"))
+  }
 }


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
