Repository: spark
Updated Branches:
  refs/heads/master 3c8fa5059 -> eef779b8d


[SPARK-2842][MLlib]Word2Vec documentation

mengxr
Documentation for Word2Vec

Author: Liquan Pei <[email protected]>

Closes #2003 from Ishiihara/Word2Vec-doc and squashes the following commits:

4ff11d4 [Liquan Pei] minor fix
8d7458f [Liquan Pei] code reformat
6df0dcb [Liquan Pei] add Word2Vec documentation


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eef779b8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eef779b8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eef779b8

Branch: refs/heads/master
Commit: eef779b8d631de971d440051cae21040f4de558f
Parents: 3c8fa50
Author: Liquan Pei <[email protected]>
Authored: Sun Aug 17 23:30:47 2014 -0700
Committer: Xiangrui Meng <[email protected]>
Committed: Sun Aug 17 23:30:47 2014 -0700

----------------------------------------------------------------------
 docs/mllib-feature-extraction.md | 63 ++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/eef779b8/docs/mllib-feature-extraction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 21453cb..4b3cb71 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -9,4 +9,65 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Feature Extraction
 
 ## Word2Vec 
 
-## TFIDF
+Word2Vec computes distributed vector representations of words. The main advantage of distributed
+representations is that similar words are close in the vector space, which makes generalization to
+novel patterns easier and model estimation more robust. Distributed vector representations have
+been shown to be useful in many natural language processing applications such as named entity
+recognition, disambiguation, parsing, tagging, and machine translation.
+
+### Model
+
+In our implementation of Word2Vec, we use the skip-gram model. The training objective of skip-gram
+is to learn word vector representations that are good at predicting a word's context in the same
+sentence.
+Mathematically, given a sequence of training words `$w_1, w_2, \dots, w_T$`, the objective of the
+skip-gram model is to maximize the average log-likelihood 
+`\[
+\frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t)
+\]`
+where $k$ is the size of the training window.  
+
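+For concreteness, the following toy snippet (not MLlib code) enumerates the $(w_t, w_{t+j})$ pairs
+that the sum above ranges over, for a tiny sentence and window size $k = 2$:
+
+{% highlight scala %}
+// Toy illustration of the skip-gram context window; not part of MLlib.
+val sentence = Seq("the", "quick", "brown", "fox", "jumps")
+val k = 2
+val pairs = for {
+  (word, t) <- sentence.zipWithIndex
+  j <- -k to k
+  if j != 0 && t + j >= 0 && t + j < sentence.length
+} yield (word, sentence(t + j))
+// e.g. ("brown", "the"), ("brown", "quick"), ("brown", "fox"), ("brown", "jumps")
+{% endhighlight %}
+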
+In the skip-gram model, every word $w$ is associated with two vectors $u_w$ and $v_w$, which are
+vector representations of $w$ as word and context respectively. The probability of correctly
+predicting word $w_i$ given word $w_j$ is determined by the softmax model, which is
+`\[
+p(w_i | w_j) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})}
+\]`
+where $V$ is the vocabulary size. 
+
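+As a minimal sketch (not MLlib code) of this formula, assuming hypothetical arrays `u` and `v`
+holding the word and context vectors indexed by vocabulary id:
+
+{% highlight scala %}
+def dot(a: Array[Double], b: Array[Double]): Double =
+  (a zip b).map { case (x, y) => x * y }.sum
+
+// p(w_i | w_j) under the softmax model above.
+def softmaxProb(u: Array[Array[Double]], v: Array[Array[Double]],
+                i: Int, j: Int): Double = {
+  val denom = u.map(uL => math.exp(dot(uL, v(j)))).sum  // sum over the vocabulary
+  math.exp(dot(u(i), v(j))) / denom
+}
+{% endhighlight %}
+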
+The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$
+is proportional to $V$, which can easily be in the order of millions. To speed up training of
+Word2Vec, we use hierarchical softmax, which reduces the complexity of computing
+$\log p(w_i | w_j)$ to $O(\log(V))$.
+
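+For intuition, hierarchical softmax arranges the vocabulary as the leaves of a binary tree and
+factorizes the probability over the $O(\log(V))$ inner nodes $n$ on the path from the root to
+$w_i$; schematically (following the standard formulation, with notation that is illustrative
+rather than taken from the implementation),
+`\[
+p(w_i | w_j) = \prod_{n \in \mathrm{path}(w_i)} \sigma\left(s(n) \, u_n^{\top} v_{w_j}\right)
+\]`
+where $\sigma(x) = 1 / (1 + e^{-x})$, $u_n$ is the vector of inner node $n$, and $s(n) = \pm 1$
+indicates whether the path branches left or right at $n$.
+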
+### Example 
+
+The example below demonstrates how to load a text file, parse it as an RDD of `Seq[String]`,
+construct a `Word2Vec` instance, and then fit a `Word2VecModel` with the input data. Finally,
+we display the top 40 synonyms of a specified word. To run the example, first download
+the [text8](http://mattmahoney.net/dc/text8.zip) data and extract it to your preferred directory.
+Here we assume that the extracted file is `text8` and that it is in the same directory as where
+you run the Spark shell.
+
+<div class="codetabs">
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark._
+import org.apache.spark.rdd._
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.feature.Word2Vec
+
+val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
+
+val word2vec = new Word2Vec()
+
+val model = word2vec.fit(input)
+
+val synonyms = model.findSynonyms("china", 40)
+
+for ((synonym, cosineSimilarity) <- synonyms) {
+  println(s"$synonym $cosineSimilarity")
+}
+{% endhighlight %}
+</div>
+</div>
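+
+The default parameters of `Word2Vec` can be adjusted before fitting. As an illustrative sketch
+(assuming the setter methods below are available on `Word2Vec`):
+
+{% highlight scala %}
+val configuredWord2vec = new Word2Vec()
+  .setVectorSize(200)    // dimensionality of the word vectors
+  .setNumIterations(3)   // number of training iterations over the data
+  .setSeed(42L)          // fix the seed for reproducible vectors
+
+val configuredModel = configuredWord2vec.fit(input)
+{% endhighlight %}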
+
+## TFIDF
\ No newline at end of file

