Word2vec: A Simple Way to Learn Word Embeddings

Anindya Naskar
6 min read · Aug 24, 2023


Word embeddings are numerical representations of words that capture their meaning, usage, and context. They are useful for many natural language processing tasks, such as text classification, sentiment analysis, machine translation, and question answering. But how can we learn word embeddings from a large corpus of text? One popular technique is word2vec, which was introduced by Mikolov et al. in 2013.

What is word2vec?

Word2vec is not a single algorithm, but a family of models that use a shallow neural network to learn word embeddings from a large corpus of text. The basic idea is to train the network to predict words based on their surrounding context, or vice versa. There are two main variants of word2vec:

  • Skip-gram model: This model predicts the context words given a target word. For example, given the word “road”, the model might predict “wide”, “shimmered”, and “sun” as its context words.
  • Continuous bag-of-words (CBOW) model: This model predicts the target word given its context words. For example, given the words “wide”, “shimmered”, and “sun”, the model might predict “road” as the target word.

The skip-gram model tends to represent rare words better and usually yields stronger embeddings overall, while the CBOW model trains faster and represents frequent words slightly better.
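To make the difference concrete, here is a small sketch (plain Python; the toy sentence and the window size of 2 are arbitrary choices of mine) that builds the training pairs each variant would see:

```python
# Toy illustration: build training pairs for skip-gram and CBOW
# from one sentence with a context window of 2 (an arbitrary choice).

sentence = "the wide road shimmered in the hot sun".split()
window = 2

skipgram_pairs = []   # (target word, one context word) pairs
cbow_pairs = []       # (list of context words, target word) pairs

for i, target in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    # Skip-gram: the target word predicts each context word separately.
    skipgram_pairs.extend((target, c) for c in context)
    # CBOW: all the context words together predict the target word.
    cbow_pairs.append((context, target))

print(skipgram_pairs[:4])  # [('the', 'wide'), ('the', 'road'), ('wide', 'the'), ...]
print(cbow_pairs[2])       # (['the', 'wide', 'shimmered', 'in'], 'road')
```

Either way, the same sliding window produces the examples; the two variants only differ in which side of each pair is the input and which is the prediction target.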


How does word2vec work?

The word2vec models use a simple neural network with one hidden layer and a softmax output layer. The input layer consists of one-hot encoded vectors that represent the target word (for skip-gram) or the context words (for CBOW). The hidden layer consists of a weight matrix that maps the input vectors to lower-dimensional vectors, which are the word embeddings. The output layer consists of another weight matrix that maps the hidden vectors to one-hot encoded vectors that represent the predicted words.

The network is trained by minimizing the cross-entropy loss between the predicted words and the actual words, using stochastic gradient descent or another optimization method. For the skip-gram model, the loss function can be written as:

J = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \neq 0} \log p(w_{t+j} \mid w_t)

where T is the number of words in the corpus, c is the window size that defines the context, w_t is the target word at position t, and p(w_{t+j} | w_t) is the probability of predicting word w_{t+j} given word w_t, which is computed by the softmax function:

p(w_{t+j} \mid w_t) = \frac{\exp\left({v'_{w_{t+j}}}^{\top} v_{w_t}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_t}\right)}

where W is the size of the vocabulary, v_w is the input vector (or embedding) of word w, and v'_w is the output vector of word w.
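The following sketch traces that computation for a single (target, context) pair with NumPy; the vocabulary size, embedding dimension, and word indices are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 10, 4              # toy vocabulary size and embedding dimension
W_in = rng.normal(scale=0.1, size=(V, D))   # input vectors v_w (the embeddings)
W_out = rng.normal(scale=0.1, size=(V, D))  # output vectors v'_w

target, context = 3, 7    # arbitrary word indices for w_t and w_{t+j}

# Hidden layer: multiplying a one-hot vector by W_in is just a row lookup.
h = W_in[target]                       # shape (D,)

# Output layer: dot product with every output vector, then softmax.
scores = W_out @ h                     # shape (V,)
probs = np.exp(scores) / np.exp(scores).sum()

# Cross-entropy loss for this pair: -log p(context | target)
loss = -np.log(probs[context])
print(probs[context], loss)
```

Note that the one-hot input never has to be materialized: selecting a row of the input weight matrix does the same job, which is why the hidden layer is effectively an embedding lookup table.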

What are some challenges and solutions for word2vec?

One challenge for word2vec is that it can be computationally expensive to train on large corpora and vocabularies. The softmax function requires calculating the dot product between the input vector and every output vector in the vocabulary, which can be slow and memory-intensive.

One solution for this problem is to use negative sampling, which simplifies the softmax by only considering a few negative examples (words that are not in the context) instead of all possible words. The idea is to train the network to distinguish the positive examples (words that are in the context) from negative examples randomly sampled from a noise distribution. For a single target word w_t and context word w_{t+j}, the negative-sampling loss can be written as:

J_{\text{neg}} = -\log \sigma\left({v'_{w_{t+j}}}^{\top} v_{w_t}\right) - \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_t}\right)\right]

where σ(x) = 1/(1 + e^{-x}) is the sigmoid function, k is the number of negative samples, and P_n(w) is the noise distribution, usually chosen as the unigram distribution raised to the power of 3/4.
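Here is a rough NumPy sketch of that objective for a single (target, context) pair; the counts, the number of negative samples k, and the word indices are placeholders, not values from any real corpus:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
V, D, k = 10, 4, 5
W_in = rng.normal(scale=0.1, size=(V, D))
W_out = rng.normal(scale=0.1, size=(V, D))

target, context = 3, 7

# Noise distribution: unigram counts raised to the 3/4 power, then normalized.
counts = rng.integers(1, 100, size=V).astype(float)   # fake corpus counts
P_n = counts ** 0.75
P_n /= P_n.sum()
negatives = rng.choice(V, size=k, p=P_n)               # k negative samples

h = W_in[target]

# Negative sampling: push the true context word's score up
# and the sampled noise words' scores down.
pos_term = np.log(sigmoid(W_out[context] @ h))
neg_term = np.sum(np.log(sigmoid(-W_out[negatives] @ h)))
loss = -(pos_term + neg_term)   # minimized during training
print(loss)
```

Instead of a softmax over the whole vocabulary, each training step now touches only k + 1 output vectors, which is what makes training on large corpora tractable.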

Another challenge for word2vec is that it learns exactly one vector per word form. As a result, it has no embedding at all for rare or out-of-vocabulary words, and it mixes the different senses of ambiguous words such as "bank" or "crane" into a single vector.

One way to address the first of these problems is to use subword information, which splits words into smaller units such as characters, character n-grams, or morphemes. This helps the model build embeddings for rare or unknown words from their parts, and lets related word forms share information. For example, the word "airplane" can be split into "air" and "plane", which can help the model learn that it is related to both "sky" and "vehicle". One popular method that uses subword information is fastText, which extends the skip-gram model by representing each word as the sum of its subword (character n-gram) vectors. Subword units do not, by themselves, separate the different senses of a word, but they greatly improve coverage of rare and unseen words.
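To show what "subword units" look like in the fastText sense, the sketch below extracts a word's character n-grams; the 3-to-6-gram range and the angle-bracket boundary markers follow the fastText paper, while everything else is simplified:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams fastText would use for `word`,
    with '<' and '>' marking the word boundaries."""
    padded = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    grams.add(padded)          # the full word itself is also kept
    return grams

print(sorted(char_ngrams("airplane")))
# The word's embedding is the sum of the vectors of these n-grams,
# so "airplane" shares subwords (and hence information) with "air" and "plane".
```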


How can we use word2vec in practice?

To use word2vec in practice, we need to choose a corpus of text, a model architecture (skip-gram or CBOW), a window size, a vector dimension, an optimization method, and other hyperparameters. Then, we need to train the model on the corpus using a framework such as TensorFlow, PyTorch, or Gensim. After training, we can obtain the word embeddings from the hidden layer of the network and use them for various downstream tasks.
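As one minimal example, here is how training might look with Gensim (version 4.x parameter names); the toy corpus and every hyperparameter value below are arbitrary choices for illustration, not recommendations:

```python
from gensim.models import Word2Vec

# A toy corpus: in practice this would be millions of tokenized sentences.
sentences = [
    ["the", "wide", "road", "shimmered", "in", "the", "hot", "sun"],
    ["the", "king", "spoke", "to", "the", "queen"],
    ["a", "man", "and", "a", "woman", "walked", "along", "the", "road"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimension of the embeddings
    window=5,          # context window size c
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # number of negative samples k
    min_count=1,       # keep every word (only sensible for a toy corpus)
    epochs=50,
    seed=42,
)

vector = model.wv["road"]     # the learned embedding for "road"
print(vector.shape)           # (100,)
```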

One way to evaluate the quality of the word embeddings is to use intrinsic methods, which measure how well the embeddings capture linguistic properties such as similarity, analogy, or relatedness. For example, we can use cosine similarity to measure how close two word vectors are in the vector space, or we can use analogy tests to measure how well the embeddings can answer questions such as “king is to queen as man is to ?”. There are many datasets and benchmarks for intrinsic evaluation, such as WordSim-353, SimLex-999, or Google’s analogy test set.
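Continuing from the Gensim sketch above, both checks are one-liners; on a toy corpus the numbers are of course meaningless, and the word choices are just placeholders:

```python
# Cosine similarity between two word vectors.
print(model.wv.similarity("road", "sun"))

# Analogy: king - man + woman ≈ ?  (the classic "queen" test)
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```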

Another way to evaluate the quality of the word embeddings is to use extrinsic methods, which measure how well the embeddings improve the performance of downstream tasks that use them as features or inputs. For example, we can use word embeddings for text classification, sentiment analysis, machine translation, or question answering. There are many datasets and benchmarks for extrinsic evaluation, such as GLUE, SQuAD, or WMT.
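One simple way to use the embeddings extrinsically is to average a document's word vectors and feed the result to an off-the-shelf classifier. The sketch below (scikit-learn, with invented documents and labels, reusing the model from the earlier snippet) illustrates the pipeline rather than a real benchmark:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def doc_vector(tokens, wv):
    """Average the embeddings of the tokens the model knows about."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

docs = [["the", "hot", "sun"], ["the", "king", "spoke"],
        ["a", "wide", "road"], ["the", "queen", "spoke"]]
labels = [0, 1, 0, 1]   # invented labels, e.g. topic A vs. topic B

X = np.vstack([doc_vector(d, model.wv) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```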

Conclusion

Word2vec is a simple and powerful technique to learn word embeddings from large corpora of text. It uses a shallow neural network to predict words based on their context, or vice versa. It has two main variants: skip-gram and CBOW. It also has some challenges and solutions, such as negative sampling and subword information. It can be used for various natural language processing tasks and evaluated by intrinsic and extrinsic methods.


I hope you enjoyed this article about word2vec. If you have any questions or feedback, please let me know in the comments below. Thank you for reading! 😊
