Negative Sampling vs Hierarchical Softmax

Flavien Vidal
6 min read · Nov 2, 2021


In this article, we explore and compare the two approaches presented by Mikolov et al. in their paper: “Distributed Representations of Words and Phrases and their Compositionality”.

This article is, strictly speaking, a follow-up to my previous one: it gives a more detailed explanation of the methodologies used to generate word embeddings. In particular, we explore and compare the two approaches presented by Mikolov et al. in their paper “Distributed Representations of Words and Phrases and their Compositionality”. When generating word embeddings, the last activation layer is the softmax function. It produces a multinomial probability distribution, and the output vector we obtain from it serves as our vector representation of the word (a context word or an output word, depending on the model we are working with). Why do we need new techniques to produce our word vectors? What benefits do they bring? How do they perform, and how do they differ? Let’s start the story; I hope it will become clear along the way.

Negative Sampling

In their paper, Mikolov et al. present the negative sampling approach. While negative sampling is based on the skip-gram model, it in fact optimizes a different objective. Consider a pair (w, c) of a word and a context. Did this pair come from the training data? Let’s denote by P(D = 1 | w, c) the probability that (w, c) came from the corpus data. Correspondingly, P(D = 0 | w, c) will be the probability that (w, c) did not come from the corpus data. First, let’s model P(D = 1 | w, c) with the sigmoid function:
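Writing v_w for the input vector of the word and u_c for the output vector of the context (the notation I will use throughout), this model reads:

P(D = 1 \mid w, c; \theta) = \sigma(u_c^\top v_w) = \frac{1}{1 + e^{-u_c^\top v_w}}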

Now, we build a new objective function that tries to maximize the probability of a word and context being in the corpus data if it indeed is, and to maximize the probability of a word and context not being in the corpus data if it indeed is not. We take a simple maximum-likelihood approach over these two probabilities, where θ denotes the parameters of the model (in our case, the input and output embedding matrices V and U):
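With the same notation, and using P(D = 0 | w, c; θ) = 1 − P(D = 1 | w, c; θ), the objective can be written as:

\theta = \arg\max_{\theta} \prod_{(w,c) \in D} P(D = 1 \mid w, c; \theta) \prod_{(w,c) \in \tilde{D}} P(D = 0 \mid w, c; \theta)
       = \arg\max_{\theta} \sum_{(w,c) \in D} \log \sigma(u_c^\top v_w) + \sum_{(w,c) \in \tilde{D}} \log \sigma(-u_c^\top v_w)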

Note that D̃ is a “false” or “negative” corpus, containing unnatural sentences like “stock boil fish is toy” that should have a low probability of ever occurring. We can generate D̃ on the fly by randomly sampling these negatives from the word bank.
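In the Mikolov et al. paper, the negatives are drawn from the unigram distribution raised to the 3/4 power. Here is a minimal numpy sketch of such a sampler (the function names and the tiny toy vocabulary are mine, purely for illustration):

```python
import numpy as np

def build_noise_distribution(word_counts, power=0.75):
    """Unigram distribution raised to a power (3/4 in Mikolov et al.)."""
    counts = np.array(list(word_counts.values()), dtype=np.float64)
    probs = counts ** power
    return probs / probs.sum()

def sample_negatives(vocab, noise_probs, k=5, rng=np.random.default_rng()):
    """Draw k negative (noise) words for one (word, context) pair."""
    return list(rng.choice(vocab, size=k, p=noise_probs))

word_counts = {"stock": 40, "boil": 3, "fish": 12, "is": 120, "toy": 7}
vocab = list(word_counts.keys())
noise_probs = build_noise_distribution(word_counts)
print(sample_negatives(vocab, noise_probs))  # five noise words; frequent words appear more often
```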

For skip-gram, our new objective function for observing the context word w_{c−m+j} given the center word w_c would be:
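With K negative samples \tilde{w}_k (output vectors \tilde{u}_k) drawn from a noise distribution P_n(w), this objective, written as a loss to be minimized, is:

-\log \sigma(u_{c-m+j}^\top v_c) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^\top v_c)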

Hierarchical Softmax

Mikolov et al. also present hierarchical softmax as a much more efficient alternative to the normal softmax. In practice, hierarchical softmax tends to be better for infrequent words, while negative sampling works better for frequent words and lower dimensional vectors.

Hierarchical softmax uses a binary tree to represent all words in the vocabulary. Each leaf of the tree is a word, and there is a unique path from the root to each leaf. In this model, there is no output representation for words. Instead, each node of the graph (except the root and the leaves) is associated with a vector that the model is going to learn.

In this model, the probability of a word w given an input word w_i, P(w | w_i), is equal to the probability of a random walk starting at the root and ending at the leaf node corresponding to w. The main advantage of computing the probability this way is that the cost is only O(log |V|), corresponding to the length of the path.

Let’s introduce some notation. Let L(w) be the number of nodes in the path from the root to the leaf w. For instance, L(w_2) in the figure below is 3. Let’s write n(w, i) for the i-th node on this path, with associated vector v_{n(w,i)}. So n(w, 1) is the root, while n(w, L(w)) is the father of w. Now for each inner node n, we arbitrarily choose one of its children and call it ch(n) (e.g. always the left node). Then, denoting by σ(·) the sigmoid function, we can compute the probability as:
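Here [x] stands for 1 if x is true and −1 otherwise, so the product runs over the L(w) − 1 edges of the path:

P(w \mid w_i) = \prod_{j=1}^{L(w)-1} \sigma\big( [n(w, j+1) = \mathrm{ch}(n(w, j))] \cdot v_{n(w,j)}^\top v_{w_i} \big)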

This formula is fairly dense, so let’s examine it more closely. First, we compute a product of terms based on the shape of the path from the root n(w, 1) to the leaf w. If we assume ch(n) is always the left child of n, then the term [n(w, j + 1) = ch(n(w, j))] returns 1 when the path goes left and −1 when it goes right. Furthermore, this same term provides normalization: at a node n, if we sum the probabilities of going to the left and to the right child, you can check that for any value of v_n^T v_{w_i}:
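Namely, the two branch probabilities always sum to one:

\sigma(v_n^\top v_{w_i}) + \sigma(-v_n^\top v_{w_i}) = 1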

The normalization also ensures that ∑_{w=1}^{|V|} P(w | w_i) = 1, just as in the original softmax. Finally, we compare the similarity of our input vector v_{w_i} to each inner node vector v_{n(w,j)} using a dot product. Let’s run through an example. Taking w_2 in the figure below, we must take two left edges and then a right edge to reach w_2 from the root:
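With ch(n) taken to be the left child, that path translates into:

P(w_2 \mid w_i) = p(n(w_2, 1), \text{left}) \cdot p(n(w_2, 2), \text{left}) \cdot p(n(w_2, 3), \text{right})
= \sigma(v_{n(w_2,1)}^\top v_{w_i}) \cdot \sigma(v_{n(w_2,2)}^\top v_{w_i}) \cdot \sigma(-v_{n(w_2,3)}^\top v_{w_i})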

To train the model, our goal is still to minimize the negative log-likelihood −log P(w | w_i). But instead of updating one output vector per word, we update the vectors of the nodes in the binary tree that lie on the path from the root to the leaf node.
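Concretely, for a single training pair this loss is just the negative log of the product above, so only the L(w) − 1 node vectors on the path (and the input vector v_{w_i}) receive gradient updates:

J = -\log P(w \mid w_i) = -\sum_{j=1}^{L(w)-1} \log \sigma\big( [n(w, j+1) = \mathrm{ch}(n(w, j))] \cdot v_{n(w,j)}^\top v_{w_i} \big)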

The speed of this method is determined by the way in which the binary tree is constructed and words are assigned to leaf nodes. Mikolov et al. use a binary Huffman tree, which assigns frequent words shorter paths in the tree.
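To get a feel for this, here is a small Python sketch that computes Huffman code lengths (i.e. path lengths from root to leaf) for a toy vocabulary; the function name and word counts are made up for illustration, and libraries like gensim build their own Huffman tree internally:

```python
import heapq
from itertools import count

def huffman_code_lengths(word_counts):
    """Build a binary Huffman tree over the vocabulary and return the
    path length from root to leaf for each word."""
    tiebreak = count()  # avoids comparing dict nodes when frequencies are equal
    heap = [(freq, next(tiebreak), {"word": w}) for w, freq in word_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), {"left": left, "right": right}))  # ...get merged
    lengths = {}
    def walk(node, depth=0):
        if "word" in node:
            lengths[node["word"]] = depth
        else:
            walk(node["left"], depth + 1)
            walk(node["right"], depth + 1)
    walk(heap[0][2])
    return lengths

print(huffman_code_lengths({"the": 500, "of": 300, "fish": 20, "boil": 5}))
# frequent words ("the", "of") get shorter paths than rare ones ("fish", "boil")
```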

Conclusions

There are many more detailed posts on the Internet devoted to other softmax variants, including differentiated softmax, CNN-softmax, target sampling, and so on. I have tried to focus on simple explanations of the complicated formulas and on the advantages that make these two algorithms the most popular in this area. As you will have noticed, the methodologies above are described more from the point of view of their mathematical formulation and the machine-learning approach than from the point of view of practical implementation (these algorithms are already implemented in standard libraries such as gensim and TensorFlow).
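As a pointer for practice, here is a minimal gensim (4.x API) sketch showing how to switch between the two schemes; the toy sentences are of course mine:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "stock", "price", "rose", "today"],
    ["boil", "the", "fish", "before", "serving"],
]

# Skip-gram with negative sampling (5 noise words per positive pair)
sg_neg = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                  sg=1, negative=5, hs=0)

# Skip-gram with hierarchical softmax (negative=0 disables negative sampling)
sg_hs = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                 sg=1, negative=0, hs=1)

print(sg_neg.wv["fish"][:5], sg_hs.wv["fish"][:5])
```

Setting hs=1 with negative=0 trains with hierarchical softmax, while hs=0 with negative > 0 uses negative sampling, mirroring the two approaches discussed above.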
