CS 224n Assignment #2: word2vec (44 Points)

1 Written: Understanding word2vec (26 points)

Let’s have a quick refresher on the word2vec algorithm. The key insight behind word2vec is that ‘a word

is known by the company it keeps’. Concretely, suppose we have a ‘center’ word c and a contextual window

surrounding c. We shall refer to words that lie in this contextual window as ‘outside words’. For example,

in Figure 1 we see that the center word c is ‘banking’. Since the context window size is 2, the outside words

are ‘turning’, ‘into’, ‘crises’, and ‘as’.

The goal of the skip-gram word2vec algorithm is to accurately learn the probability distribution P(O|C).

Given a specific word o and a specific word c, we want to calculate P(O = o|C = c), which is the probability

that word o is an ‘outside’ word for c, i.e., the probability that o falls within the contextual window of c.

Figure 1: The word2vec skip-gram prediction model with window size 2

In word2vec, the conditional probability distribution is given by taking vector dot-products and applying

the softmax function:

P(O = o | C = c) = exp(u

>

o vc)

P

w∈Vocab exp(u>

w vc)

(1)

Here, uo is the ‘outside’ vector representing outside word o, and vc is the ‘center’ vector representing center

word c. To contain these parameters, we have two matrices, U and V . The columns of U are all the ‘outside’

vectors uw. The columns of V are all of the ‘center’ vectors vw. Both U and V contain a vector for every

w ∈ Vocabulary.1

Recall from lectures that, for a single pair of words c and o, the loss is given by:

Jnaive-softmax(vc, o, U) = − log P(O = o|C = c). (2)

We can view this loss as the cross-entropy2 between the true distribution y and the predicted distribution yˆ.

Here, both y and yˆ are vectors with length equal to the number of words in the vocabulary. Furthermore,

the k

th entry in these vectors indicates the conditional probability of the k

th word being an ‘outside word’

for the given c. The true empirical distribution y is a one-hot vector with a 1 for the true outside word o,

and 0 everywhere else. The predicted distribution yˆ is the probability distribution P(O|C = c) given by our

model in equation (1).

1Assume that every word in our vocabulary is matched to an integer number k. Bolded lowercase letters represent vectors.

uk is both the k

th column of U and the ‘outside’ word vector for the word indexed by k. vk is both the k

th column of V and

the ‘center’ word vector for the word indexed by k. In order to simplify notation we shall interchangeably use k to

refer to the word and the index-of-the-word.

2The Cross Entropy Loss between the true (discrete) probability distribution p and another distribution q is −

P

i

pi log(qi).

1

CS 224n Assignment #2: word2vec (44 Points)

(a) (3 points) Show that the naive-softmax loss given in Equation (2) is the same as the cross-entropy loss

between y and yˆ; i.e., show that

−

X

w∈V ocab

yw log(ˆyw) = − log(ˆyo). (3)

Your answer should be one line.

(b) (5 points) Compute the partial derivative of Jnaive-softmax(vc, o, U) with respect to vc. Please write

your answer in terms of y, yˆ, and U. Note that in this course, we expect your final answers to follow

the shape convention.3 This means that the partial derivative of any function f(x) with respect to x

should have the same shape as x. For this subpart, please present your answer in vectorized form. In

particular, you may not refer to specific elements of y, yˆ, and U in your final answer (such as y1, y2,

. . .).

(c) (5 points) Compute the partial derivatives of Jnaive-softmax(vc, o, U) with respect to each of the ‘outside’

word vectors, uw’s. There will be two cases: when w = o, the true ‘outside’ word vector, and w 6= o,

for all other words. Please write your answer in terms of y, yˆ, and vc. In this subpart, you may use

specific elements within these terms as well, such as (y1, y2, . . .).

(d) (1 point) Compute the partial derivative of Jnaive-softmax(vc, o, U) with respect to U. Please write your

answer in terms of ∂J(vc,o,U)

∂u1

,

∂J(vc,o,U)

∂u2

, . . . ,

∂J(vc,o,U)

∂u|V ocab|

. The solution should be one or two lines long.

(e) (3 Points) The sigmoid function is given by Equation 4:

σ(x) = 1

1 + e−x

=

e

x

e

x + 1

(4)

Please compute the derivative of σ(x) with respect to x, where x is a scalar. Hint: you may want to

write your answer in terms of σ(x).

(f) (4 points) Now we shall consider the Negative Sampling loss, which is an alternative to the Naive

Softmax loss. Assume that K negative samples (words) are drawn from the vocabulary. For simplicity

of notation we shall refer to them as w1, w2, . . . , wK and their outside vectors as u1, . . . ,uK. For this

question, assume that the K negative samples are distinct. In other words, i 6= j implies wi 6= wj

for i, j ∈ {1, . . . , K}. Note that o /∈ {w1, . . . , wK}. For a center word c and an outside word o, the

negative sampling loss function is given by:

Jneg-sample(vc, o, U) = − log(σ(u

>

o vc)) −

X

K

k=1

log(σ(−u

>

k vc)) (5)

for a sample w1, . . . wK, where σ(·) is the sigmoid function.4

Please repeat parts (b) and (c), computing the partial derivatives of Jneg-sample with respect to vc,

with respect to uo, and with respect to a negative sample uk. Please write your answers in terms of

the vectors uo, vc, and uk, where k ∈ [1, K]. After you’ve done this, describe with one sentence why

this loss function is much more efficient to compute than the naive-softmax loss. Note, you should be

able to use your solution to part (e) to help compute the necessary gradients here.

3This allows us to efficiently minimize a function using gradient descent without worrying about reshaping or dimension

mismatching. While following the shape convention, we’re guaranteed that θ := θ − α

∂J(θ)

∂θ is a well-defined update rule.

4Note: the loss function here is the negative of what Mikolov et al. had in their original paper, because we are doing a

minimization instead of maximization in our assignment code. Ultimately, this is the same objective function.

Page 2 of 4

CS 224n Assignment #2: word2vec (44 Points)

(g) (2 point) Now we will repeat the previous exercise, but without the assumption that the K sampled

words are distinct. Assume that K negative samples (words) are drawn from the vocabulary. For

simplicity of notation we shall refer to them as w1, w2, . . . , wK and their outside vectors as u1, . . . ,uK.

In this question, you may not assume that the words are distinct. In other words, wi = wj may be

true when i 6= j is true. Note that o /∈ {w1, . . . , wK}. For a center word c and an outside word o, the

negative sampling loss function is given by:

Jneg-sample(vc, o, U) = − log(σ(u

>

o vc)) −

X

K

k=1

log(σ(−u

>

k vc)) (6)

for a sample w1, . . . wK, where σ(·) is the sigmoid function.

Compute the partial derivative of Jneg-sample with respect to a negative sample uk. Please write your

answers in terms of the vectors vc and uk, where k ∈ [1, K]. Hint: break up the sum in the loss

function into two sums: a sum over all sampled words equal to uk and a sum over all sampled words

not equal to uk.

(h) (3 points) Suppose the center word is c = wt and the context window is [wt−m, . . ., wt−1, wt, wt+1,

. . ., wt+m], where m is the context window size. Recall that for the skip-gram version of word2vec,

the total loss for the context window is:

Jskip-gram(vc, wt−m, . . . wt+m, U) = X

−m≤j≤m

j6=0

J(vc, wt+j , U) (7)

Here, J(vc, wt+j , U) represents an arbitrary loss term for the center word c = wt and outside word

wt+j . J(vc, wt+j , U) could be Jnaive-softmax(vc, wt+j , U) or Jneg-sample(vc, wt+j , U), depending on your

implementation.

Write down three partial derivatives:

(i) ∂Jskip-gram(vc, wt−m, . . . wt+m, U)/∂U

(ii) ∂Jskip-gram(vc, wt−m, . . . wt+m, U)/∂vc

(iii) ∂Jskip-gram(vc, wt−m, . . . wt+m, U)/∂vw when w 6= c

Write your answers in terms of ∂J(vc, wt+j , U)/∂U and ∂J(vc, wt+j , U)/∂vc. This is very simple –

each solution should be one line.

Once you’re done: Given that you computed the derivatives of J(vc, wt+j , U) with respect to all the

model parameters U and V in parts (a) to (c), you have now computed the derivatives of the full loss

function Jskip-gram with respect to all parameters. You’re ready to implement word2vec!

2 Coding: Implementing word2vec (18 points)

In this part you will implement the word2vec model and train your own word vectors with stochastic gradient

descent (SGD). Before you begin, first run the following commands within the assignment directory in order

to create the appropriate conda virtual environment. This guarantees that you have all the necessary

packages to complete the assignment. Also note that you probably want to finish the previous math section

before writing the code since you will be asked to implement the math functions in Python. You want to

implement and test the following subsections in order since they are accumulative.

conda env create -f env.yml

conda activate a2

Once you are done with the assignment you can deactivate this environment by running:

Page 3 of 4

CS 224n Assignment #2: word2vec (44 Points)

conda deactivate

For each of the methods you need to implement, we included approximately how many lines of code our

solution has in the code comments. These numbers are included to guide you. You don’t have to stick to

them, you can write shorter or longer code as you wish. If you think your implementation is significantly

longer than ours, it is a signal that there are some numpy methods you could utilize to make your code both

shorter and faster. for loops in Python take a long time to complete when used over large arrays, so we

expect you to utilize numpy methods. We will be checking the efficiency of your code. You will be able to

see the results of the autograder when you submit your code to Gradescope, we recommend submitting

early and often.

(a) (12 points) We will start by implementing methods in word2vec.py. You can test a particular

method by running python word2vec.py m where m is the method you would like to test. For

example, you can test the sigmoid method by running python word2vec.py sigmoid.

(i) Implement the sigmoid method, which takes in a vector and applies the sigmoid function to it.

(ii) Implement the softmax loss and gradient in the naiveSoftmaxLossAndGradient method.

(iii) Implement the negative sampling loss and gradient in the negSamplingLossAndGradient

method.

(iv) Implement the skip-gram model in the skipgram method.

When you are done, test your entire implementation by running python word2vec.py.

(b) (4 points) Complete the implementation for your SGD optimizer in the sgd method of sgd.py. Test

your implementation by running python sgd.py.

(c) (2 points) Show time! Now we are going to load some real data and train word vectors with everything

you just implemented! We are going to use the Stanford Sentiment Treebank (SST) dataset to train

word vectors, and later apply them to a simple sentiment analysis task. You will need to fetch the

datasets first. To do this, run sh get datasets.sh. There is no additional code to write for this

part; just run python run.py.

Note: The training process may take a long time depending on the efficiency of your implementation

and the compute power of your machine (an efficient implementation takes one to two hours).

Plan accordingly!

After 40,000 iterations, the script will finish and a visualization for your word vectors will appear.

It will also be saved as word vectors.png in your project directory. Include the plot in your

homework write up. Briefly explain in at most three sentences what you see in the plot.

3 Submission Instructions

You shall submit this assignment on GradeScope as two submissions – one for “Assignment 2 [coding]”

and another for ‘Assignment 2 [written]”:

(a) Run the collect submission.sh script to produce your assignment2.zip file.

(b) Upload your assignment2.zip file to GradeScope to “Assignment 2 [coding]”.

(c) Upload your written solutions to GradeScope to “Assignment 2 [written]”.

Page 4 of 4