cs231n assignment3(RNN_Captioning)

This assignment starts from the Image Caption problem: given an image, generate a textual description of it.

The rough approach: use a pretrained CNN to extract features from the image, use those features to initialize the hidden state of an RNN (LSTM), and then let the RNN (LSTM) generate a sentence.

Here the CNN is essentially an encoder that compresses the image into a semantic vector, while the RNN (LSTM) is a decoder, and also a language model, that decodes natural language from that semantic vector.
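
In equations (using the same weight names as the code later in this post), the whole pipeline is roughly:

$h_0 = \text{CNN}(I)\,W_{proj} + b_{proj}$ (project the image features to the hidden size H)
$x_t = W_{embed}[w_t]$ (embed the input word $w_t$ at step t)
$h_t = \tanh(x_t W_x + h_{t-1} W_h + b)$ (vanilla RNN step)
$s_t = h_t W_{vocab} + b_{vocab}$ (scores over the vocabulary at step t)

A softmax over $s_t$ then gives the distribution of the next word.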

COCO dataset

The data for this experiment is Microsoft's COCO dataset, released in 2014, which is the standard benchmark for image captioning.

The COCO dataset contains 80,000 training images and 40,000 validation images, each annotated with 5 captions written by workers on Amazon Mechanical Turk.

Let's look at one sample below.

Vanilla RNN Step

Forward Pass


The step_forward formula of the RNN is very simple: $h_t = \tanh(x_t W_x + h_{t-1} W_h + b)$.

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """
    Run the forward pass for a single timestep of a vanilla RNN.

    Inputs:
    - x: Input data for this timestep, of shape (N, D).
    - prev_h: Hidden state from previous timestep, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - cache: Tuple of values needed for the backward pass.
    """
    Z = np.dot(x, Wx) + np.dot(prev_h, Wh) + b
    next_h = np.tanh(Z)
    cache = (x, Wx, prev_h, Wh, b, next_h)  # saved for the backward pass

    return next_h, cache
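
A quick shape check with random data (just a sketch; it assumes numpy is imported as np and uses the rnn_step_forward above):

# Toy shapes: N=3 examples, D=4 input dims, H=5 hidden units.
N, D, H = 3, 4, 5
x = np.random.randn(N, D)
prev_h = np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, H), np.random.randn(H, H), np.random.randn(H)
next_h, _ = rnn_step_forward(x, prev_h, Wx, Wh, b)
print(next_h.shape)   # (3, 5)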

Backward Pass

The backward pass is similar to the fully-connected layers from earlier assignments and is also quite simple. Note that the derivative of $\tanh(x)$ is $1 - \tanh^2(x)$.

def rnn_step_backward(dnext_h, cache):
    """
    Backward pass for a single timestep of a vanilla RNN.

    Inputs:
    - dnext_h: Gradient of loss with respect to next hidden state, of shape (N, H)
    - cache: Cache object from the forward pass

    Returns a tuple of:
    - dx: Gradients of input data, of shape (N, D)
    - dprev_h: Gradients of previous hidden state, of shape (N, H)
    - dWx: Gradients of input-to-hidden weights, of shape (D, H)
    - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
    - db: Gradients of bias vector, of shape (H,)
    """
    x, Wx, prev_h, Wh, b, next_h = cache
    # the derivative of tanh(Z) is 1 - tanh(Z)**2, and next_h = tanh(Z)
    dZ = (1 - next_h ** 2) * dnext_h  # of shape (N, H)

    dx = np.dot(dZ, Wx.T)             # of shape (N, D)
    dWx = np.dot(x.T, dZ)             # of shape (D, H)
    dprev_h = np.dot(dZ, Wh.T)        # of shape (N, H)
    dWh = np.dot(prev_h.T, dZ)        # of shape (H, H)
    db = np.sum(dZ, axis=0)           # of shape (H,)

    return dx, dprev_h, dWx, dWh, db
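
This isn't from the assignment's own gradient-check utilities, but a quick central-difference check is an easy way to verify the two functions above (scalar loss = sum(next_h), so dnext_h is all ones):

import numpy as np

# Random toy data; rnn_step_forward / rnn_step_backward are the functions above.
np.random.seed(0)
N, D, H = 3, 4, 5
x = np.random.randn(N, D)
prev_h = np.random.randn(N, H)
Wx = np.random.randn(D, H)
Wh = np.random.randn(H, H)
b = np.random.randn(H)

next_h, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)
dnext_h = np.ones_like(next_h)                      # d(sum(next_h)) / d(next_h)
dx, dprev_h, dWx, dWh, db = rnn_step_backward(dnext_h, cache)

def num_grad(f, a, eps=1e-6):
    """Central finite differences of scalar f with respect to array a (modified in place)."""
    grad = np.zeros_like(a)
    it = np.nditer(a, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = a[i]
        a[i] = old + eps
        fp = f()
        a[i] = old - eps
        fm = f()
        a[i] = old
        grad[i] = (fp - fm) / (2 * eps)
        it.iternext()
    return grad

loss = lambda: rnn_step_forward(x, prev_h, Wx, Wh, b)[0].sum()
print(np.max(np.abs(dWx - num_grad(loss, Wx))))     # should be ~1e-8 or smaller
print(np.max(np.abs(dx - num_grad(loss, x))))       # should be ~1e-8 or smaller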

Vanilla RNN

As shown in the figure, the full RNN just chains the RNN step above together with a for loop.

Forward Pass

Combine the forward step written above:

def rnn_forward(x, h0, Wx, Wh, b):
    """
    Run a vanilla RNN forward on an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The RNN uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the RNN forward, we return the hidden states for all timesteps.

    Inputs:
    - x: Input data for the entire timeseries, of shape (N, T, D).
    - h0: Initial hidden state, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - h: Hidden states for the entire timeseries, of shape (N, T, H).
    - cache: Values needed in the backward pass
    """
    N, T, D = x.shape
    _, H = h0.shape
    cache = []
    h = np.zeros((N, T, H))
    prev_h = h0
    for t in range(T):
        # One RNN step per timestep; keep each step's cache for the backward pass.
        prev_h, cache_tmp = rnn_step_forward(x[:, t, :], prev_h, Wx, Wh, b)
        h[:, t, :] = prev_h
        cache.append(cache_tmp)

    return h, cache

Backward Pass

def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H).

    NOTE: 'dh' contains the upstream gradients produced by the
    individual loss functions at each timestep, *not* the gradients
    being passed between timesteps (which you'll have to compute yourself
    by calling rnn_step_backward in a loop).

    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    N, T, H = dh.shape
    x, _, _, _, _, _ = cache[0]
    _, D = x.shape
    dx = np.zeros((N, T, D))
    dWx = np.zeros((D, H))
    dWh = np.zeros((H, H))
    db = np.zeros((H,))
    dprev_h = np.zeros((N, H))
    for t in range(T - 1, -1, -1):
        # The gradient entering timestep t is the upstream gradient at t plus
        # the gradient flowing back from timestep t+1.
        dx[:, t, :], dprev_h, dWx_tmp, dWh_tmp, db_tmp = \
            rnn_step_backward(dh[:, t, :] + dprev_h, cache[t])
        dWx += dWx_tmp
        dWh += dWh_tmp
        db += db_tmp
    dh0 = dprev_h

    return dx, dh0, dWx, dWh, db

Word Embedding

We also need to implement a word embedding layer: given an index into the vocabulary (which has V words, so 0 <= idx < V), map it to a D-dimensional vector.

Our RNN unrolls over a sequence of length T, i.e. it loops T times, so a single input sample $x_i$ has shape (T,) and can be read as a sequence of T words. But we feed a batch at a time (batch size N), so each input actually has shape (N, T).

Each of these N*T entries is a word index (at most V distinct words); word embedding replaces every index with its corresponding vector (e.g. the word "a" gets replaced by its length-D embedding vector), so after word embedding x has shape (N, T, D).

Forward Pass

Replace each element of x (every element lies in the range 0 <= idx < V) with the corresponding row of W.

Writing a for-loop version first helps build intuition.

N, T = x.shape
V, D = W.shape
out = np.zeros((N, T, D))
for n in range(N):
    for t in range(T):
        out[n, t, :] = W[x[n, t]]

The concise version:

def word_embedding_forward(x, W):
    """
    Forward pass for word embeddings. We operate on minibatches of size N where
    each sequence has length T. We assume a vocabulary of V words, assigning each
    word to a vector of dimension D.

    Inputs:
    - x: Integer array of shape (N, T) giving indices of words. Each element idx
      of x must be in the range 0 <= idx < V.
    - W: Weight matrix of shape (V, D) giving word vectors for all words.

    Returns a tuple of:
    - out: Array of shape (N, T, D) giving word vectors for all input words.
    - cache: Values needed for the backward pass
    """
    out = W[x]          # fancy indexing: each index is replaced by its row of W
    cache = x, W

    return out, cache
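
To see what the fancy indexing W[x] does, here is a tiny toy example (the numbers are made up, just for illustration):

import numpy as np

W = np.arange(8).reshape(4, 2)        # V=4 words, each with a D=2 vector
x = np.array([[0, 3], [2, 2]])        # N=2 sequences of length T=2
out = W[x]                            # shape (2, 2, 2): each index replaced by its row of W
print(out[0, 1])                      # row 3 of W -> [6 7]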

Backward Pass

The backward pass uses the np.add.at() function; see the docs for numpy.ufunc.at.

Again, a for-loop version first:

x, W = cache
N, T = x.shape
dW = np.zeros(W.shape)
for n in range(N):
    for t in range(T):
        dW[x[n, t]] += dout[n, t, :]

The concise version:

def word_embedding_backward(dout, cache):
    """
    Backward pass for word embeddings. We cannot back-propagate into the words
    since they are integers, so we only return gradient for the word embedding
    matrix.

    HINT: Look up the function np.add.at

    Inputs:
    - dout: Upstream gradients of shape (N, T, D)
    - cache: Values from the forward pass

    Returns:
    - dW: Gradient of word embedding matrix, of shape (V, D).
    """
    x, W = cache
    dW = np.zeros(W.shape)
    # Scatter-add each upstream gradient row into the row of dW selected by x.
    np.add.at(dW, x, dout)

    return dW
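
Why np.add.at instead of plain dW[x] += dout? The same word index can appear several times in x, and the buffered fancy-index += only applies one update per repeated index, while np.add.at accumulates all of them. A toy demonstration:

import numpy as np

dW_bad = np.zeros((3, 2))
dW_good = np.zeros((3, 2))
idx = np.array([1, 1, 2])                 # index 1 appears twice
upd = np.ones((3, 2))

dW_bad[idx] += upd                        # buffered: row 1 only receives one update
np.add.at(dW_good, idx, upd)              # unbuffered: row 1 accumulates both updates
print(dW_bad[1], dW_good[1])              # [1. 1.] vs [2. 2.]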

Loss and grads

The loss and gradients are assembled according to the layer structure the assignment lays out, then we just run backward through it:

h0, cache1 = affine_forward(features, W_proj, b_proj)   # h0 (N, H)
x, cache2 = word_embedding_forward(captions_in, W_embed) # x (N, T, D)
h, cache3 = rnn_forward(x, h0, Wx, Wh, b)
scores, cache4 = temporal_affine_forward(h, W_vocab, b_vocab)
loss, dscores = temporal_softmax_loss(scores, captions_out, mask)

dh, grads['W_vocab'], grads['b_vocab'] = temporal_affine_backward(dscores, cache4)
dx, dh0, grads['Wx'], grads['Wh'], grads['b'] = rnn_backward(dh, cache3)
grads['W_embed'] = word_embedding_backward(dx, cache2)
_, grads['W_proj'], grads['b_proj'] = affine_backward(dh0, cache1)
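
For reference, the loss() scaffold provided by the assignment (not reproduced in this post) prepares captions_in, captions_out and mask roughly as sketched below; treat this as a sketch of that given setup rather than part of my solution:

# Sketch of the caption/mask preparation done in the provided loss() scaffold.
captions_in = captions[:, :-1]          # words fed into the RNN (includes <START>)
captions_out = captions[:, 1:]          # words the RNN is trained to predict
mask = (captions_out != self._null)     # ignore <NULL> padding in the softmax loss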

Sample

Sampling is used at test time: given input image features, the model generates captions on its own.

The procedure is as follows:

1. Inputs:

features: the CNN features for a batch of images, of shape $[N\times D]$
max_length: the number of RNN steps to run (the length of the generated sequence)

2. Preprocessing: compute $h_0$ and $x_1$

First, project the image features into $h_0$ of shape (N, H), the initial hidden state of the RNN; the rest of the process then "decodes" the image features.
Second, the first RNN input is the <start> token. In raw form this is a vector of shape (N,) whose every entry is the <start> index; it still needs an embedding lookup to become $x_1$ of shape (N, D).

3. Loop the RNN step max_length times:

This part is straightforward: as in the figure, compute $h_t$ from $h_{t-1}$ and $x_t$;
then compute the predicted word $y_t$ from $h_t$;
embedding $y_t$ gives the next input $x_{t+1}$.

4. The final caption is the sequence of word indices fed into the steps (the words before embedding).

The code:

def sample(self, features, max_length=30):
    """
    Inputs:
    - features: Array of input image features of shape (N, D).
    - max_length: Maximum length T of generated captions.

    Returns:
    - captions: Array of shape (N, max_length) giving sampled captions, where
      each element is an integer in the range [0, V). The first element of
      captions should be the first sampled word, not the <START> token.
    """
    N = features.shape[0]
    captions = self._null * np.ones((N, max_length), dtype=np.int32)

    # Unpack parameters
    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
    W_embed = self.params['W_embed']
    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

    h_prev, _ = affine_forward(features, W_proj, b_proj)  # h0, of shape (N, H)
    cap_prev = np.repeat(self._start, N)
    # At the first timestep we feed a batch of N word indices, which become a
    # (N, D) input after embedding; every later timestep works the same way.
    captions[:, 0] = cap_prev
    for t in range(1, max_length):
        # The N words predicted at the previous step are this step's input; embed them first.
        x = W_embed[cap_prev]  # (N, D)
        # Feed the embedded x_t into the RNN step to get this step's hidden state h_t.
        h_next, _ = rnn_step_forward(x, h_prev, Wx, Wh, b)
        # Use h_t to score the vocabulary and take the argmax as the predicted word y_t.
        scores, _ = affine_forward(h_next, W_vocab, b_vocab)
        cap_prev = np.argmax(scores, axis=1)
        # Store the predicted words.
        captions[:, t] = cap_prev
        h_prev = h_next

    return captions
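
As a usage sketch (assuming a trained model, the data dict returned by load_coco_data, and the decode_captions helper in cs231n/coco_utils.py are available), the sampled indices can be turned back into readable text:

# Sketch only: `model`, `features`, `data`, and decode_captions are assumed
# to come from the assignment's notebook and cs231n/coco_utils.py.
captions = model.sample(features, max_length=30)
for sentence in decode_captions(captions, data['idx_to_word']):
    print(sentence)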
-------------The End-------------