cs231n assignment3(LSTM_Captioning)

The vanilla RNN from the previous exercise has a problem: it cannot handle long-term dependencies. As a sequence gets long, the connection between earlier and later information grows weaker and weaker until it effectively disappears, i.e. the gradients vanish.

The GRU (Gated Recurrent Unit)

The GRU uses an update gate (the GRU actually appeared later than the LSTM) to maintain a long-term memory.

Consider an example:

The cat, which already ate …, was full.

Here "cat" and "was" are separated by a long relative clause, and across that span a vanilla RNN may no longer remember the earlier information.

So we introduce a memory cell $c$ to represent the long-term memory, and an update gate $f_u$ that decides whether $c$ should be updated at each step. Because $c$ does not have to change at every step, information in it can persist for a long time.

The formulas are given below.
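A minimal sketch of the simplified GRU update, in the notation above (the full GRU also has a relevance gate, which I omit; $W_c$, $b_c$, $W_u$, $b_u$ are my labels for the parameters of the candidate and the update gate):

$$\tilde{c}^{\langle t \rangle} = \tanh\!\left(W_c\,[\,c^{\langle t-1 \rangle},\, x^{\langle t \rangle}\,] + b_c\right)$$

$$f_u = \sigma\!\left(W_u\,[\,c^{\langle t-1 \rangle},\, x^{\langle t \rangle}\,] + b_u\right)$$

$$c^{\langle t \rangle} = f_u \odot \tilde{c}^{\langle t \rangle} + (1 - f_u) \odot c^{\langle t-1 \rangle}$$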

The sigmoid saturates easily, so $f_u$ is usually very close to 1 or 0; when it is close to 0 the old memory $c^{\langle t-1 \rangle}$ is copied forward unchanged, which is what lets information survive over long spans.

How the LSTM Works

Once you have the concept of a gate, the LSTM is fairly easy to understand.

An LSTM cell has three gates, described below.

Forget Gate

The forget gate decides how much of the previous cell state $c^{\langle t-1 \rangle}$ to forget.
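A typical form, by analogy with the GRU gate above ($f_f$, $W_f$, $b_f$ are my own labels for the forget gate and its parameters):

$$f_f = \sigma\!\left(W_f\,[\,h^{\langle t-1 \rangle},\, x^{\langle t \rangle}\,] + b_f\right)$$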

Update Gate

The update gate decides how much of the newly computed candidate $\tilde{c}^{\langle t \rangle}$ to keep.
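It has the same shape as the forget gate, just with its own parameters (again my labels):

$$f_u = \sigma\!\left(W_u\,[\,h^{\langle t-1 \rangle},\, x^{\langle t \rangle}\,] + b_u\right)$$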

At the same time we compute the candidate value of the memory cell.
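In the usual formulation the candidate comes from the previous hidden state and the current input ($W_c$, $b_c$ are my labels for its parameters):

$$\tilde{c}^{\langle t \rangle} = \tanh\!\left(W_c\,[\,h^{\langle t-1 \rangle},\, x^{\langle t \rangle}\,] + b_c\right)$$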

With the update gate, the forget gate, and the new candidate in hand, the memory cell can be updated.
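That gives the standard cell-state update, gating the candidate by $f_u$ and the old state by $f_f$:

$$c^{\langle t \rangle} = f_u \odot \tilde{c}^{\langle t \rangle} + f_f \odot c^{\langle t-1 \rangle}$$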

Output Gate

The update and forget gates are both about updating the cell state $c^{\langle t \rangle}$; the output gate, by contrast, controls the hidden state $h^{\langle t \rangle}$ that the cell outputs.
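In the standard formulation that is one more sigmoid gate plus a tanh of the cell state ($f_o$, $W_o$, $b_o$ are my labels):

$$f_o = \sigma\!\left(W_o\,[\,h^{\langle t-1 \rangle},\, x^{\langle t \rangle}\,] + b_o\right), \qquad h^{\langle t \rangle} = f_o \odot \tanh\!\left(c^{\langle t \rangle}\right)$$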

That is the standard LSTM. It has many variants, which I will not cover here. Now on to the assignment.

LSTM Step

Forward Pass

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    """
    Forward pass for a single timestep of an LSTM.

    Inputs:
    - x: Input data, of shape (N, D)
    - prev_h: Previous hidden state, of shape (N, H)
    - prev_c: Previous cell state, of shape (N, H)
    - Wx: Input-to-hidden weights, of shape (D, 4H)
    - Wh: Hidden-to-hidden weights, of shape (H, 4H)
    - b: Biases, of shape (4H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - next_c: Next cell state, of shape (N, H)
    - cache: Tuple of values needed for backward pass.
    """
    # Assumes numpy as np and the assignment's sigmoid helper are in scope.
    N, D = x.shape
    # All four gate pre-activations in one matrix multiply: shape (N, 4H).
    A = np.dot(x, Wx) + np.dot(prev_h, Wh) + b
    # Split into four blocks: A[:, 0] = input gate i, A[:, 1] = forget gate f,
    # A[:, 2] = output gate o, A[:, 3] = block input g.
    A = A.reshape(N, 4, -1)
    # c_t = f * c_{t-1} + i * g
    next_c = sigmoid(A[:, 1, :]) * prev_c + sigmoid(A[:, 0, :]) * np.tanh(A[:, 3, :])
    # h_t = o * tanh(c_t)
    next_h = sigmoid(A[:, 2, :]) * np.tanh(next_c)

    cache = (A, x, prev_h, prev_c, Wx, Wh, b, next_h, next_c)

    return next_h, next_c, cache
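As a quick sanity check of the shapes, here is a toy snippet of mine; it assumes lstm_step_forward above is pasted into the same script, and the sigmoid below just mirrors the assignment's helper:

import numpy as np

def sigmoid(x):
    # stand-in for the assignment's sigmoid helper used by lstm_step_forward
    return 1.0 / (1.0 + np.exp(-x))

np.random.seed(0)
N, D, H = 3, 4, 5
x = np.random.randn(N, D)
prev_h = np.random.randn(N, H)
prev_c = np.random.randn(N, H)
Wx = np.random.randn(D, 4 * H)
Wh = np.random.randn(H, 4 * H)
b = np.random.randn(4 * H)

next_h, next_c, _ = lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)
print(next_h.shape, next_c.shape)  # both should be (3, 5)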

Backward Pass

There is no need to derive the formulas by hand; just backpropagate through the computation graph one node at a time.

def lstm_step_backward(dnext_h, dnext_c, cache):
    """
    Backward pass for a single timestep of an LSTM.

    Inputs:
    - dnext_h: Gradients of next hidden state, of shape (N, H)
    - dnext_c: Gradients of next cell state, of shape (N, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data, of shape (N, D)
    - dprev_h: Gradient of previous hidden state, of shape (N, H)
    - dprev_c: Gradient of previous cell state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    A, x, prev_h, prev_c, Wx, Wh, b, next_h, next_c = cache

    # Gradient w.r.t. the output gate o = sigmoid(A[:, 2, :]).
    dsigmoid_a2 = dnext_h * np.tanh(next_c)
    # next_c also feeds next_h = o * tanh(next_c); add that path to dnext_c.
    # Build a new array instead of += so the caller's dnext_c is not mutated.
    dnext_c = dnext_c + dnext_h * sigmoid(A[:, 2, :]) * (1 - np.tanh(next_c) ** 2)
    # next_c = f * prev_c + i * g, so:
    dsigmoid_a1 = dnext_c * prev_c               # w.r.t. forget gate f
    dsigmoid_a0 = dnext_c * np.tanh(A[:, 3, :])  # w.r.t. input gate i
    dtanh_a3 = dnext_c * sigmoid(A[:, 0, :])     # w.r.t. block input g
    # Backprop through the sigmoid/tanh nonlinearities to the pre-activations A.
    dA = np.zeros(A.shape)
    dA[:, 0, :] = dsigmoid_a0 * sigmoid(A[:, 0, :]) * (1 - sigmoid(A[:, 0, :]))
    dA[:, 1, :] = dsigmoid_a1 * sigmoid(A[:, 1, :]) * (1 - sigmoid(A[:, 1, :]))
    dA[:, 2, :] = dsigmoid_a2 * sigmoid(A[:, 2, :]) * (1 - sigmoid(A[:, 2, :]))
    dA[:, 3, :] = dtanh_a3 * (1 - np.tanh(A[:, 3, :]) ** 2)

    dprev_c = dnext_c * sigmoid(A[:, 1, :])
    N, D = x.shape
    dA = dA.reshape(N, -1)  # back to shape (N, 4H)

    # A = x.dot(Wx) + prev_h.dot(Wh) + b
    dx = dA.dot(Wx.T)
    dWx = x.T.dot(dA)
    dprev_h = dA.dot(Wh.T)
    dWh = prev_h.T.dot(dA)
    db = dA.sum(axis=0)

    return dx, dprev_h, dprev_c, dWx, dWh, db

LSTM

This part is almost the same as for the vanilla RNN.

Forward Pass

def lstm_forward(x, h0, Wx, Wh, b):
    """
    Forward pass for an LSTM over an entire sequence of data.

    Inputs:
    - x: Input data of shape (N, T, D)
    - h0: Initial hidden state of shape (N, H)
    - Wx: Weights for input-to-hidden connections, of shape (D, 4H)
    - Wh: Weights for hidden-to-hidden connections, of shape (H, 4H)
    - b: Biases of shape (4H,)

    Returns a tuple of:
    - h: Hidden states for all timesteps of all sequences, of shape (N, T, H)
    - cache: Values needed for the backward pass.
    """
    N, T, D = x.shape
    H = h0.shape[1]
    cache = []
    h = np.zeros((N, T, H))
    prev_h = h0
    # The initial cell state is zero; cell states stay internal and are not returned.
    prev_c = np.zeros((N, H))
    for t in range(T):
        # Step through time, feeding each state back in at the next timestep.
        prev_h, prev_c, cache_t = lstm_step_forward(x[:, t, :], prev_h, prev_c, Wx, Wh, b)
        cache.append(cache_t)
        h[:, t, :] = prev_h

    return h, cache

Backward Pass

def lstm_backward(dh, cache):
    """
    Backward pass for an LSTM over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of hidden states, of shape (N, T, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data of shape (N, T, D)
    - dh0: Gradient of initial hidden state of shape (N, H)
    - dWx: Gradient of input-to-hidden weight matrix of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weight matrix of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    # Peek at the first timestep's cache to recover D.
    _, x, prev_h, prev_c, Wx, Wh, b, next_h, next_c = cache[0]
    N, T, H = dh.shape
    D = x.shape[1]
    dx = np.zeros((N, T, D))
    dh0 = np.zeros((N, H))
    dWx = np.zeros((D, 4 * H))
    dWh = np.zeros((H, 4 * H))
    db = np.zeros(4 * H)
    dprev_c = np.zeros((N, H))
    # Walk backwards through time. dh0 doubles as the running dprev_h: at each step it
    # carries the gradient flowing into that step's hidden state, and after the loop it
    # is exactly the gradient of the initial hidden state h0.
    for t in range(T - 1, -1, -1):
        dx_t, dh0, dprev_c, dWx_t, dWh_t, db_t = lstm_step_backward(
            dh[:, t, :] + dh0, dprev_c, cache[t])
        dx[:, t, :] = dx_t
        # Weight and bias gradients accumulate across timesteps.
        dWx += dWx_t
        dWh += dWh_t
        db += db_t
    return dx, dh0, dWx, dWh, db

The loss, grads, and sample code in CaptioningRNN only need minor changes; the main one is carrying the cell state through the LSTM steps, as sketched below.
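A minimal, self-contained sketch of greedy sampling with the LSTM step, to show that only real change: passing prev_c along with prev_h. The sizes are arbitrary and the parameter names (W_proj, W_embed, W_vocab, ...) follow my own reading of CaptioningRNN, so treat the setup as illustrative; it assumes lstm_step_forward (and its sigmoid helper) from above are in scope.

import numpy as np

# Hypothetical toy setup: batch, feature dim, word-vec dim, hidden dim, vocab size.
N, D_feat, W, H, V = 2, 6, 4, 5, 7
np.random.seed(1)
W_proj = np.random.randn(D_feat, H); b_proj = np.random.randn(H)
W_embed = np.random.randn(V, W)
Wx = np.random.randn(W, 4 * H); Wh = np.random.randn(H, 4 * H); b = np.random.randn(4 * H)
W_vocab = np.random.randn(H, V); b_vocab = np.random.randn(V)
features = np.random.randn(N, D_feat)
start_token, max_length = 0, 5

prev_h = features.dot(W_proj) + b_proj   # initial hidden state from image features
prev_c = np.zeros_like(prev_h)           # the extra state an LSTM needs
captions = np.zeros((N, max_length), dtype=int)
word = np.full(N, start_token)

for t in range(max_length):
    word_embed = W_embed[word]                                              # (N, W)
    prev_h, prev_c, _ = lstm_step_forward(word_embed, prev_h, prev_c, Wx, Wh, b)
    scores = prev_h.dot(W_vocab) + b_vocab                                  # (N, V)
    word = scores.argmax(axis=1)                                            # greedy choice
    captions[:, t] = word

print(captions)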

-------------The End-------------