cs231n assignment2(ConvolutionalNetworks)

Convolution: Naive forward pass

Input data of shape: $(N, C, H_{prev}, W_{prev})$
where $N$ is the number of samples and $C$ is the number of channels.
The next layer's $H, W$ are given by:

$$H = 1 + \frac{H_{prev} + 2 \times pad - HH}{stride}, \qquad W = 1 + \frac{W_{prev} + 2 \times pad - WW}{stride}$$

Output data of shape: $(N, F, H, W)$
where $F$ is the number of filters in this layer.
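
For example, a $32 \times 32$ input convolved with a $3 \times 3$ filter using $pad = 1$ and $stride = 1$ gives $H = 1 + (32 + 2 \cdot 1 - 3)/1 = 32$, so the spatial size is preserved.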

def conv_forward_naive(x, w, b, conv_param):
    """
    A naive implementation of the forward pass for a convolutional layer.

    The input consists of N data points, each with C channels, height H and
    width W. We convolve each input with F different filters, where each filter
    spans all C channels and has height HH and width WW.

    Input:
    - x: Input data of shape (N, C, H, W)
    - w: Filter weights of shape (F, C, HH, WW)
    - b: Biases, of shape (F,)
    - conv_param: A dictionary with the following keys:
      - 'stride': The number of pixels between adjacent receptive fields in the
        horizontal and vertical directions.
      - 'pad': The number of pixels that will be used to zero-pad the input.

    During padding, 'pad' zeros should be placed symmetrically (i.e. equally on
    both sides) along the height and width axes of the input. Be careful not to
    modify the original input x directly.

    Returns a tuple of:
    - out: Output data, of shape (N, F, H', W') where H' and W' are given by
      H' = 1 + (H + 2 * pad - HH) / stride
      W' = 1 + (W + 2 * pad - WW) / stride
    - cache: (x, w, b, conv_param)
    """
    ###########################################################################
    # TODO: Implement the convolutional forward pass.                         #
    # Hint: you can use the function np.pad for padding.                      #
    ###########################################################################
    N, C, H_prev, W_prev = x.shape
    F, _, HH, WW = w.shape
    stride = conv_param['stride']
    pad = conv_param['pad']

    # In Python 3, / always returns a float, so use integer division //
    H_out = 1 + (H_prev + 2 * pad - HH) // stride
    W_out = 1 + (W_prev + 2 * pad - WW) // stride
    Z = np.zeros((N, F, H_out, W_out))
    # zero padding (np.pad returns a copy, so x itself is not modified)
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant', constant_values=0)

    for i in range(N):                   # loop over the batch of training examples
        for h_i in range(H_out):         # loop over vertical axis of the output volume
            for w_i in range(W_out):     # loop over horizontal axis of the output volume
                h_start = h_i * stride
                h_end = h_start + HH
                w_start = w_i * stride
                w_end = w_start + WW

                # note: slice the padded input, not the original x
                xi_slice = x_pad[i, :, h_start:h_end, w_start:w_end]
                for f in range(F):       # loop over channels (= #filters) of the output volume
                    Z[i, f, h_i, w_i] = np.sum(np.multiply(xi_slice, w[f])) + b[f]
    out = Z
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    cache = (x, w, b, conv_param)
    return out, cache
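
A quick way to sanity-check the output shape (a minimal sketch; the sizes below are made up for illustration and conv_forward_naive is assumed to be defined as above):

import numpy as np

N, C, H, W = 2, 3, 8, 8        # batch of 2 inputs with 3 channels
F, HH, WW = 4, 3, 3            # 4 filters of size 3x3
x = np.random.randn(N, C, H, W)
w = np.random.randn(F, C, HH, WW)
b = np.random.randn(F)
conv_param = {'stride': 1, 'pad': 1}

out, cache = conv_forward_naive(x, w, b, conv_param)
# H' = 1 + (8 + 2*1 - 3) // 1 = 8, and likewise for W'
assert out.shape == (N, F, 8, 8)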

Convolution: Naive backward pass

The key line in the forward pass is this one:

Z[i, f, h_i, w_i] = np.sum(np.multiply(xi_slice, w[f])) + b[f]

In the backward pass, the gradients are derived from this line:
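
Since each output element is just an elementwise product summed up plus a bias, the partial derivatives of a single output element are:

$$\frac{\partial Z[i,f,h_i,w_i]}{\partial x\_slice} = w[f], \qquad \frac{\partial Z[i,f,h_i,w_i]}{\partial w[f]} = x\_slice, \qquad \frac{\partial Z[i,f,h_i,w_i]}{\partial b[f]} = 1$$

so by the chain rule, dx (on the padded input slice), dw[f] and db[f] each accumulate $dout[i, f, h_i, w_i]$ times the corresponding factor, summed over every output position the slice, filter, or bias participates in.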

The code is as follows:

def conv_backward_naive(dout, cache):
    """
    A naive implementation of the backward pass for a convolutional layer.

    Inputs:
    - dout: Upstream derivatives.
    - cache: A tuple of (x, w, b, conv_param) as in conv_forward_naive

    Returns a tuple of:
    - dx: Gradient with respect to x
    - dw: Gradient with respect to w
    - db: Gradient with respect to b
    """
    ###########################################################################
    # TODO: Implement the convolutional backward pass.                        #
    ###########################################################################
    x, w, b, conv_param = cache
    stride = conv_param['stride']
    pad = conv_param['pad']

    N, C, H_prev, W_prev = x.shape
    F, _, HH, WW = w.shape
    N, F, H_out, W_out = dout.shape
    # zero padding
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant', constant_values=0)
    dx = np.zeros(x.shape)
    dx_pad = np.zeros(x_pad.shape)
    dw = np.zeros(w.shape)
    db = np.zeros(b.shape)
    for i in range(N):                   # loop over the batch of training examples
        xi = x_pad[i]                    # select ith training example's padded activation
        for h_i in range(H_out):         # loop over vertical axis of the output volume
            for w_i in range(W_out):     # loop over horizontal axis of the output volume
                h_start = h_i * stride
                h_end = h_start + HH
                w_start = w_i * stride
                w_end = w_start + WW
                xi_slice = xi[:, h_start:h_end, w_start:w_end]
                for f in range(F):       # loop over channels (= #filters) of the output volume
                    dx_pad[i, :, h_start:h_end, w_start:w_end] += w[f] * dout[i, f, h_i, w_i]
                    dw[f] += xi_slice * dout[i, f, h_i, w_i]
                    db[f] += dout[i, f, h_i, w_i]
        # strip the padding (this slicing also works when pad == 0)
        dx[i, :, :, :] = dx_pad[i, :, pad:pad + H_prev, pad:pad + W_prev]
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx, dw, db
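
A minimal numerical check of the gradients (a sketch; the sizes are arbitrary, and the cs231n helper eval_numerical_gradient_array could be used instead of the hand-rolled centered difference below):

import numpy as np

np.random.seed(0)
x = np.random.randn(2, 3, 5, 5)
w = np.random.randn(4, 3, 3, 3)
b = np.random.randn(4)
conv_param = {'stride': 1, 'pad': 1}
dout = np.random.randn(2, 4, 5, 5)

out, cache = conv_forward_naive(x, w, b, conv_param)
dx, dw, db = conv_backward_naive(dout, cache)

# centered-difference gradient for b, compared against the analytic db
h = 1e-6
db_num = np.zeros_like(b)
for f in range(b.shape[0]):
    b_pos, b_neg = b.copy(), b.copy()
    b_pos[f] += h
    b_neg[f] -= h
    out_pos, _ = conv_forward_naive(x, w, b_pos, conv_param)
    out_neg, _ = conv_forward_naive(x, w, b_neg, conv_param)
    db_num[f] = np.sum((out_pos - out_neg) * dout) / (2 * h)

print(np.max(np.abs(db - db_num)))   # should be tiny (floating-point noise)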

Max-Pooling: Naive forward

Max-pooling is just a simplified version of the convolution layer.

Input data of shape: $(N, C, H_{prev}, W_{prev})$
where $N$ is the number of samples and $C$ is the number of channels.

The next layer's $H, W$ (padding is not considered here):

$$H = 1 + \frac{H_{prev} - pool\_height}{stride}, \qquad W = 1 + \frac{W_{prev} - pool\_width}{stride}$$

Output data of shape: $(N, C, H, W)$
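
For example, $2 \times 2$ pooling with $stride = 2$ on a $32 \times 32$ feature map gives $H = W = 1 + (32 - 2)/2 = 16$, i.e. each spatial dimension is halved.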

def max_pool_forward_naive(x, pool_param):
    """
    A naive implementation of the forward pass for a max-pooling layer.

    Inputs:
    - x: Input data, of shape (N, C, H, W)
    - pool_param: dictionary with the following keys:
      - 'pool_height': The height of each pooling region
      - 'pool_width': The width of each pooling region
      - 'stride': The distance between adjacent pooling regions

    No padding is necessary here. Output size is given by

    Returns a tuple of:
    - out: Output data, of shape (N, C, H', W') where H' and W' are given by
      H' = 1 + (H - pool_height) / stride
      W' = 1 + (W - pool_width) / stride
    - cache: (x, pool_param)
    """
    out = None
    ###########################################################################
    # TODO: Implement the max-pooling forward pass                            #
    ###########################################################################
    N, C, H_prev, W_prev = x.shape
    pool_height = pool_param['pool_height']
    pool_width = pool_param['pool_width']
    stride = pool_param['stride']

    # In Python 3, / always returns a float, so use integer division //
    H_out = 1 + (H_prev - pool_height) // stride
    W_out = 1 + (W_prev - pool_width) // stride
    Z = np.zeros((N, C, H_out, W_out))

    for i in range(N):                   # loop over the batch of training examples
        for h_i in range(H_out):         # loop over vertical axis of the output volume
            for w_i in range(W_out):     # loop over horizontal axis of the output volume
                h_start = h_i * stride
                h_end = h_start + pool_height
                w_start = w_i * stride
                w_end = w_start + pool_width
                for c in range(C):       # loop over channels of the output volume
                    Z[i, c, h_i, w_i] = np.max(x[i, c, h_start:h_end, w_start:w_end])
    out = Z
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    cache = (x, pool_param)
    return out, cache

Max-Pooling: Naive backward

Differentiating max-pooling works like differentiating the max function: the gradient is 1 at the maximum and 0 everywhere else.

mask = (x == np.max(x))
########## equivalent to ##########
# mask[i, j] = True   if x[i, j] == np.max(x)
# mask[i, j] = False  otherwise
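
A tiny illustration of the mask trick (a sketch with made-up numbers):

import numpy as np

x_slice = np.array([[1., 3.],
                    [2., 0.]])
mask = (x_slice == np.max(x_slice))
# mask is [[False,  True],
#          [False, False]]
# so only the position of the maximum receives the upstream gradient:
print(mask * 5.0)    # [[0., 5.], [0., 0.]]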

The code is as follows:

def max_pool_backward_naive(dout, cache):
    """
    A naive implementation of the backward pass for a max-pooling layer.

    Inputs:
    - dout: Upstream derivatives
    - cache: A tuple of (x, pool_param) as in the forward pass.

    Returns:
    - dx: Gradient with respect to x
    """
    ###########################################################################
    # TODO: Implement the max-pooling backward pass                           #
    ###########################################################################
    x, pool_param = cache
    pool_height = pool_param['pool_height']
    pool_width = pool_param['pool_width']
    stride = pool_param['stride']

    N, C, H_out, W_out = dout.shape

    dx = np.zeros(x.shape)
    for i in range(N):                   # loop over the batch of training examples
        for h_i in range(H_out):         # loop over vertical axis of the output volume
            for w_i in range(W_out):     # loop over horizontal axis of the output volume
                h_start = h_i * stride
                h_end = h_start + pool_height
                w_start = w_i * stride
                w_end = w_start + pool_width
                for c in range(C):       # loop over channels of the output volume
                    x_slice = x[i, c, h_start:h_end, w_start:w_end]
                    mask = (x_slice == np.max(x_slice))
                    dx[i, c, h_start:h_end, w_start:w_end] += mask * dout[i, c, h_i, w_i]
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx
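
A quick shape check for the pooling pair (a minimal sketch; the sizes are arbitrary and both functions are assumed to be defined as above):

import numpy as np

x = np.random.randn(2, 3, 4, 4)
pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}

out, cache = max_pool_forward_naive(x, pool_param)
assert out.shape == (2, 3, 2, 2)    # 1 + (4 - 2) // 2 = 2

dout = np.ones_like(out)
dx = max_pool_backward_naive(dout, cache)
assert dx.shape == x.shape          # each 2x2 window routes its gradient to its max entry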

Spatial batch normalization: forward

Because of the difference in dimensionality, batch normalization in a convolutional network is slightly different from the fully-connected version. A conv layer's input has shape $(N, C, H, W)$, and BN normalizes each channel over the whole batch, i.e. each channel is treated as one feature of a fully-connected input:

x_reshaped = x.transpose(0, 2, 3, 1).reshape(N * H * W, C)

x_reshaped is then the new $(N', D')$ input.
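
This reshape is invertible, so nothing is lost; a quick check (a sketch):

import numpy as np

N, C, H, W = 2, 3, 4, 5
x = np.random.randn(N, C, H, W)
x_reshaped = x.transpose(0, 2, 3, 1).reshape(N * H * W, C)    # (N*H*W, C)
x_back = x_reshaped.reshape(N, H, W, C).transpose(0, 3, 1, 2)
assert np.array_equal(x, x_back)
# column c of x_reshaped collects every (n, h, w) position of channel c,
# which is exactly what per-channel batch normalization needs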
The code is as follows:

def spatial_batchnorm_forward(x, gamma, beta, bn_param):
    """
    Computes the forward pass for spatial batch normalization.

    Inputs:
    - x: Input data of shape (N, C, H, W)
    - gamma: Scale parameter, of shape (C,)
    - beta: Shift parameter, of shape (C,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance. momentum=0 means that
        old information is discarded completely at every time step, while
        momentum=1 means that new information is never incorporated. The
        default of momentum=0.9 should work well in most situations.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: Output data, of shape (N, C, H, W)
    - cache: Values needed for the backward pass
    """
    out, cache = None, []

    ###########################################################################
    # TODO: Implement the forward pass for spatial batch normalization.       #
    #                                                                         #
    # HINT: You can implement spatial batch normalization by calling the      #
    # vanilla version of batch normalization you implemented above.           #
    # Your implementation should be very short; ours is less than five lines. #
    ###########################################################################
    N, C, H, W = x.shape
    # treat every (n, h, w) position as one sample with C features
    x_reshaped = x.transpose(0, 2, 3, 1).reshape(N * H * W, C)
    out_tmp, cache = batchnorm_forward(x_reshaped, gamma, beta, bn_param)
    out = out_tmp.reshape(N, H, W, C).transpose(0, 3, 1, 2)

    return out, cache

Spatial batch normalization: backward

The same idea applies to the backward pass.

def spatial_batchnorm_backward(dout, cache):
    """
    Computes the backward pass for spatial batch normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, C, H, W)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient with respect to inputs, of shape (N, C, H, W)
    - dgamma: Gradient with respect to scale parameter, of shape (C,)
    - dbeta: Gradient with respect to shift parameter, of shape (C,)
    """
    dx, dgamma, dbeta = None, None, None

    ###########################################################################
    # TODO: Implement the backward pass for spatial batch normalization.      #
    #                                                                         #
    # HINT: You can implement spatial batch normalization by calling the      #
    # vanilla version of batch normalization you implemented above.           #
    # Your implementation should be very short; ours is less than five lines. #
    ###########################################################################
    N, C, H, W = dout.shape
    dout_reshaped = dout.transpose(0, 2, 3, 1).reshape(N * H * W, C)
    dx_reshaped, dgamma, dbeta = batchnorm_backward(dout_reshaped, cache)
    dx = dx_reshaped.reshape(N, H, W, C).transpose(0, 3, 1, 2)
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return dx, dgamma, dbeta

Spatial group normalization: forward

Group Normalization is a method proposed only in 2018 and can be seen as a variant of Layer Normalization. The difference is that the $C$ channels are split into $G$ groups, and each group is normalized separately.

Judging from the results figure in the paper, it is only about average on CNNs; it may be worth choosing when the batch size is small.

For the implementation, first think about it the same way as the convolutional version of BN above: within a batch of shape $(N, C, H, W)$, which entries should be normalized together?

For group normalization,

num = C // G
x_group = x.reshape(N*G, num*H*W)

and then it can be treated just like BN (or LN).
Also note that gamma and beta are per-channel parameters, so whenever they enter the computation the data has to be reshaped back to $(N, C, H, W)$.
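
A small check that the grouping reshape behaves as intended (a sketch; G, C, and the other sizes are made-up illustration values):

import numpy as np

N, C, H, W, G = 2, 6, 3, 3, 2
num = C // G                          # channels per group
x = np.random.randn(N, C, H, W)
x_group = x.reshape(N * G, num * H * W)

# row 0 of x_group is sample 0, channels 0..num-1; row 1 is sample 0,
# channels num..C-1; and so on
assert np.allclose(x_group[0], x[0, :num].ravel())
assert np.allclose(x_group[1], x[0, num:].ravel())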

def spatial_groupnorm_forward(x, gamma, beta, G, gn_param):
    """
    Computes the forward pass for spatial group normalization.
    In contrast to layer normalization, group normalization splits each entry
    in the data into G contiguous pieces, which it then normalizes independently.
    Per-feature shifting and scaling are then applied to the data, in a manner
    identical to that of batch normalization and layer normalization.

    Inputs:
    - x: Input data of shape (N, C, H, W)
    - gamma: Scale parameter, of shape (C,)
    - beta: Shift parameter, of shape (C,)
    - G: Integer number of groups to split into, should be a divisor of C
    - gn_param: Dictionary with the following keys:
      - eps: Constant for numeric stability

    Returns a tuple of:
    - out: Output data, of shape (N, C, H, W)
    - cache: Values needed for the backward pass
    """
    out, cache = None, None
    eps = gn_param.get('eps', 1e-5)
    ###########################################################################
    # TODO: Implement the forward pass for spatial group normalization.       #
    # This will be extremely similar to the layer norm implementation.        #
    # In particular, think about how you could transform the matrix so that   #
    # the bulk of the code is similar to both train-time batch normalization  #
    # and layer normalization!                                                #
    ###########################################################################
    N, C, H, W = x.shape
    num = C // G                          # channels per group
    x_group = x.reshape(N * G, num * H * W)

    # same as the LayerNorm forward pass, but over each group
    x_T = x_group.T
    x_mean_T = np.mean(x_T, axis=0)
    x_var_T = np.var(x_T, axis=0)

    x_group_hat_T = (x_T - x_mean_T) / np.sqrt(x_var_T + eps)
    x_group_hat = x_group_hat_T.T         # shape (N*G, num*H*W)

    # back to (N, C, H, W) so gamma and beta can be applied per channel
    x_hat = x_group_hat.reshape(N, C, H, W)

    out = gamma * x_hat + beta
    cache = (G, x_T, x_hat, x_mean_T, x_var_T, gamma, beta, eps)
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return out, cache
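
A quick sanity check of the forward pass (a sketch; here gamma and beta are given the broadcastable shape (1, C, 1, 1), matching how they are used in the code above):

import numpy as np

N, C, H, W, G = 2, 6, 4, 4, 2
x = 5.0 + 4.0 * np.random.randn(N, C, H, W)
gamma = np.ones((1, C, 1, 1))
beta = np.zeros((1, C, 1, 1))

out, _ = spatial_groupnorm_forward(x, gamma, beta, G, {})
# with gamma=1 and beta=0, each group of the output should be ~zero-mean, unit-variance
out_group = out.reshape(N * G, -1)
print(out_group.mean(axis=1))    # ~0
print(out_group.var(axis=1))     # ~1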

Spatial group normalization: backward

The backward pass is a bit harder. While debugging, I initially overlooked changing the N in the LayerNorm code to tmp_N = num * H * W.
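
For reference, these are the LayerNorm-style chain-rule formulas that the code below implements, with $\hat{x} = (x - \mu)/\sqrt{\sigma^2 + \epsilon}$ and the sums taken over the $tmp\_N = num \cdot H \cdot W$ entries of each group:

$$\frac{\partial L}{\partial \sigma^2} = -\frac{1}{2}\sum \frac{\partial L}{\partial \hat{x}}\,(x - \mu)\,(\sigma^2 + \epsilon)^{-3/2}$$

$$\frac{\partial L}{\partial \mu} = \sum \frac{\partial L}{\partial \hat{x}} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2} \cdot \frac{\sum -2(x - \mu)}{tmp\_N}$$

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial \hat{x}} \cdot \frac{1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2} \cdot \frac{2(x - \mu)}{tmp\_N} + \frac{\partial L}{\partial \mu} \cdot \frac{1}{tmp\_N}$$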

def spatial_groupnorm_backward(dout, cache):
    """
    Computes the backward pass for spatial group normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, C, H, W)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient with respect to inputs, of shape (N, C, H, W)
    - dgamma: Gradient with respect to scale parameter, of shape (C,)
    - dbeta: Gradient with respect to shift parameter, of shape (C,)
    """
    ###########################################################################
    # TODO: Implement the backward pass for spatial group normalization.      #
    # This will be extremely similar to the layer norm implementation.        #
    ###########################################################################
    N, C, H, W = dout.shape
    G, x_T, x_hat, x_mean_T, x_var_T, gamma, beta, eps = cache
    num = C // G
    # dx_hat has shape (N, C, H, W)
    dx_hat = dout * gamma

    # carry out the calculation with shape (N*G, num*H*W), as in the forward pass
    dx_hat = dx_hat.reshape(N * G, num * H * W)
    dx_hat_T = dx_hat.T
    tmp_N = num * H * W                   # number of entries per group

    dx_var = -0.5 * np.sum(dx_hat_T * (x_T - x_mean_T) * np.power(x_var_T + eps, -3 / 2), axis=0)
    dx_mean = np.sum(dx_hat_T * (-1 / np.sqrt(x_var_T + eps)), axis=0) + np.sum(-2 * dx_var * (x_T - x_mean_T), axis=0) / tmp_N

    dx_group_T = dx_hat_T / np.sqrt(x_var_T + eps) + dx_var * 2 * (x_T - x_mean_T) / tmp_N + dx_mean / tmp_N
    dx = dx_group_T.T.reshape(N, C, H, W)

    dgamma = np.sum(dout * x_hat, axis=(0, 2, 3)).reshape(1, C, 1, 1)
    dbeta = np.sum(dout, axis=(0, 2, 3)).reshape(1, C, 1, 1)
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx, dgamma, dbeta

-------------The End-------------