cs231n assignment2(BatchNormalization)


Reference paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Paper notes: Batch Normalization

In machine learning, methods tend to work better when their input data consists of uncorrelated features with zero mean and unit variance.
Batch Normalization applies this idea to the input of every layer of the network.

BatchNormalization implementation

BN forward

After normalization, two learnable scale-and-shift parameters, gamma and beta, are introduced so that the range of the normalized values can still be adjusted flexibly.
The formulas are as follows:
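As given in the paper, for a mini-batch \{x_1, \dots, x_m\}, applied independently to each feature dimension:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta$$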

Note: during training the mean and variance are computed from the current batch; at test time the mean and variance used are the exponential moving averages (running mean/var) accumulated during training.

if mode == 'train':
    # compute x_hat and the output y
    x_mean = np.mean(x, axis=0)
    x_var = np.var(x, axis=0)

    x_hat = (x - x_mean) / np.sqrt(x_var + eps)
    out = gamma * x_hat + beta
    # update the running statistics used at test time
    running_mean = momentum * running_mean + (1 - momentum) * x_mean
    running_var = momentum * running_var + (1 - momentum) * x_var
    cache = (x, x_hat, x_mean, x_var, gamma, beta, eps)
elif mode == 'test':
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    out = gamma * x_hat + beta
else:
    raise ValueError('Invalid forward batchnorm mode "%s"' % mode)
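As a quick sanity check, the sketch below assumes the assignment's batchnorm_forward(x, gamma, beta, bn_param) wrapper around the code above (with the default momentum, and with the running statistics written back into bn_param, as in the assignment scaffold). In train mode the per-feature mean/std of the output should be roughly beta/gamma; in test mode the running statistics are used instead.

import numpy as np

np.random.seed(0)
N, D = 200, 3
x = 5 + 4 * np.random.randn(N, D)            # features with non-zero mean and non-unit variance
gamma, beta = np.ones(D), np.zeros(D)
bn_param = {'mode': 'train'}

out, _ = batchnorm_forward(x, gamma, beta, bn_param)
print(out.mean(axis=0))                      # approximately beta, i.e. ~0
print(out.std(axis=0))                       # approximately gamma, i.e. ~1

# accumulate running statistics over a few "training" batches, then switch to test mode
for _ in range(50):
    batchnorm_forward(5 + 4 * np.random.randn(N, D), gamma, beta, bn_param)
bn_param['mode'] = 'test'
out_test, _ = batchnorm_forward(5 + 4 * np.random.randn(N, D), gamma, beta, bn_param)
print(out_test.mean(axis=0), out_test.std(axis=0))   # close to 0 / 1 via the running mean/var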

BN backward

Reference blog post: link
For each batch, the gradients are derived by walking backward through a computational graph.

The computational graph (original figure omitted) breaks the forward pass into the chain x → x − μ → (x − μ)² → σ² → √(σ² + ε) → 1/√(σ² + ε) → x̂ → γ·x̂ → γ·x̂ + β, and each node is differentiated in turn.

forward pass

N, D = x.shape

# step1: calculate mean
mu = 1./N * np.sum(x, axis=0)

# step2: subtract the mean vector from every training example
xmu = x - mu

# step3: following the lower branch - calculate the denominator
sq = xmu ** 2

# step4: calculate variance
var = 1./N * np.sum(sq, axis=0)

# step5: add eps for numerical stability, then sqrt
sqrtvar = np.sqrt(var + eps)

# step6: invert sqrtvar
ivar = 1./sqrtvar

# step7: execute normalization
xhat = xmu * ivar

# step8: now the two transformation steps: scale by gamma
gammax = gamma * xhat

# step9
out = gammax + beta

# store intermediate
cache = (xhat, gamma, xmu, ivar, sqrtvar, var, eps)

backward pass

# unfold the variables stored in cache
xhat, gamma, xmu, ivar, sqrtvar, var, eps = cache

# get the dimensions of the input/output
N, D = dout.shape

# step9
dbeta = np.sum(dout, axis=0)
dgammax = dout # not necessary, but more understandable

# step8
dgamma = np.sum(dgammax * xhat, axis = 0)
dxhat = dgammax * gamma

# step7
divar = np.sum(dxhat * xmu, axis=0)
dxmu1 = dxhat * ivar

# step6
dsqrtvar = -1./(sqrtvar**2) * divar

# step5
dvar = 0.5 * 1./np.sqrt(var+eps) * dsqrtvar

# step4
dsq = 1./N * np.ones((N, D)) * dvar

# step3
dxmu2 = 2 * xmu * dsq

# step2
dx1 = (dxmu1 + dxmu2)
dmu = -1 * np.sum(dxmu1 + dxmu2, axis=0)

# step1
dx2 = 1./N * np.ones((N, D)) * dmu

# step0
dx = dx1 + dx2

BN alternative backward

Collapsing the staged gradients above yields the final closed-form chain-rule expressions, which are also given in the paper.
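For reference, the resulting expressions (as given in the paper, with m the batch size):

$$\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i}\,\gamma$$

$$\frac{\partial \ell}{\partial \sigma_B^2} = \sum_{i=1}^{m}\frac{\partial \ell}{\partial \hat{x}_i}\,(x_i - \mu_B)\cdot\left(-\frac{1}{2}\right)(\sigma_B^2 + \epsilon)^{-3/2}$$

$$\frac{\partial \ell}{\partial \mu_B} = \sum_{i=1}^{m}\frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_B^2}\cdot\frac{\sum_{i=1}^{m}-2(x_i - \mu_B)}{m}$$

$$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_B^2}\cdot\frac{2(x_i - \mu_B)}{m} + \frac{\partial \ell}{\partial \mu_B}\cdot\frac{1}{m}$$

$$\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m}\frac{\partial \ell}{\partial y_i}\,\hat{x}_i, \qquad \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m}\frac{\partial \ell}{\partial y_i}$$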

Key code

x, x_hat, x_mean, x_var, gamma, beta, eps = cache
N, D = x.shape

dx_hat = dout * gamma
dx_var = -0.5 * np.sum(dx_hat * (x - x_mean) * np.power(x_var + eps, -3 / 2), axis=0)
dx_mean = np.sum(dx_hat * (-1. / np.sqrt(x_var + eps)), axis=0) + np.sum(-2 * dx_var * (x - x_mean), axis=0) / N

dx = dx_hat / np.sqrt(x_var + eps) + dx_var * 2 * (x - x_mean) / N + dx_mean / N
dgamma = np.sum(dout * x_hat, axis=0)
dbeta = np.sum(dout, axis=0)
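A quick way to convince yourself that the staged version and this closed-form version agree is to compare their outputs on random data. This is only a sketch: it assumes the assignment's batchnorm_forward, batchnorm_backward and batchnorm_backward_alt functions wrap the snippets above, and that batchnorm_forward stores a cache both backward functions can read (in the assignment you choose the cache layout yourself, so the two snippets above would have to agree on one layout). rel_error is the usual cs231n helper, re-defined here.

import numpy as np

def rel_error(a, b):
    # maximum relative error, as used throughout the cs231n notebooks
    return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))

np.random.seed(1)
N, D = 100, 5
x = np.random.randn(N, D)
gamma, beta = np.random.randn(D), np.random.randn(D)
dout = np.random.randn(N, D)

_, cache = batchnorm_forward(x, gamma, beta, {'mode': 'train'})
dx1, dgamma1, dbeta1 = batchnorm_backward(dout, cache)       # staged, graph-based version
dx2, dgamma2, dbeta2 = batchnorm_backward_alt(dout, cache)   # closed-form version

print('dx error:    ', rel_error(dx1, dx2))
print('dgamma error:', rel_error(dgamma1, dgamma2))
print('dbeta error: ', rel_error(dbeta1, dbeta2))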

Fully Connected Nets with BN

With the BN module implemented, the next step is to insert BN layers into the FullyConnectedNet class from the previous exercise.

# __init__ function
if self.normalization == 'batchnorm':
    for i in range(1, L):
        self.params['gamma' + str(i)] = np.ones(hidden_dims[i - 1])
        self.params['beta' + str(i)] = np.zeros(hidden_dims[i - 1])

# loss function
A = X
L = self.num_layers
for i in range(1, L):
    Z, params['cache1' + str(i)] = affine_forward(A, self.params['W' + str(i)], self.params['b' + str(i)])
    if self.normalization == 'batchnorm':
        Z, params['cache2' + str(i)] = batchnorm_forward(Z, self.params['gamma' + str(i)], self.params['beta' + str(i)], self.bn_params[i - 1])
    A, params['cache3' + str(i)] = relu_forward(Z)
scores, cache = affine_forward(A, self.params['W' + str(L)], self.params['b' + str(L)])

# If test mode, return early
if mode == 'test':
    return scores

loss, dscores = softmax_loss(scores, y)
sum_W_norm = 0.0
for i in range(self.num_layers):
    sum_W_norm += np.sum(self.params['W' + str(i + 1)] ** 2)
loss += 0.5 * self.reg * sum_W_norm

dA, grads['W' + str(L)], grads['b' + str(L)] = affine_backward(dscores, cache)
for i in range(L - 1, 0, -1):
    dZ = relu_backward(dA, params['cache3' + str(i)])
    if self.normalization == 'batchnorm':
        dZ, grads['gamma' + str(i)], grads['beta' + str(i)] = batchnorm_backward_alt(dZ, params['cache2' + str(i)])
    dA, grads['W' + str(i)], grads['b' + str(i)] = affine_backward(dZ, params['cache1' + str(i)])
for i in range(L):
    grads['W' + str(i + 1)] += self.reg * self.params['W' + str(i + 1)]
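As a minimal sanity check of the wiring (a sketch: the input_dim and num_classes keyword arguments and the loss(X, y) interface are assumed from the assignment's FullyConnectedNet, not shown above):

import numpy as np

np.random.seed(231)
N, D, H1, H2, C = 4, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,
                          weight_scale=5e-2, normalization='batchnorm')
loss, grads = model.loss(X, y)
print('initial loss:', loss)     # should be close to ln(10) ≈ 2.3 before any training
print(sorted(grads.keys()))      # W/b for every layer, gamma/beta for the hidden layers only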

Batch Normalization experiment results

Activation function: ReLU
Optimization method: Adam

with or without batch normalization

A network with 5 hidden layers of 100 neurons each is trained on 1000 training examples, once with and once without batch normalization.

weight_scale = 2e-2
bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')
model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)

bn_solver = Solver(bn_model, small_data,
                   num_epochs=10, batch_size=50,
                   update_rule='adam',
                   optim_config={
                       'learning_rate': 1e-3,
                   },
                   verbose=True, print_every=20)
bn_solver.train()

solver = Solver(model, small_data,
                num_epochs=10, batch_size=50,
                update_rule='adam',
                optim_config={
                    'learning_rate': 1e-3,
                },
                verbose=True, print_every=20)
solver.train()



It can be seen that with Adam and the same learning rate, the loss of the network with batch normalization decreases noticeably faster, so it fits the training set better. (Generalization is poor for both models, since only a small training set is used.)

different weight initialization scale

20 different weight initialization scales are tried, comparing the behavior with and without batch normalization.
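A sketch of that sweep, assuming the small_data, hidden_dims and Solver setup from the previous experiment (the exact set of scales is an assumption):

import numpy as np

solvers = {}
weight_scales = np.logspace(-4, 0, num=20)   # 20 scales between 1e-4 and 1 (assumed range)
for ws in weight_scales:
    for name, normalization in [('baseline', None), ('batchnorm', 'batchnorm')]:
        model = FullyConnectedNet(hidden_dims, weight_scale=ws, normalization=normalization)
        solver = Solver(model, small_data,
                        num_epochs=10, batch_size=50,
                        update_rule='adam',
                        optim_config={'learning_rate': 1e-3},
                        verbose=False)
        solver.train()
        solvers[(ws, name)] = solver   # afterwards: plot best train/val accuracy vs. weight scale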



It can be seen that with smaller weight scales, the network with normalization performs noticeably better.

different batch size

Because Batch Normalization standardizes each batch separately, when the batch size is small the batch mean and variance are poor estimates of the statistics of the whole dataset.
It can be seen that Batch Normalization only starts to pay off once the batch size reaches a certain value. (Again, because the data set is small, it is enough to look only at how well the training set is fit.)
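A sketch of this comparison, with the same assumptions as above (the batch sizes are an assumption):

batch_sizes = [5, 10, 50]                # assumed values for the comparison
bn_solvers = {}
for bs in batch_sizes:
    bn_model = FullyConnectedNet(hidden_dims, weight_scale=2e-2, normalization='batchnorm')
    bn_solvers[bs] = Solver(bn_model, small_data,
                            num_epochs=10, batch_size=bs,
                            update_rule='adam',
                            optim_config={'learning_rate': 1e-3},
                            verbose=False)
    bn_solvers[bs].train()               # then compare train_acc_history across batch sizes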

Layer Normalization

See the paper: Layer Normalization

Principle

Batch Normalization operates on the input across the batch dimension: for an input of shape (N, D), the mean of each of the D dimensions is taken over the N samples, which is why a relatively large batch size is needed.
Layer Normalization instead computes the mean and variance over each individual sample of shape (1, D), so it does not depend on the batch.
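The difference is only the axis along which the statistics are taken; a minimal numpy illustration (the variable names here are mine):

import numpy as np

x = np.random.randn(4, 6)                         # N=4 samples, D=6 features

# Batch Normalization: one mean/variance per feature, computed over the N samples
bn_mean, bn_var = x.mean(axis=0), x.var(axis=0)   # shape (D,)

# Layer Normalization: one mean/variance per sample, computed over its D features
ln_mean, ln_var = x.mean(axis=1), x.var(axis=1)   # shape (N,)

print(bn_mean.shape, ln_mean.shape)               # (6,) (4,)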

Forward Pass

Since the principle differs from BN only in the dimension being standardized, the code only needs small changes relative to the BN version.

# compute x_hat and the output y by normalizing each sample (work on x.T so the BN code can be reused)
x_T = x.T
x_mean_T = np.mean(x_T, axis=0)
x_var_T = np.var(x_T, axis=0)

x_hat_T = (x_T - x_mean_T) / np.sqrt(x_var_T + eps)
x_hat = x_hat_T.T
out = gamma * x_hat + beta

cache = (x_T, x_hat_T, x_mean_T, x_var_T, gamma, beta, eps)
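A quick check of the per-sample statistics (a sketch, assuming the assignment's layernorm_forward(x, gamma, beta, ln_param) wrapper around the code above):

import numpy as np

np.random.seed(0)
x = 5 + 4 * np.random.randn(20, 10)
gamma, beta = np.ones(10), np.zeros(10)

out, _ = layernorm_forward(x, gamma, beta, {})
print(out.mean(axis=1))   # each row is normalized on its own: per-sample means ≈ 0
print(out.std(axis=1))    # per-sample stds ≈ 1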

Backward Pass

Likewise, the backward pass only needs small changes relative to the BN version.

# note: the cached arrays are the transposed versions from the forward pass, so "x" here
# is x_T with shape (D, N), and N below is actually the feature dimension being averaged over
x, x_hat, x_mean, x_var, gamma, beta, eps = cache
N, D = x.shape

dx_hat = dout * gamma
dx_hat_T = dx_hat.T
dx_var = -0.5 * np.sum(dx_hat_T * (x - x_mean) * np.power(x_var + eps, -3 / 2), axis=0)
dx_mean = np.sum(dx_hat_T * (-1 / np.sqrt(x_var + eps)), axis=0) + np.sum(-2 * dx_var * (x - x_mean), axis=0) / N

dx_T = dx_hat_T / np.sqrt(x_var + eps) + dx_var * 2 * (x - x_mean) / N + dx_mean / N
dx = dx_T.T

dgamma = np.sum(dout * x_hat.T, axis=0)
dbeta = np.sum(dout, axis=0)

return dx, dgamma, dbeta

Experimental results

Because Layer Normalization does not depend on the batch size, its performance appears largely insensitive to the batch size, and it performs reasonably well compared with the baseline.

-------------The End-------------