Weight decay is a common method for coping with overfitting.

Method

Weight decay is equivalent to $L_2$ norm regularization. Regularization adds a penalty term to the model's loss function so that the learned parameter values stay small; it is a common way of coping with overfitting.

$L_2$ norm regularization adds an $L_2$ norm penalty term to the model's original loss function, yielding the function that training actually minimizes. The $L_2$ norm penalty term is the product of a positive constant and the sum of squares of the elements of the model's weight parameters. Take linear regression as an example, with loss

$$l(w_1,w_2,b) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(x_1^{(i)} w_1 + x_2^{(i)} w_2 + b - y^{(i)}\right)^2$$

where $w_1, w_2$ are the weight parameters, $b$ is the bias parameter, the inputs of sample $i$ are $x_1^{(i)}$ and $x_2^{(i)}$, its label is $y^{(i)}$, and $n$ is the number of samples. Collecting the weight parameters into the vector $w=[w_1,w_2]$, the new loss function with the $L_2$ norm penalty term is

$$l(w_1,w_2,b)+\frac{\lambda}{2n}\|w\|^2$$

where the hyperparameter $\lambda>0$. When the weight parameters are all 0, the penalty term is at its minimum. When $\lambda$ is large, the penalty term takes a larger share of the loss function, which usually drives the elements of the learned weight parameters toward 0. When $\lambda$ is set to 0, the penalty term has no effect at all. Expanding the squared $L_2$ norm $\|w\|^2$ above gives $w_1^2+w_2^2$. With the $L_2$ norm penalty term in place, in mini-batch stochastic gradient descent the weights $w_1, w_2$ are updated as follows:

$$w_1 \leftarrow \left(1-\frac{\eta\lambda}{|\mathcal{B}|}\right)w_1-\frac{\eta}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}x_1^{(i)}\left(x_1^{(i)}w_1+x_2^{(i)}w_2+b-y^{(i)}\right)$$

$$w_2 \leftarrow \left(1-\frac{\eta\lambda}{|\mathcal{B}|}\right)w_2-\frac{\eta}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}x_2^{(i)}\left(x_1^{(i)}w_1+x_2^{(i)}w_2+b-y^{(i)}\right)$$

where $\eta$ is the learning rate and $\mathcal{B}$ is the mini-batch of examples.

$L_2$ norm regularization thus makes the weights $w_1$ and $w_2$ first multiply themselves by a number smaller than 1, and then subtract the gradient term without the penalty. This is why $L_2$ norm regularization is also called weight decay. By penalizing model parameters with large absolute values, weight decay adds a constraint to the model being learned, which can help against overfitting. In practice, the sum of squares of the bias elements is sometimes added to the penalty term as well.
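To make this equivalence concrete, here is a minimal NumPy sketch (toy numbers; eta, lambd, and the mini-batch data are hypothetical values, not from the text) checking that one "shrink, then step" weight-decay update coincides with one plain SGD step on the mini-batch loss plus the $L_2$ penalty:

import numpy as np

eta, lambd = 0.1, 3.0                      # learning rate, decay strength
w = np.array([0.5, -0.3])                  # current weights [w1, w2]
b = 0.1
X = np.array([[1.0, 2.0], [0.5, -1.0]])    # mini-batch inputs, |B| = 2
y = np.array([0.9, 0.2])                   # mini-batch labels

err = X @ w + b - y                        # per-example residuals
grad = X.T @ err                           # gradient of the summed squared loss

# Weight-decay view: shrink w first, then take the usual gradient step,
# exactly the iteration given above.
w_decay = (1 - eta * lambd / len(X)) * w - (eta / len(X)) * grad

# Regularization view: plain SGD on loss + (lambd / (2|B|)) * ||w||^2.
w_l2 = w - (eta / len(X)) * (grad + lambd * w)

print(np.allclose(w_decay, w_l2))          # True: the two updates coincide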

High-Dimensional Linear Regression

Generate the sample labels with the following function:
$$y = 0.05 + \sum_{i=1}^p 0.01x_i+\varepsilon$$

where the noise term $\varepsilon$ follows a normal distribution with mean 0 and standard deviation 0.01, and $p$ is the dimensionality. The code below uses $p=200$ with only 20 training examples, so overfitting is easy to trigger.

import gluonbook as gb
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import data as gdata, loss as gloss

# 20 training examples, 100 test examples, 200 input dimensions: far more
# features than training samples
n_train, n_test, num_inputs = 20, 100, 200
true_w, true_b = nd.ones((num_inputs, 1)) * 0.01, 0.05

# generate features and labels according to y = 0.05 + sum(0.01 * x_i) + noise
features = nd.random.normal(shape=(n_train + n_test, num_inputs))
labels = nd.dot(features, true_w) + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)
train_features, test_features = features[:n_train, :], features[n_train:, :]
train_labels, test_labels = labels[:n_train], labels[n_train:]

Implementation from Scratch

import gluonbook as gb
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import data as gdata, loss as gloss
import matplotlib.pyplot as plt

n_train, n_test, num_inputs = 20, 100, 200
true_w, true_b = nd.ones((num_inputs, 1)) * 0.01, 0.05

features = nd.random.normal(shape=(n_train + n_test, num_inputs))
labels = nd.dot(features, true_w) + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)
train_features, test_features = features[:n_train, :], features[n_train:, :]
train_labels, test_labels = labels[:n_train], labels[n_train:]

# initialize the model parameters
def init_params():
    w = nd.random.normal(scale=1, shape=(num_inputs, 1))
    b = nd.zeros(shape=(1,))
    w.attach_grad()
    b.attach_grad()
    return [w, b]

# define the L2 norm penalty term
def l2_penalty(w):
    return (w ** 2).sum() / 2

# define training and testing
batch_size, num_epochs, lr = 1, 100, 0.003

net, loss = gb.linreg, gb.squared_loss

train_iter = gdata.DataLoader(
    gdata.ArrayDataset(train_features, train_labels), batch_size, shuffle=True)

def semilogy(x_vals, y_vals, x_label, y_label, x2_vals=None, y2_vals=None,
             legend=None, figsize=(5.5, 2.5)):
    plt.rcParams['figure.figsize'] = figsize
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.semilogy(x_vals, y_vals)
    if x2_vals and y2_vals:
        plt.semilogy(x2_vals, y2_vals, linestyle=':')
        plt.legend(legend)
    plt.show()

def fit_and_plot(lambd):
    w, b = init_params()
    train_ls, test_ls = [], []
    for _ in range(num_epochs):
        for X, y in train_iter:
            with autograd.record():
                # add the L2 norm penalty term to the loss
                l = loss(net(X, w, b), y) + lambd * l2_penalty(w)
            l.backward()
            gb.sgd([w, b], lr, batch_size)
        train_ls.append(loss(net(train_features, w, b),
                             train_labels).mean().asscalar())
        test_ls.append(loss(net(test_features, w, b),
                            test_labels).mean().asscalar())
    semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'loss',
             range(1, num_epochs + 1), test_ls, ['train', 'test'])
    print('L2 norm of w:', w.norm().asscalar())

# without weight decay
fit_and_plot(lambd=0)
# with weight decay
# fit_and_plot(lambd=3)

Without weight decay (lambd=0), the training error is far smaller than the test error: a clear case of overfitting.

[Figure: training and test loss curves without weight decay]

L2 norm of w: 11.61194

With weight decay (lambd=3), the training error rises, but the test error drops; overfitting is alleviated to some degree, and the learned weight parameters are much closer to 0.

[Figure: training and test loss curves with weight decay]

L2 norm of w: 0.046675965

Gluon Implementation of Weight Decay

import gluonbook as gb
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import data as gdata, loss as gloss, nn
import matplotlib.pyplot as plt

n_train, n_test, num_inputs = 20, 100, 200
true_w, true_b = nd.ones((num_inputs, 1)) * 0.01, 0.05

features = nd.random.normal(shape=(n_train + n_test, num_inputs))
labels = nd.dot(features, true_w) + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)
train_features, test_features = features[:n_train, :], features[n_train:, :]
train_labels, test_labels = labels[:n_train], labels[n_train:]


def semilogy(x_vals, y_vals, x_label, y_label, x2_vals=None, y2_vals=None,
             legend=None, figsize=(4.5, 2.5)):
    plt.rcParams['figure.figsize'] = figsize
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.semilogy(x_vals, y_vals)
    if x2_vals and y2_vals:
        plt.semilogy(x2_vals, y2_vals, linestyle=':')
        plt.legend(legend)
    plt.show()

# define training and testing
batch_size, num_epochs, lr = 1, 100, 0.003

loss = gb.squared_loss

train_iter = gdata.DataLoader(
    gdata.ArrayDataset(train_features, train_labels), batch_size, shuffle=True)

def fit_and_plot(wd):
    net = nn.Sequential()
    net.add(nn.Dense(1))
    net.initialize(init.Normal(sigma=1))

    # decay the weights only; weight parameter names generally end
    # with "weight"
    train_w = gluon.Trainer(net.collect_params('.*weight'), 'sgd',
                            {'learning_rate': lr, 'wd': wd})
    train_b = gluon.Trainer(net.collect_params('.*bias'), 'sgd',
                            {'learning_rate': lr})

    train_ls, test_ls = [], []
    for _ in range(num_epochs):
        for X, y in train_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            # call step on both Trainers to update the weights and the
            # bias separately
            train_b.step(batch_size)
            train_w.step(batch_size)
        train_ls.append(loss(net(train_features),
                             train_labels).mean().asscalar())
        test_ls.append(loss(net(test_features),
                            test_labels).mean().asscalar())
    semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'loss',
             range(1, num_epochs + 1), test_ls, ['train', 'test'])
    print('L2 norm of w:', net[0].weight.data().norm().asscalar())


# without weight decay
fit_and_plot(wd=0)
# with weight decay
# fit_and_plot(wd=3)
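As a side note, a single Trainer over all parameters also accepts 'wd', but then the bias is decayed as well. A minimal sketch of that variant, reusing the same toy network and learning rate as above:

from mxnet import gluon, init
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(1))
net.initialize(init.Normal(sigma=1))

# one Trainer for every parameter: 'wd' now decays the bias too,
# inside the optimizer update itself
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.003, 'wd': 3})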