Datawhale 202103 Ensemble Learning (Part 1) | (Supplement) A Summary of Machine Learning Hyperparameter Tuning Approaches


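Machine Learning Hyperparameter Tuning Summary

  • S1: Fundamentals of hyperparameter tuning
  • S2: Tuning individual models
    • S2.1 SVM
    • S2.2 Decision trees
  • S3: A tuning example
    • S3.1 Tuning approach
    • S3.2 Worked example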

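S1: Fundamentals of hyperparameter tuning

[Goal: understand that hyperparameter tuning adjusts the bias-variance trade-off within the total error]

Take a linear regression model with high-order terms, low-order terms, and a constant term as an example. If the only goal of training were to make every data point lie exactly on the fitted function, the model's loss on that data set would be 0.
However, the data set is split in two: the training set is used to fit the model, so the model tends to match it closely, while the test set is used to check how well the model generalizes.
The bias-variance trade-off
The test mean squared error (MSE) curve is U-shaped, which indicates that two competing forces are at play in the test error. It can be shown that: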
$$E\left(y_{0}-\hat{f}\left(x_{0}\right)\right)^{2}=\operatorname{Var}\left(\hat{f}\left(x_{0}\right)\right)+\left[\operatorname{Bias}\left(\hat{f}\left(x_{0}\right)\right)\right]^{2}+\operatorname{Var}(\varepsilon)$$
In other words, the expected test MSE decomposes into the variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$, and the variance of the error term $\varepsilon$.
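For readers who want the intermediate steps, here is a short derivation sketch. It relies on the usual assumptions, which this note does not state explicitly: $y_0 = f(x_0) + \varepsilon$ with $E[\varepsilon] = 0$, $\varepsilon$ independent of $\hat{f}(x_0)$, and $\operatorname{Bias}(\hat{f}(x_0)) = E[\hat{f}(x_0)] - f(x_0)$.

```latex
\begin{aligned}
E\left(y_0-\hat{f}(x_0)\right)^2
&= E\left(f(x_0)-\hat{f}(x_0)+\varepsilon\right)^2 \\
&= E\left(f(x_0)-\hat{f}(x_0)\right)^2
   + 2\,E[\varepsilon]\,E\left[f(x_0)-\hat{f}(x_0)\right]
   + E\left[\varepsilon^2\right] \\
&= E\left(f(x_0)-\hat{f}(x_0)\right)^2 + \operatorname{Var}(\varepsilon) \\
&= \left(f(x_0)-E[\hat{f}(x_0)]\right)^2
   + E\left(\hat{f}(x_0)-E[\hat{f}(x_0)]\right)^2
   + \operatorname{Var}(\varepsilon) \\
&= \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2
   + \operatorname{Var}\left(\hat{f}(x_0)\right)
   + \operatorname{Var}(\varepsilon)
\end{aligned}
```

The cross term in the fourth line vanishes because $f(x_0)-E[\hat{f}(x_0)]$ is a constant and $E\left[\hat{f}(x_0)-E[\hat{f}(x_0)]\right]=0$.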

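The curves in the figure above can be read in terms of how each function is built up:

  • The yellow curve contains only low-order terms; for example, it might have the form $f(x_0) = ax_0 + b$.
  • The green curve oscillates visibly around the yellow one and probably contains many high-order terms; for example, $f(x_0) = ax_0^{n} + bx_0^{n-1} + \dots + x_0 + k$.
  • The variance term reflects how strongly the fit changes from one training sample to another. It comes mainly from the high-order terms $x^n$, which are hard to pin down, and is more pronounced in the green curve.
  • The bias term reflects a systematic mismatch between the model and the data, such as a poorly matched low-order term $x$ or constant term $k$. It is present in both the yellow and the green curves.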
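The contrast between a low-order and a high-order fit can be tried out numerically. The sketch below is an illustration only: the ground-truth function `true_f`, the noise level, and the sample sizes are all assumptions made for the demo and do not come from the original figure. It fits polynomials of degree 1, 3, and 9 with `np.polyfit` and reports training and test MSE.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration
    return np.sin(2 * np.pi * x)


def sample(n):
    # Draw n noisy observations of true_f on [0, 1] (noise sd 0.3 is an assumption)
    x = rng.uniform(0.0, 1.0, n)
    y = true_f(x) + rng.normal(0.0, 0.3, n)
    return x, y


def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))


x_tr, y_tr = sample(30)    # small training set
x_te, y_te = sample(200)   # larger test set

results = {}
for degree in [1, 3, 9]:
    # Least-squares polynomial fit of the given degree
    coefs = np.polyfit(x_tr, y_tr, degree)
    results[degree] = (
        mse(y_tr, np.polyval(coefs, x_tr)),
        mse(y_te, np.polyval(coefs, x_te)),
    )
    print(f"degree={degree}: train MSE={results[degree][0]:.3f}, "
          f"test MSE={results[degree][1]:.3f}")
```

Training MSE cannot increase as the degree grows, because each lower-degree model is a special case of the higher-degree one. Test MSE carries no such guarantee, which is where the U shape comes from.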

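To bring the test MSE to its minimum, the squared bias and the variance have to be kept small at the same time. Since both are non-negative, the expected test MSE can never fall below the variance of the error term. We therefore call $\operatorname{Var}(\varepsilon)$ the difficulty of the modeling task: once the task is fixed this quantity cannot be changed, which is why it is also called the irreducible error.

[PS: Hyperparameter tuning is the process of adjusting hyperparameters and retraining the algorithm until bias and variance are balanced.]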
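The decomposition can also be checked by simulation. The following is a minimal Monte Carlo sketch under assumed conditions: the ground truth `true_f`, the noise standard deviation `NOISE_SD`, the query point `X0`, and the helper `decompose` are all hypothetical names introduced here for illustration. It refits a polynomial on many freshly drawn training sets and compares the average squared error at `X0` with the sum of the three terms.

```python
import numpy as np

rng = np.random.default_rng(42)
NOISE_SD = 0.3   # assumed standard deviation of the error term
X0 = 0.25        # assumed query point x_0


def true_f(x):
    # Hypothetical ground-truth function, used only for this demo
    return np.sin(2 * np.pi * x)


def decompose(degree, n_repeats=2000, n_train=30):
    """Estimate MSE, variance and squared bias of a polynomial fit at X0."""
    preds = np.empty(n_repeats)
    sq_errors = np.empty(n_repeats)
    for i in range(n_repeats):
        # Fresh training set for every repetition
        x = rng.uniform(0.0, 1.0, n_train)
        y = true_f(x) + rng.normal(0.0, NOISE_SD, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coefs, X0)
        # Fresh noisy test observation at X0
        y0 = true_f(X0) + rng.normal(0.0, NOISE_SD)
        sq_errors[i] = (y0 - preds[i]) ** 2
    return {
        "mse": float(sq_errors.mean()),
        "variance": float(preds.var()),
        "bias_sq": float((preds.mean() - true_f(X0)) ** 2),
        "noise": NOISE_SD ** 2,
    }


summary = {degree: decompose(degree) for degree in (1, 3)}
for degree, parts in summary.items():
    total = parts["variance"] + parts["bias_sq"] + parts["noise"]
    print(f"degree={degree}: mse={parts['mse']:.3f}, sum of terms={total:.3f}")
```

Because the estimates are based on a finite number of repetitions, the two printed numbers should agree only approximately. The `noise` entry is the floor that no choice of degree can remove.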

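S2: Tuning individual models

S2.1 SVM

There are two important parameters:

  • C is the penalty coefficient applied to the slack variables. The larger C is, the more heavily margin violations are penalized, so the decision boundary bends to accommodate more training points, noisy ones included.
  • Gamma describes how spread out the data are. In the RBF kernel it can be understood as the inverse of the variance of a normal distribution (gamma = 1 / (2σ²)): a large variance and small gamma mean the data are spread out, while a small variance and large gamma mean the data are concentrated.
    Goal (the heuristic used in this note): prefer a small C and a large Gamma. The stated ideal is C = 0 and Gamma = 1, although scikit-learn's SVC requires C to be strictly positive, so C = 0 can only be approached, not set.
    [PS: the derivation will be added later]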
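The two parameters can be explored on a toy problem. The sketch below is illustrative only: the `make_moons` data set, the noise level, and the specific C and gamma values are assumptions picked for the demo, not values from this note. It first checks the relation gamma = 1 / (2σ²) against scikit-learn's `rbf_kernel`, then trains an RBF `SVC` at four corner settings and records training and test accuracy.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1) RBF kernel: K(a, b) = exp(-gamma * ||a - b||^2) with gamma = 1 / (2 * sigma^2)
sigma = 2.0
gamma = 1.0 / (2.0 * sigma ** 2)
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 2.0]])
k_sklearn = rbf_kernel(a, b, gamma=gamma)[0, 0]
k_manual = np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))
print("kernel values:", k_sklearn, k_manual)

# 2) Effect of C and gamma on a noisy two-class toy data set (assumed for the demo)
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for C in [0.1, 1000]:
    for g in [0.1, 100]:
        clf = SVC(kernel="rbf", C=C, gamma=g).fit(X_tr, y_tr)
        scores[(C, g)] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
        print(f"C={C}, gamma={g}: train={scores[(C, g)][0]:.3f}, "
              f"test={scores[(C, g)][1]:.3f}, "
              f"support vectors={clf.n_support_.sum()}")
```

Comparing the training and test columns for each setting gives a quick feel for which combinations fit the training data tightly and which generalize.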

S2.2 Decision trees

[PS: the derivation will be added later]

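S3: A tuning example

S3.1 Tuning approach

  1. Start by training as large a model as possible, so that bias is kept as low as possible, and move on to the next step only once the model is confirmed not to overfit.
  2. With the model at its largest, keep reducing the risk of overfitting (that is, keep reducing variance, for example by keeping the SVM penalty C as small as possible) while limiting the effect of bias (for example by keeping the SVM Gamma as large as possible).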
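One way to check whether a setting overfits or underfits is a validation curve. The sketch below is an assumption-laden illustration rather than part of the original workflow: it uses scikit-learn's bundled `load_digits` data (truncated to 600 samples so it runs quickly and offline) instead of the LFW faces, and sweeps only `gamma` with C fixed at 1.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Stand-in data set (assumption): bundled digits, truncated for speed
X, y = load_digits(return_X_y=True)
X, y = X[:600], y[:600]

gammas = np.logspace(-5, -1, 5)
train_scores, valid_scores = validation_curve(
    SVC(kernel="rbf", C=1.0), X, y,
    param_name="gamma", param_range=gammas, cv=3,
)

# Average over the cross-validation folds
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)
gap = train_mean - valid_mean  # a large gap suggests overfitting

for g, tr, va in zip(gammas, train_mean, valid_mean):
    print(f"gamma={g:.0e}: train={tr:.3f}, validation={va:.3f}")
```

Low scores on both curves point to underfitting; a high training score combined with a much lower validation score points to overfitting.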

S3.2 Worked example

Homework:
Run a hands-on exercise with the fetch_lfw_people data set.

# Download the data
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.datasets import fetch_lfw_people

faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)

# Plot a few faces to see what the data look like
fig, ax = plt.subplots(3, 5)
fig.subplots_adjust(left=0.0625, right=1.2, wspace=1)
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])

Output:
['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush'
 'Gerhard Schroeder' 'Hugo Chavez' 'Junichiro Koizumi' 'Tony Blair']
(1348, 62, 47)

Figure: LFW face demo

Step 1: Build the largest model by controlling the PCA dimension

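from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Split once into training and test sets; the test set is held out to check the classifier
x_train, x_test, y_train, y_test = train_test_split(faces.data, faces.target, random_state=42)

# Try several PCA dimensions, looking for one where the best C is small and the best Gamma is large
for PCA_ in [50, 100, 150, 200, 250, 300, 350, 400]:
    # Reduce dimensionality with PCA, then build the SVM classifier
    pca = PCA(n_components=PCA_, whiten=True, random_state=42)
    svc = SVC(kernel='rbf', class_weight='balanced')
    model = make_pipeline(pca, svc)

    # Grid search with cross-validation over C (penalty on the slack variables)
    # and gamma (width of the RBF kernel) to find the best combination
    param_grid = {'svc__C': [1, 5, 10, 20, 30], 'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
    grid = GridSearchCV(model, param_grid)

    grid.fit(x_train, y_train)
    print("PCA ", str(PCA_), " \tbest parameter -> ", grid.best_params_)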
PCA  50  	best parameter ->  {'svc__C': 10, 'svc__gamma': 0.005}
PCA  100  	best parameter ->  {'svc__C': 5, 'svc__gamma': 0.005}
PCA  150  	best parameter ->  {'svc__C': 10, 'svc__gamma': 0.001}
PCA  200  	best parameter ->  {'svc__C': 5, 'svc__gamma': 0.001}
PCA  250  	best parameter ->  {'svc__C': 5, 'svc__gamma': 0.001}
PCA  300  	best parameter ->  {'svc__C': 10, 'svc__gamma': 0.0001}
PCA  350  	best parameter ->  {'svc__C': 10, 'svc__gamma': 0.0005}
PCA  400  	best parameter ->  {'svc__C': 10, 'svc__gamma': 0.0005}
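The loop above prints only `best_params_`, so the different PCA dimensions cannot be compared by score. One possible alternative, sketched here under assumptions, is to put the PCA dimension into the parameter grid and read `best_score_` from a single search. To keep the example small and runnable offline it uses the bundled `load_digits` data as a stand-in for LFW, and the grid values are hypothetical choices for that data set, not recommendations for the faces.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in data set (assumption): digits instead of LFW faces
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

model = make_pipeline(
    PCA(whiten=True, random_state=42),
    SVC(kernel="rbf", class_weight="balanced"),
)

# make_pipeline names each step after its lowercased class: 'pca' and 'svc'
param_grid = {
    "pca__n_components": [10, 20, 30],   # hypothetical values for the digits data
    "svc__C": [1, 10],
    "svc__gamma": [0.01, 0.05],
}

grid = GridSearchCV(model, param_grid, cv=3)
grid.fit(X_tr, y_tr)

print("best parameters:", grid.best_params_)
print("best mean cross-validation accuracy:", round(grid.best_score_, 3))
```

The full table of scores for every combination is available in `grid.cv_results_`, which makes it possible to see how close the runner-up settings are.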

Step 2: Following the small-C, large-Gamma rule, choose 100 as the largest PCA dimension that satisfies it.

# Choose PCA = 100 as the most suitable dimension
PCA_ = 100
x_train, x_test, y_train, y_test = train_test_split(faces.data, faces.target, random_state=42)

# Reduce dimensionality with PCA, then build the SVM classifier
pca = PCA(n_components=PCA_, whiten=True, random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)

# Grid search with cross-validation over C (penalty on the slack variables)
# and gamma (width of the RBF kernel) to find the best combination
param_grid = {'svc__C': [1, 5, 10, 20, 30], 'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
grid = GridSearchCV(model, param_grid)

grid.fit(x_train, y_train)
print("PCA ", str(PCA_), " \tbest parameter -> ", grid.best_params_)

Step 3: Refine the parameters C and Gamma further.

# Reduce to 100 PCA components, then build the SVM classifier
pca = PCA(n_components=100, whiten=True, random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)

# Grid search with cross-validation over a finer grid of C and gamma
param_grid = {'svc__C': [1, 3, 5, 7, 10, 13, 15], 'svc__gamma': [0.001, 0.003, 0.005, 0.007, 0.01]}
grid = GridSearchCV(model, param_grid)

grid.fit(x_train, y_train)
print("PCA ", 100, " \tbest parameter -> ", grid.best_params_)
Result PCA  100  	best parameter ->  {'svc__C': 13, 'svc__gamma': 0.007}
# Second pass: same pipeline (100 PCA components plus SVM), narrower grid around the previous optimum
pca = PCA(n_components=100, whiten=True, random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)

# Grid search with cross-validation over C and gamma in unit steps near C=13, gamma=0.007
param_grid = {'svc__C': [10, 11, 12, 13, 14, 15], 'svc__gamma': [0.005, 0.006, 0.007, 0.008, 0.009, 0.01]}
grid = GridSearchCV(model, param_grid)

grid.fit(x_train, y_train)
print("PCA ", 100, " \tbest parameter -> ", grid.best_params_)
Result PCA  100  	best parameter ->  {'svc__C': 11, 'svc__gamma': 0.007}
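The test split created earlier is never used in the steps above. A natural follow-up is to score the tuned pipeline on held-out data. The sketch below shows the pattern under assumptions: it again uses the bundled `load_digits` data as an offline stand-in, and the PCA dimension, C, and gamma are hypothetical values for that data set, not the tuned LFW values.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in data set (assumption): digits instead of LFW faces
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Hypothetical parameters for the digits data, chosen only for this demo
model = make_pipeline(
    PCA(n_components=30, whiten=True, random_state=42),
    SVC(kernel="rbf", class_weight="balanced", C=10, gamma=0.01),
)
model.fit(X_tr, y_tr)

# Evaluate once on data that played no part in fitting or tuning
y_pred = model.predict(X_te)
test_accuracy = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)

print("held-out accuracy:", round(test_accuracy, 3))
print("confusion matrix shape:", cm.shape)
```

With the LFW search from this section, the same check could be written as `grid.predict(x_test)`, since `GridSearchCV` refits the best pipeline on the whole training set by default.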

Step 4: Take C=11, Gamma=0.007 as the best parameters and plot the learning curves.

# Preparation: imports and a helper function for plotting learning curves
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit


def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate 3 plots: the test and training learning curve, the training
    samples vs fit times curve, the fit times vs score curve.

    Parameters
    ----------
    estimator : estimator instance
        An estimator instance implementing `fit` and `predict` methods which
        will be cloned for each validation.

    title : str
        Title for the chart.

    X : array-like of shape (n_samples, n_features)
        Training vector, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    y : array-like of shape (n_samples) or (n_samples, n_features)
        Target relative to ``X`` for classification or regression;
        None for unsupervised learning.

    axes : array-like of shape (3,), default=None
        Axes to use for plotting the curves.

    ylim : tuple of shape (2,), default=None
        Defines minimum and maximum y-values plotted, e.g. (ymin, ymax).

    cv : int, cross-validation generator or an iterable, default=None
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

          - None, to use the default 5-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, default=None
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    train_sizes : array-like of shape (n_ticks,)
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the ``dtype`` is float, it is regarded
        as a fraction of the maximum size of the training set (that is
        determined by the selected validation method), i.e. it has to be within
        (0, 1]. Otherwise it is interpreted as absolute sizes of the training
        sets. Note that for classification the number of samples usually have
        to be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(20, 5))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes,
                       return_times=True)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1,
                         color="r")
    axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1,
                         color="g")
    axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r",
                 label="Training score")
    axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
    axes[0].legend(loc="best")

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, 'o-')
    axes[1].fill_between(train_sizes, fit_times_mean - fit_times_std,
                         fit_times_mean + fit_times_std, alpha=0.1)
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")

    # Plot fit_time vs score
    axes[2].grid()
    axes[2].plot(fit_times_mean, test_scores_mean, 'o-')
    axes[2].fill_between(fit_times_mean, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1)
    axes[2].set_xlabel("fit_times")
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance of the model")

    return plt
# Plot the learning curves
SVM_C = 11
SVM_GAMMA = 0.007

fig, axes = plt.subplots(3, 1, figsize=(10, 15))

title = r"Learning Curves (SVM, RBF kernel, $C={}$, $\gamma={}$)".format(SVM_C, SVM_GAMMA)
# SVC is more expensive so we do a lower number of CV iterations:
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
# Rebuild the tuned pipeline so it matches the one used in the grid search
pca = PCA(n_components=100, whiten=True, random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced', C=SVM_C, gamma=SVM_GAMMA)
model = make_pipeline(pca, svc)
plot_learning_curve(model, title, x_train, y_train, axes=axes, ylim=(0.3, 1.2), cv=cv, n_jobs=4)

# plt.savefig('./IMG/lfw_svm_c{}_gamma{}.jpg'.format(str(SVM_C), str(SVM_GAMMA)))
plt.show()

Figure: SVM training results

References:

  • Datawhale Ensemble Learning (Part 1)
