Let's Talk About Random Forests

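In my earlier article, 《决策树-非线性分类与回归》 (Decision Trees: Nonlinear Classification and Regression), I focused on how to use decision trees and explained entropy, information gain, and Gini impurity in detail. At the end I briefly mentioned random forests and ran a quick comparison against a decision tree. In this article I'll take a closer look at how to use random forests.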

Starting with decision trees, once again #

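You can't introduce random forests without decision trees, so to keep the logic of this article clear, I'll start with a simple decision tree example.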

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sklearn import datasets
X,y = datasets.make_classification(n_samples=1000,n_features=3,n_redundant=0)
# Instantiate the decision tree classifier and fit it to the data
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X,y)
preds = dt.predict(X)
(y == preds).mean()

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')






1.0

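Sure enough, the result is 1. We trained the model on the training data and then predicted that same data. Because the depth of the tree is unrestricted, it can keep splitting until every training sample is classified correctly, so 100% is exactly what we should expect here.
The first output above shows that the model is initialized with many parameters, and these can strongly affect the result: the depth of the tree, for example, or whether splits are chosen by 'gini' or 'entropy' (if these concepts are unfamiliar, see the earlier article), and so on.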

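Let's first explore the max_depth parameter. To do that, we make the data a bit more complex by adding more explanatory variables, and we randomly split it into a training set and a test set.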

InteractiveShell.ast_node_interactivity = "last_expr"

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

n_features = 200
X,y = datasets.make_classification(750,n_features=n_features,n_informative=5)
training = np.random.choice([True,False],p=[.75,.25],size=len(y))
# Record the test accuracy at each depth so we can plot it later
accuracies = []
for x in np.arange(1,n_features+1):
    dt = DecisionTreeClassifier(max_depth=x)
    dt.fit(X[training],y[training])
    preds = dt.predict(X[~training])
    accuracies.append((preds == y[~training]).mean())
    
# Plot accuracy against max_depth
f,ax = plt.subplots(figsize=(7,5))
ax.plot(range(1,n_features+1),accuracies,color='k')
ax.set_title("Decision Tree Accuracy")
ax.set_xlabel("Max Depth")
ax.set_ylabel("% Correct")
plt.show()

As you can see, accuracy is relatively high toward the left, where max_depth is small. Let's zoom in on that region.

f,ax = plt.subplots(figsize=(7,5))
ax.plot(range(1, n_features+1)[:15], accuracies[:15], color='k')
ax.set_title("Decision Tree Accuracy")
ax.set_xlabel("Max Depth")
ax.set_ylabel("% Correct")
plt.show()

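Of course, we only varied one parameter here; you can also try out how the other parameters affect the fit. I won't go through them all. Next, let's look at random forests.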
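To make the effect of max_depth more concrete, here is a minimal sketch that compares training and test accuracy for a shallow tree and an unrestricted one. The use of train_test_split, the depth of 3, and the random_state values are my own assumptions for illustration; they are not part of the experiment above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data shaped like the experiment above; random_state is an assumption
X, y = make_classification(n_samples=750, n_features=200,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for depth in (3, None):  # None lets the tree grow until every leaf is pure
    dt = DecisionTreeClassifier(max_depth=depth, random_state=0)
    dt.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this depth
    scores[depth] = (dt.score(X_train, y_train), dt.score(X_test, y_test))
    print(depth, scores[depth])
```

If the unrestricted tree scores perfectly on the training split but no better than the shallow tree on the test split, that gap is the overfitting the plots above hint at.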

Random forests #

I won't bother with formal language here; let's go through the logic of a random forest in plain terms.

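A random forest is, as the name says, a forest built in a random way. It contains many decision trees, and each tree is grown independently of the others. Once the forest is built, every tree classifies an incoming sample and casts a vote, and the sample is assigned to whichever class gets the most votes.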

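So, conceptually, a random forest is much less prone to overfitting than a single tree: even if each tree is only 60% accurate, the combined result can be considerably more accurate, as long as the trees do not all make the same mistakes. Random forests are easy to use in scikit-learn, and while implementing one we'll look at which parameters actually affect the result.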
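The intuition behind voting can be checked with a little arithmetic. The sketch below assumes an idealized forest in which every tree is right with the same probability and the trees err independently of one another; real trees are correlated, so treat the numbers as an upper bound on what voting can buy. The function name majority_vote_accuracy is hypothetical.

```python
from math import comb

def majority_vote_accuracy(n_trees, p):
    """Probability that more than half of n_trees independent voters,
    each correct with probability p, choose the right class.
    Intended for an odd n_trees so that ties cannot occur."""
    k_min = n_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n_trees, k) * p**k * (1 - p)**(n_trees - k)
               for k in range(k_min, n_trees + 1))

# Accuracy of the vote for forests of increasing size, each tree 60% accurate
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the accuracy of the vote rises with the number of trees, which is why many weak but diverse trees can add up to a strong classifier.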

from sklearn import datasets
X,y = datasets.make_classification(1000)

# Import the random forest class and fit the model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X,y)

# Check how well we fit the data by computing the accuracy
print("Accuracy:\t", (y == rf.predict(X)).mean())

Accuracy:	0.994

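Huh, that's strange! I'm testing on the very data I just trained on, so why isn't the accuracy 100%? OK, let's look at some intermediate results, and it will become clear how the random forest classifies internally. Come on!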

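Once the model is trained, there are several methods for making predictions on an input. Let's look at them first:

predict(x): classifies the input and returns the predicted class.
predict_proba(x): returns the probability of each class for each sample; these probabilities sum to 1.
predict_log_proba(x): like predict_proba(), but returns the log of the probabilities.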

Now let's take a look at predict_proba(x).

import pandas as pd
probs = rf.predict_proba(X)
probs_df = pd.DataFrame(probs,columns=['0','1'])
probs_df['was_correct'] = rf.predict(X) == y
probs_df.groupby('0').mean()

	1	was_correct
0
0.0	1.0	1.000000
0.1	0.9	1.000000
0.2	0.8	1.000000
0.3	0.7	0.846154
0.4	0.6	0.777778
0.5	0.5	1.000000
0.6	0.4	0.818182
0.7	0.3	1.000000
0.8	0.2	1.000000
0.9	0.1	1.000000
1.0	0.0	1.000000

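See? This table shows the voting results. For example, when class 1 gets a vote share of 0.7, the prediction accuracy drops to about 0.85. The trees disagree on some samples, and where the vote is close the forest is sometimes wrong, so it is not hard to see why the random forest does not reach 100% when predicting its own training data. Let's show the result above as a bar chart.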

f, ax = plt.subplots(figsize=(7, 5))
probs_df.groupby('0').was_correct.mean().plot(kind='bar', ax=ax)
ax.set_title("Accuracy at 0 class probability")
ax.set_ylabel("% Correct")
ax.set_xlabel("% trees for 0")
plt.show()
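To see where these vote shares come from, you can query the individual trees through the fitted forest's estimators_ attribute. The sketch below fits its own small forest (n_estimators=10 and random_state=0 are assumptions chosen to mirror the 0.1 steps in the table) and checks that averaging the per-tree class probabilities reproduces predict_proba, which is how scikit-learn combines the trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(1000, random_state=0)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Class probabilities from each tree: shape (n_trees, n_samples, n_classes)
per_tree = np.array([tree.predict_proba(X) for tree in rf.estimators_])

# Averaging over the trees reproduces the forest's own predict_proba
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, rf.predict_proba(X)))

# Accuracy of each individual tree on the full training set
tree_acc = [(rf.classes_[p.argmax(axis=1)] == y).mean() for p in per_tree]
print(tree_acc)
```

Each tree is fit on a bootstrap sample, so it never sees some of the training points and can misclassify them; the forest's answer is only as good as the combined vote on those points.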

So where does the problem lie? Let's go back and look at some of the main parameters used when training the model:

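n_estimators: the number of trees. This needs little explanation, but note that more trees is not always better: the gains level off while training and prediction cost keep growing;
criterion: the function used to measure the quality of a split, 'gini' or 'entropy';
max_features: the number of features to consider when looking for the best split in each tree;
max_depth: the maximum depth of each tree;
bootstrap: whether to sample with replacement when building each tree; the default is True;
n_jobs: the number of jobs to run in parallel;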
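As an illustration of how these parameters are passed, here is a sketch that sets each of them explicitly. The values are assumptions picked for the example, not recommendations, and oob_score is an extra option not discussed above: with bootstrap sampling it scores every sample using only the trees that did not see it, which gives a more honest estimate than training accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(1000, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees
    criterion="gini",     # split quality measure
    max_features="sqrt",  # features considered at each split
    max_depth=None,       # grow each tree until its leaves are pure
    bootstrap=True,       # sample with replacement for each tree
    oob_score=True,       # assumption: also compute the out-of-bag score
    n_jobs=-1,            # use all available cores
    random_state=0,
)
rf.fit(X, y)

train_acc = rf.score(X, y)  # accuracy on the data the forest was fit on
oob_acc = rf.oob_score_     # accuracy from trees that did not see each sample
print("training accuracy:", train_acc)
print("out-of-bag accuracy:", oob_acc)
```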

Next, we can tune the model using these parameters.

X,y = datasets.make_classification(n_samples=10000,n_features=20,n_informative=15,flip_y=.5,weights=[.2,.8])
training = np.random.choice([True,False],p=[.8,.2],size=y.shape)

rf = RandomForestClassifier()
rf.fit(X[training],y[training])
preds = rf.predict(X[~training])
print("Accuracy:\t", (preds == y[~training]).mean())

Accuracy:	0.639425458148

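As you can see, the accuracy is fairly low. (Note that flip_y=.5 assigns random labels to half of the samples, which caps the accuracy any model can reach on this data.) A quick aside on what accuracy means: it is the proportion of predictions the classifier gets right. It is a useful metric, but a confusion matrix may be easier to interpret here. Below we iterate over different values of max_features to see what difference they make.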

from sklearn.metrics import confusion_matrix
import itertools
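# Note: 'auto' (the same as 'sqrt' for classifiers) was deprecated and later
# removed in newer scikit-learn releases; drop it from the list if it is rejected
max_feature_params = ['auto', 'sqrt', 'log2', .01, .5, .99]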
confusion_matrixes = {}
for max_feature in max_feature_params:
    rf = RandomForestClassifier(max_features=max_feature)
    rf.fit(X[training],y[training])
    # ravel() flattens the 2x2 confusion matrix into a 1-D array
    confusion_matrixes[max_feature] = confusion_matrix(y[~training], rf.predict(X[~training])).ravel()

confusion_df = pd.DataFrame(confusion_matrixes)
f, ax = plt.subplots(figsize=(7, 5))
confusion_df.plot(kind='bar', ax=ax)
ax.legend(loc='best')
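# confusion_matrix(y_true, y_pred) puts the actual class on the rows and the
# guess on the columns, so the flattened order is (actual, guessed)
ax.set_title("Actual vs Guessed (i, j) where i is the actual class and j is the guess.")
ax.grid()
ax.set_xticklabels([str((i, j)) for i, j in itertools.product(range(2), range(2))])
ax.set_xlabel("Actual vs Guessed")
ax.set_ylabel("Count")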


Oh, well. There doesn't seem to be much of a difference. Let's try another parameter instead: the number of trees.

We used a confusion matrix above. Dividing the sum of its diagonal (the correct predictions) by the total count gives the accuracy.

n_estimator_params = range(1, 30)
confusion_matrixes = {}
for n_estimator in n_estimator_params:
    rf = RandomForestClassifier(n_estimators=n_estimator,n_jobs=-1)
    rf.fit(X[training], y[training])
    confusion_matrixes[n_estimator] = confusion_matrix(y[~training], rf.predict(X[~training]))
    accuracy = lambda x: np.trace(x) / np.sum(x, dtype=float)
    confusion_matrixes[n_estimator] = accuracy(confusion_matrixes[n_estimator])
    
accuracy_series = pd.Series(confusion_matrixes)

f, ax = plt.subplots(figsize=(7, 5))
accuracy_series.plot(kind='bar', ax=ax, color='k', alpha=.75)
ax.grid()
ax.set_title("Accuracy by Number of Estimators")
ax.set_ylim(0, 1) # we want the full scope
ax.set_ylabel("Accuracy")
ax.set_xlabel("Number of Estimators")


The effect isn't dramatic, but accuracy does trend upward toward the right. We can also use the grid search from the previous article to look for the best parameters.

from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
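# sklearn.model_selection replaces the old sklearn.grid_search module,
# which was removed in scikit-learn 0.20
from sklearn.model_selection import GridSearchCV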

pipeline = Pipeline([('clf', RandomForestClassifier())])
parameters = {
    'clf__n_estimators': (40,50,80, 100),
    'clf__max_depth': (50, 150, 250),
    'clf__min_samples_split': (1, 2, 3),  # recent scikit-learn versions reject 1 (the minimum is 2)
    'clf__min_samples_leaf': (1, 2, 3)
}
grid_search = GridSearchCV(pipeline,parameters,n_jobs=-1,verbose=1,scoring='f1')
grid_search.fit(X[training], y[training])
print('best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
    
predictions = grid_search.predict(X[~training])
print(classification_report(y[~training], predictions))

Fitting 3 folds for each of 108 candidates, totalling 324 fits


[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:  4.2min
[Parallel(n_jobs=-1)]: Done 324 out of 324 | elapsed:  7.1min finished


best score: 0.811
Best parameters set:
	clf__max_depth: 150
	clf__min_samples_leaf: 3
	clf__min_samples_split: 2
	clf__n_estimators: 100
             precision    recall  f1-score   support

          0       0.71      0.28      0.40       735
          1       0.69      0.93      0.80      1284

avg / total       0.70      0.70      0.65      2019

Oh dear, please forgive the running time above; the machine my company gave me only has two cores...

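Judging from the results above, the best cross-validated F1 score is 0.811. On the held-out data, 93% of the class 1 samples are found (recall), but only 69% of the class 1 predictions are correct (precision), and just 28% of the class 0 samples are found. The parameters are clearly not optimal yet and need further tuning.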
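Since the report mixes several metrics, it may help to see how precision, recall, and F1 are derived from raw counts. The counts below are hypothetical and are not taken from the run above; precision_recall_f1 is an illustrative helper, not a scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp)  # share of positive predictions that are right
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for illustration only
tp, fp, fn, tn = 8, 2, 4, 6
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(precision, recall, f1, accuracy)
```

F1 is the harmonic mean of precision and recall, so a model that finds most positives but labels too many samples as positive is held back by its precision.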

Summary #

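Today we took a detailed look at how to use random forests, covering some of the important parameters in scikit-learn and how to tune them. The algorithm itself is very easy to use; the real work lies in processing the data and extracting features. That work requires truly understanding the data and the scenario, so in future articles I'll write more about how to understand and process data.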

If you're interested, feel free to add me on WeChat or send me an email so we can discuss and learn together.