sklearn tree structure

This post analyzes how sklearn represents its tree structure. Tree visualization, rule extraction from trees, tree-based feature construction, and similar tasks all require pulling information out of this basic tree structure.

First, we construct a DecisionTreeClassifier.


In [15]:

from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd

# Fit a shallow tree on random data; the labels are pure noise, so the
# learned splits are arbitrary -- we only care about the tree's structure.
clf = DecisionTreeClassifier(max_depth=3)
x = np.random.sample((10000, 10))   # 10000 samples, 10 features in [0, 1)
y = np.random.randint(0, 2, 10000)  # binary labels
clf.fit(x, y)

Out[15]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')

Tree structure: clf.tree_

All of the tree's structural information is stored in the clf.tree_ attribute, which exposes the following attributes:

Attributes
----------
node_count : int
The number of nodes (internal nodes + leaves) in the tree.
Total number of nodes: internal nodes (including the root) plus leaf nodes.

capacity : int
The current capacity (i.e., size) of the arrays, which is at least as
great as `node_count`.
In a fitted tree this normally equals node_count: the arrays are over-allocated while the tree is being grown and trimmed back to node_count when fitting finishes.

max_depth : int
The maximal depth of the tree.
Depth of the tree; the root is defined to have depth 0.

children_left : array of int, shape [node_count]
children_left[i] holds the node id of the left child of node i.
For leaves, children_left[i] == TREE_LEAF. Otherwise,
children_left[i] > i. This child handles the case where
X[:, feature[i]] <= threshold[i].
Array indexed by node id (the root has id 0); children_left[i] is the id of node i's left child, or -1 (TREE_LEAF) if node i has no left child.

children_right : array of int, shape [node_count]
children_right[i] holds the node id of the right child of node i.
For leaves, children_right[i] == TREE_LEAF. Otherwise,
children_right[i] > i. This child handles the case where
X[:, feature[i]] > threshold[i].
Array indexed by node id (the root has id 0); children_right[i] is the id of node i's right child, or -1 (TREE_LEAF) if node i has no right child.

feature : array of int, shape [node_count]
feature[i] holds the feature to split on, for the internal node i.
Array recording the index of the feature each internal node splits on; for leaves the entry is undefined.

threshold : array of double, shape [node_count]
threshold[i] holds the threshold for the internal node i.
Array recording each internal node's split threshold: samples with feature value <= threshold go to the left child, those with value > threshold go to the right child.

value : array of double, shape [node_count, n_outputs, max_n_classes]
Contains the constant prediction value of each node.
Records, for every node (every node, not only the leaves), the number of training samples of each class.

impurity : array of double, shape [node_count]
impurity[i] holds the impurity (i.e., the value of the splitting
criterion) at node i.
Records the impurity of each node, computed with the splitting criterion (gini here). From a parent's impurity and its children's impurities one can derive the information gain of the split; the sketch after this list computes it for the root.

n_node_samples : array of int, shape [node_count]
n_node_samples[i] holds the number of training samples reaching node i.
Records the number of training samples reaching each node, i.e., the sum over classes of that node's row in the value array.

weighted_n_node_samples : array of double, shape [node_count]
weighted_n_node_samples[i] holds the weighted number of training samples
reaching node i.
Records the weighted number of samples reaching each node; with no sample weights this equals n_node_samples.
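
As a quick sanity check, here is a minimal sketch that inspects these attributes on the clf fitted above; the concrete values differ between runs because the training data is random:

t = clf.tree_  # the low-level Tree object

print(t.node_count, t.capacity, t.max_depth)  # e.g. 15 15 3 for a full depth-3 tree
print(t.children_left)       # left-child id per node, -1 for leaves
print(t.children_right)      # right-child id per node, -1 for leaves
print(t.feature)             # split-feature index per node, undefined for leaves
print(t.threshold)           # split thresholds, undefined for leaves
print(t.value[0])            # per-class sample counts at the root, shape (1, 2)
print(t.n_node_samples[0])   # 10000 -- every sample reaches the root

# Information gain of the root split, derived from parent and child
# impurities, weighted by the share of samples each child receives:
left, right = t.children_left[0], t.children_right[0]
w = t.weighted_n_node_samples
gain = t.impurity[0] - (w[left] / w[0]) * t.impurity[left] \
                     - (w[right] / w[0]) * t.impurity[right]
print(gain)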

From the children_left and children_right arrays the entire tree structure can be reconstructed, and the corresponding feature, threshold, and n_node_samples entries give each node's split condition and sample count; the sketch below walks the tree this way.
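
A minimal sketch of such a walk (print_tree is our own helper name, not a sklearn API):

def print_tree(tree, node=0, depth=0):
    """Recursively print one line per node of a fitted sklearn Tree."""
    indent = "    " * depth
    if tree.children_left[node] == -1:  # leaf: no children at all
        print(f"{indent}leaf id={node}, samples={tree.n_node_samples[node]}")
        return
    print(f"{indent}node id={node}: X[:, {tree.feature[node]}] "
          f"<= {tree.threshold[node]:.4f}, samples={tree.n_node_samples[node]}")
    print_tree(tree, tree.children_left[node], depth + 1)   # left: <= threshold
    print_tree(tree, tree.children_right[node], depth + 1)  # right: > threshold

print_tree(clf.tree_)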

One thing to note: in a decision tree there is never a parent with only a left child or only a right child, because such a split would serve no purpose. It follows that children_left[i] == children_right[i] == -1 identifies node i as a leaf.
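
Because of that invariant, testing either array alone is sufficient; for example, to collect all leaf ids:

leaves = np.where(clf.tree_.children_left == -1)[0]  # a node is a leaf iff it has no left child
print(leaves)
# Checking children_right instead yields exactly the same ids.
assert np.array_equal(leaves, np.where(clf.tree_.children_right == -1)[0])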