R: Classical decision trees with rpart
2016-12-28 14:53
Implementing a decision tree in R
Code:
> data('bodyfat', package = 'TH.data')
> set.seed(1234)
> # Randomly assign about 70% of the rows to training and 30% to testing
> ind <- sample(2, nrow(bodyfat), replace = TRUE, prob = c(0.7, 0.3))
> bodyfat.train <- bodyfat[ind == 1, ]
> bodyfat.test <- bodyfat[ind == 2, ]
> library(rpart)
> myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth
> # minsplit = 10: a node needs at least 10 observations before a split is attempted
> bodyfat_rpart <- rpart(myFormula, data = bodyfat.train,
+                        control = rpart.control(minsplit = 10))
> attributes(bodyfat_rpart)
> print(bodyfat_rpart$cptable)
          CP nsplit  rel error    xerror       xstd
1 0.67272638      0 1.00000000 1.0687268 0.19588877
2 0.09390665      1 0.32727362 0.4700500 0.11234371
3 0.06037503      2 0.23336696 0.4399747 0.09718181
4 0.03420446      3 0.17299193 0.3739830 0.09705168
5 0.01708278      4 0.13878747 0.3316298 0.07557418
6 0.01695763      5 0.12170469 0.3202835 0.06913430
7 0.01007079      6 0.10474706 0.3273852 0.06873450
8 0.01000000      7 0.09467627 0.3251986 0.06926171
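Columns: CP is the complexity parameter; nsplit is the number of splits; rel error is the training-set error of each subtree, scaled so that the root node has error 1; xerror is the cross-validated error on the same scale; xstd is the standard error of the cross-validated error.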
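As a rough check on what the CP column means, the sketch below re-enters the printed values by hand (copied from the table above, not recomputed from the data) and compares each CP with the drop in rel error per additional split between consecutive rows. The names cpt and rel_error are illustrative choices, not part of rpart.

```r
# Values copied by hand from the printed cptable above
cpt <- data.frame(
  CP        = c(0.67272638, 0.09390665, 0.06037503, 0.03420446,
                0.01708278, 0.01695763, 0.01007079, 0.01000000),
  nsplit    = 0:7,
  rel_error = c(1.00000000, 0.32727362, 0.23336696, 0.17299193,
                0.13878747, 0.12170469, 0.10474706, 0.09467627)
)

n <- nrow(cpt)

# Decrease in relative error per extra split, from one row to the next
drop_per_split <- -diff(cpt$rel_error) / diff(cpt$nsplit)

# Largest disagreement between that decrease and the CP of the earlier row;
# the last row is excluded because its CP is just the cp stopping threshold
max_gap <- max(abs(drop_per_split - cpt$CP[-n]))
print(max_gap)
```

In this table the two quantities agree up to rounding, so each CP value can be read as the improvement in relative training error bought by the next split.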
> print(bodyfat_rpart)
n= 56
node), split, n, deviance, yval
* denotes terminal node
 1) root 56 7265.0290000 30.94589
   2) waistcirc< 88.4 31 960.5381000 22.55645
     4) hipcirc< 96.25 14 222.2648000 18.41143
       8) age< 60.5 9 66.8809600 16.19222 *
       9) age>=60.5 5 31.2769200 22.40600 *
     5) hipcirc>=96.25 17 299.6470000 25.97000
      10) waistcirc< 77.75 6 30.7345500 22.32500 *
      11) waistcirc>=77.75 11 145.7148000 27.95818
        22) hipcirc< 99.5 3 0.2568667 23.74667 *
        23) hipcirc>=99.5 8 72.2933500 29.53750 *
   3) waistcirc>=88.4 25 1417.1140000 41.34880
     6) waistcirc< 104.75 18 330.5792000 38.09111
      12) hipcirc< 109.9 9 68.9996200 34.37556 *
      13) hipcirc>=109.9 9 13.0832000 41.80667 *
     7) waistcirc>=104.75 7 404.3004000 49.72571 *
> plot(bodyfat_rpart)
> text(bodyfat_rpart, use.n = TRUE)
(Figure omitted)
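Each node line in the printout lists the node number, the split, n, the deviance, and yval. For a regression tree like this one, deviance is the sum of squared deviations of DEXfat from the node mean, and yval is that mean. The sketch below is only a worked check on the printed numbers for the root and its two children; the variable names are made up for illustration and nothing is recomputed from the data.

```r
# Numbers copied by hand from the printed tree (nodes 1, 2 and 3)
root  <- list(n = 56, deviance = 7265.029, yval = 30.94589)
left  <- list(n = 31, deviance = 960.5381, yval = 22.55645)  # waistcirc <  88.4
right <- list(n = 25, deviance = 1417.114, yval = 41.34880)  # waistcirc >= 88.4

# The root mean is the n-weighted average of the two child means
weighted_mean <- (left$n * left$yval + right$n * right$yval) / root$n

# Share of the root deviance removed by the first split
improvement <- 1 - (left$deviance + right$deviance) / root$deviance

print(weighted_mean)
print(improvement)
```

The second value should match the CP in the first row of the cptable up to rounding, which is one way to see that CP is measured relative to the root deviance.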
Select the tree with the smallest cross-validated prediction error (xerror) and prune to it:
> opt <- which.min(bodyfat_rpart$cptable[, 'xerror'])
> cp <- bodyfat_rpart$cptable[opt, 'CP']
> bodyfat_prune <- prune(bodyfat_rpart, cp = cp)
> print(bodyfat_prune)
n= 56
node), split, n, deviance, yval
* denotes terminal node
 1) root 56 7265.02900 30.94589
   2) waistcirc< 88.4 31 960.53810 22.55645
     4) hipcirc< 96.25 14 222.26480 18.41143
       8) age< 60.5 9 66.88096 16.19222 *
       9) age>=60.5 5 31.27692 22.40600 *
     5) hipcirc>=96.25 17 299.64700 25.97000 *
   3) waistcirc>=88.4 25 1417.11400 41.34880
     6) waistcirc< 104.75 18 330.57920 38.09111
      12) hipcirc< 109.9 9 68.99962 34.37556 *
      13) hipcirc>=109.9 9 13.08320 41.80667 *
     7) waistcirc>=104.75 7 404.30040 49.72571 *
> plot(bodyfat_prune, margin = 0.1)
> text(bodyfat_prune, all = TRUE, use.n = TRUE)
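To see why the pruned tree ends up with six terminal nodes, the selection step can be replayed on the printed cptable. This is a minimal sketch: the matrix is typed in by hand from the output shown earlier (it is not the live bodyfat_rpart$cptable), and the name leaves is an illustrative addition.

```r
# cptable columns copied by hand from the printed output
cptable <- cbind(
  CP     = c(0.67272638, 0.09390665, 0.06037503, 0.03420446,
             0.01708278, 0.01695763, 0.01007079, 0.01000000),
  nsplit = 0:7,
  xerror = c(1.0687268, 0.4700500, 0.4399747, 0.3739830,
             0.3316298, 0.3202835, 0.3273852, 0.3251986)
)

# Same selection rule as in the session: the row with the smallest xerror
opt <- which.min(cptable[, "xerror"])
cp  <- unname(cptable[opt, "CP"])

# A binary tree with k splits has k + 1 terminal nodes
leaves <- unname(cptable[opt, "nsplit"]) + 1

print(opt)
print(cp)
print(leaves)
```

Pruning at this cp keeps the splits of the selected row and collapses the later ones, which is why node 5 becomes a terminal node in the pruned printout.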
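Use the decision tree model to predict on the test set and compare the predictions with the observed values. In the figure, abline() draws the diagonal line. For a good predictive model, most points should fall on the diagonal or as close to it as possible.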
> DEXfat_pred <- predict(bodyfat_prune, newdata = bodyfat.test)
> xlim <- range(bodyfat$DEXfat)
> plot(DEXfat_pred ~ DEXfat, data = bodyfat.test, xlab = 'Observed',
+      ylab = 'Predicted', ylim = xlim, xlim = xlim)
> # Diagonal reference line: intercept 0, slope 1
> abline(a = 0, b = 1)
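Closeness to the diagonal can also be summarized as a single number. The sketch below uses the root mean squared error (RMSE), which is an addition here and not part of the original session. The helper name rmse and the two short vectors are made-up values for illustration; in the session above they would be bodyfat.test$DEXfat and DEXfat_pred.

```r
# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  sqrt(mean((observed - predicted)^2))
}

# Hypothetical values, only to show the call
observed  <- c(20, 30, 40)
predicted <- c(22, 29, 43)

err <- rmse(observed, predicted)
print(err)
```

An RMSE of 0 would mean every point lies exactly on the diagonal; larger values mean the points sit farther from it on average.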