Modern Regression and Classification
[1] Data from D. Michie (1989), Problems of computer-aided concept formation. In Applications of Expert Systems 2, ed. J. R. Quinlan, Turing Institute Press / Addison-Wesley, pp. 310–333.
Decision trees: CP is not the Cp of regression
For decision trees, CP stands for complexity parameter, which is different from the Cp statistic used in regression! Specifically, use printcp() to examine the cross-validated error results, select the complexity parameter associated with the minimum error, and pass it to the prune() function. Alternatively, the code fragment fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"] automatically selects the complexity parameter associated with the smallest cross-validated error. Thanks to HSAUR for this idea.
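The printcp() / prune() workflow described above can be sketched as follows. This is a minimal illustration on the kyphosis data that ships with rpart; the seed value and the names best_cp and pruned are assumptions made for the example, not part of the original slides.

```r
library(rpart)  # provides rpart(), printcp(), prune() and the kyphosis data

# The xerror column comes from internal cross-validation, so fix the seed
# (an arbitrary choice here) to make the CP table reproducible.
set.seed(1)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Table of CP, nsplit, rel error, xerror and xstd for each candidate subtree.
printcp(fit)

# Pick the CP value with the smallest cross-validated error
# (which.min returns the first minimum if several rows tie).
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]

# Prune the full tree back to that complexity.
pruned <- prune(fit, cp = best_cp)
print(pruned)
```

Because the cross-validation folds are random, a different seed may select a different CP and therefore a different pruned tree.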
[Figure: classification trees for the kyphosis data; splits on Start and Age, leaves labelled absent/present.]
rpart.plot(kyphosis.rp,type=2,extra=6 )
[Figure: rpart.plot output for kyphosis.rp; root split Start>=12, leaves labelled absent/present.]
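Example 10.1 (data: shuttle.txt)
# Confusion table on the training rows (tsamp, the rows used to fit b)
t(table(predict(b, shuttle[tsamp,], type="class"), shuttle[tsamp,7]))
# Confusion table on the held-out rows (samp)
t(table(predict(b, shuttle[samp,], type="class"), shuttle[samp,7]))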
library(rpart)
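Prediction (2)
# kyphosis.rp was fitted on rows 1:70, so rows 71:81 are new to the model
kyphosis1 <- kyphosis[71:81, ]
predict(kyphosis.rp, kyphosis1, type="class")
# Predicted class against observed class for the held-out rows
table(predict(kyphosis.rp, kyphosis1, type="class"), kyphosis[71:81,1])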
[Figure: three kyphosis trees side by side, with root splits Start>=8.5, Start>=12.5 and Start>=8.5; leaves show absent/present counts.]
Tower of Babel
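Decision trees: classification trees and regression trees
Example (data: shuttle.txt)
library(MASS)
shuttle[1:10,]
This dataset concerns the decision whether to land the US space shuttle automatically under various conditions [1]. It has 256 rows and 7 columns. The first six columns are categorical explanatory variables and the last column is the response. The explanatory variables are stability (stability, values stab / xstab), size of error (error, values MM / SS / LX / XL), sign (sign, values pp / nn), wind direction (wind, values head / tail), wind strength (magn, values Light / Medium / Strong / Out), and visibility (vis, values yes / no). The response is whether the autolander is used (use, values auto / noauto).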
predict(fit, type="prob")   # class probabilities (default)
predict(fit, type="vector") # level numbers
predict(fit, type="class")  # factor
predict(fit, type="matrix") # level number, class frequencies, probabilities
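To make the four return types concrete, here is a small sketch on the kyphosis data from rpart. The variable names p_prob, p_vector, p_class and p_matrix are illustrative assumptions; only predict() and its type argument come from the slides.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

p_prob   <- predict(fit, type = "prob")    # matrix, one column per class
p_vector <- predict(fit, type = "vector")  # numeric level numbers (1, 2)
p_class  <- predict(fit, type = "class")   # factor: absent / present
p_matrix <- predict(fit, type = "matrix")  # level number, counts, probabilities

# Each row of the probability matrix sums to 1.
head(rowSums(p_prob))

# The level numbers index into the factor levels of the response,
# so they describe the same prediction as the factor output.
lev <- levels(kyphosis$Kyphosis)
all(lev[p_vector] == as.character(p_class))

dim(p_prob)    # rows = observations, columns = classes
dim(p_matrix)  # wider: combines the pieces listed above
```

With no newdata argument, predict() returns fitted values for the rows used to grow the tree.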
Terminology differs from field to field
• A variable is also called an attribute, feature, characteristic, or field in computing and database work.
• A quantitative variable is also called a "measure" (指标), and a qualitative variable a "dimension" (维度).
• An observation is also called a record, object, point, vector, pattern, event, case, instance, sample, item, or entity.
• Be careful!
Prediction (2)
library(rpart)
library(rpart.plot)
data(kyphosis)
# Fit on the first 70 rows only; rows 71:81 are kept for prediction
kyphosis.rp <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis, subset=1:70)
kyphosis.rp
plot(kyphosis.rp); text(kyphosis.rp, use.n=TRUE)
Example 10.2 (data from Example 9.5: iris.txt)
[Figure: classification tree for the iris data; splits Petal.Length< 2.45 and Petal.Width< 1.75; leaves setosa, versicolor, virginica.]
library(MASS); m=150; set.seed(10)
# Draw 25 rows from each species as the held-out set
samp <- c(sample(1:50,25), sample(51:100,25), sample(101:150,25))
tsamp=setdiff(1:m, samp)   # remaining 75 rows are used for fitting
library(rpart.plot)
(b=rpart(Species~., iris, subset=tsamp))
plot(b); text(b, use.n=TRUE)
Prediction
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
par(mfrow=c(1,3), xpd=NA)   # three panels side by side
rpart.plot(fit, type=2, extra=6)
rpart.plot(fit2, type=2, extra=6)
rpart.plot(fit3, type=2, extra=6)
par(mfrow=c(1,1))           # restore the single-panel layout
Example (data: shuttle.txt)
[Figure: classification tree for the shuttle data; splits vis=a, error=c, stability=a; leaves auto/noauto.]
library(MASS); shuttle[1:10,]
m=256; set.seed(2)
samp=sample(1:m, floor(m/10))   # held-out rows (about 10%)
tsamp=setdiff(1:m, samp)        # training rows
library(rpart.plot)
(b=rpart(use~., shuttle, subset=tsamp))
plot(b); text(b, use.n=TRUE)
# Confusion table on the training rows
t(table(predict(b, shuttle[tsamp,], type="class"), shuttle[tsamp,7]))
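The confusion tables can be summarised as a single misclassification rate for the training rows and for the held-out rows. The sketch below repeats the split from the example; error_rate is a hypothetical helper written for this illustration and is not a function from rpart or MASS.

```r
library(MASS)   # shuttle data
library(rpart)

m <- nrow(shuttle)                     # 256 rows
set.seed(2)
samp  <- sample(1:m, floor(m / 10))    # held-out rows
tsamp <- setdiff(1:m, samp)            # training rows

b <- rpart(use ~ ., data = shuttle, subset = tsamp)

# Hypothetical helper: fraction of the given rows that the tree misclassifies.
error_rate <- function(model, rows) {
  pred <- predict(model, shuttle[rows, ], type = "class")
  mean(pred != shuttle$use[rows])
}

train_err <- error_rate(b, tsamp)
test_err  <- error_rate(b, samp)
c(train = train_err, test = test_err)
```

The held-out set here has only 25 rows, so the test error rate is a rough estimate.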
library(rpart.plot)
Pruning and plotting
# Default settings
fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
# Priors of 0.65 / 0.35 and information (entropy) splitting
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis, parms=list(prior=c(.65,.35), split='information'))
# Larger complexity parameter, which gives a smaller tree
fit3 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis, control=rpart.control(cp=.05))
par(mfrow=c(1,3), xpd=NA)
plot(fit); text(fit, use.n=TRUE)
plot(fit2); text(fit2, use.n=TRUE)
plot(fit3); text(fit3, use.n=TRUE)
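Besides plotting, the three fits can be compared numerically by counting terminal nodes. This is a rough sketch: n_leaves is a hypothetical helper introduced here, which relies on rpart marking terminal nodes as "<leaf>" in the var column of the fitted object's frame.

```r
library(rpart)

fit  <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              parms = list(prior = c(.65, .35), split = "information"))
fit3 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              control = rpart.control(cp = .05))

# Hypothetical helper: number of terminal nodes in an rpart tree.
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

sizes <- c(fit = n_leaves(fit), fit2 = n_leaves(fit2), fit3 = n_leaves(fit3))
sizes
```

Raising cp from the default 0.01 to 0.05 can only remove splits, so fit3 should never have more leaves than fit.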