
Big Data Mining Assignment

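Big Data Mining and Machine Learning, Chapter 5
[Essay question]
Using the supplied credit data cs-training.csv, build classifiers and analyze them, then test them with cs-test.csv; Revolving is a categorical variable.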

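Requirements:
(1) First carry out a descriptive statistical analysis of the data.
(2) Build models with CART, C4.5, Bagging, AdaBoost, and random forest, and compare them.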

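1. Descriptive statistics:
library(mice)  # provides md.pattern()
cst=read.csv("d://cst.csv",header=T)
cst=cst[-1]  # drop the first column
cst$class=as.factor(cst$class)  # treat the response as a class label
table(cst$class)  # class counts
md.pattern(cst)  # missing-data pattern
set.seed(1234)
train=sample(nrow(cst),nrow(cst)/2)  # training rows; the split ratio is an assumption, the source does not give it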
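md.pattern() comes from the mice package. Where that package is not available, the same basic counts can be sketched in base R roughly as follows. The toy data frame and its column names are made up for illustration and are not the credit data.

```r
# Toy data frame standing in for cst (values are made up for illustration)
toy <- data.frame(
  class  = factor(c(0, 1, 0, 0, 1)),
  age    = c(45, NA, 38, 30, 49),
  income = c(9120, 2600, NA, NA, 3300)
)

# Class balance, analogous to table(cst$class)
class_counts <- table(toy$class)

# Number of missing values in each column
na_per_col <- colSums(is.na(toy))

# Number of rows with no missing value at all
n_complete <- sum(complete.cases(toy))

print(class_counts)
print(na_per_col)
print(n_complete)
```

Columns with many missing values are the ones that na.roughfix later fills in with medians or modes, so it is worth checking these counts before modelling.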
Classification tree
library(tree)
cs.tree=tree(class~.,cst[train,])  # fit on the training rows
summary(cs.tree)
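The summary output reports the training error (the misclassification error rate); a small residual mean deviance indicates that the tree fits the training set well.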

cs.test.pred=predict(cs.tree,cst[-train,],type="class")  # predict on the held-out rows
table(cs.test.pred,cst[-train,"class"])  # confusion table: predicted vs. actual
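A confusion table like the one above can be reduced to a single accuracy figure by dividing the diagonal (correct predictions) by the total count. The helper name accuracy and the toy labels below are hypothetical and used only for illustration.

```r
# Hypothetical helper: overall accuracy from a square confusion table
accuracy <- function(tab) sum(diag(tab)) / sum(tab)

# Toy predicted and actual labels (made up for illustration)
pred   <- factor(c(0, 0, 1, 1, 0, 1), levels = c(0, 1))
actual <- factor(c(0, 1, 1, 1, 0, 0), levels = c(0, 1))

# Rows are predictions, columns are true classes
tab <- table(pred, actual)
acc <- accuracy(tab)

print(tab)
print(acc)
```

Fixing the factor levels on both vectors keeps the table square even if one class is never predicted, which the diagonal sum relies on.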
Bagging
library(randomForest)
cst.bag=randomForest(class~.,cst[train,],na.action=na.roughfix,mtry=ncol(cst)-1)  # mtry = all predictors, i.e. bagging
cst.bag.pred=predict(cst.bag,cst[-train,])
table(cst.bag.pred,cst[-train,"class"])
The results are as follows:
From this table we obtain the model's prediction accuracy on the test set.
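Bagging fits the same learner on bootstrap resamples of the training rows and combines the predictions by majority vote; with mtry set to the number of predictors, randomForest considers every predictor at each split and therefore reduces to bagging. A minimal hand-rolled sketch of the idea follows, assuming rpart as the base learner and the built-in iris data as a stand-in for the credit data. The number of trees is an arbitrary choice.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1234)
# iris stands in for the credit data here (an assumption for illustration)
train <- sample(nrow(iris), 100)
n_trees <- 25  # hypothetical number of bootstrap trees

# Fit one classification tree per bootstrap resample of the training rows
# and collect its predictions on the held-out rows (one column per tree)
votes <- sapply(seq_len(n_trees), function(b) {
  boot <- sample(train, length(train), replace = TRUE)
  fit <- rpart(Species ~ ., data = iris[boot, ], method = "class")
  as.character(predict(fit, iris[-train, ], type = "class"))
})

# Majority vote across trees for each held-out row
bag_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

bag_tab <- table(bag_pred, iris[-train, "Species"])
bag_acc <- mean(bag_pred == iris[-train, "Species"])

print(bag_tab)
print(bag_acc)
```

A random forest differs only in that each split draws a random subset of predictors, which decorrelates the trees.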
Random forest
library(randomForest)
cst.rf=randomForest(class~.,cst[train,],na.action=na.roughfix,importance=T)
cst.rf.pred=predict(cst.rf,cst[-train,])
table(cst.rf.pred,cst[-train,"class"])
The results are as follows:
As can be seen, the random forest performs somewhat better than bagging, with a higher accuracy on the test set.
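The assignment also asks for C4.5 and AdaBoost, which the code above does not cover. One possible sketch uses J48 from the RWeka package (Weka's C4.5 implementation, which needs Java) and boosting from the adabag package. Both package choices, the iris stand-in data, and the mfinal value are assumptions rather than the author's setup; each model is skipped if its package is not installed.

```r
set.seed(1234)
# iris stands in for the credit data (an assumption for illustration)
train <- sample(nrow(iris), 100)
test_y <- iris[-train, "Species"]

# C4.5 via RWeka::J48; left as NA if RWeka (or Java) is unavailable
c45_acc <- NA
if (requireNamespace("RWeka", quietly = TRUE)) {
  c45_fit <- RWeka::J48(Species ~ ., data = iris[train, ])
  c45_pred <- predict(c45_fit, newdata = iris[-train, ])
  c45_acc <- mean(c45_pred == test_y)
}

# AdaBoost via adabag::boosting; mfinal is the number of boosting rounds
ada_acc <- NA
if (requireNamespace("adabag", quietly = TRUE)) {
  ada_fit <- adabag::boosting(Species ~ ., data = iris[train, ], mfinal = 10)
  ada_pred <- predict(ada_fit, newdata = iris[-train, ])$class
  ada_acc <- mean(ada_pred == as.character(test_y))
}

# Test-set accuracy of each method, for side-by-side comparison
print(c(C4.5 = c45_acc, AdaBoost = ada_acc))
```

Placing these accuracies next to those of the tree, bagging, and random forest models gives the comparison the assignment requires.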
