
Data Mining: Decision Trees


Handling Continuous Attributes

A continuous attribute is automatically discretized into two groups: each candidate cut point defines a binary split (for example Age <= 39 vs. Age > 39, or Age <= 32.5 vs. Age > 32.5), and the cut point with the largest Gain Ratio is chosen.
Practice in LAB: Tree Pruning

Example tree, with class counts at each node: Node 0 (Yes:9, No:5), Node 1 (Yes:4, No:0), Node 2 (Yes:5, No:5), Node 3 (Yes:1, No:4), Node 4 (Yes:4, No:1).
Handling Continuous Attributes: Best Split

I(3,0) = 0, I(3,4) = 0.99
Tree Pruning in C4.5: An Example

U25%(0,1) = 0.750, U25%(0,6) = 0.206, U25%(0,9) = 0.143, U25%(1,16) = 0.157
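These U25%(E, N) values are upper confidence limits (confidence c = 25%) on the true error rate of a leaf that makes E errors on N training cases. A minimal sketch that reproduces them, assuming the exact binomial (Clopper-Pearson) upper bound, which C4.5's published figures approximate: the zero-error cases match exactly, while U25%(1,16) comes out near 0.16, close to the 0.157 quoted above.

```python
from scipy.stats import beta

def upper_error_limit(E, N, cf=0.25):
    """Upper confidence limit U_cf(E, N) on the true error rate when E errors
    are observed in N cases (exact binomial / Clopper-Pearson bound)."""
    if E >= N:
        return 1.0
    return beta.ppf(1.0 - cf, E + 1, N - E)

for E, N in [(0, 1), (0, 6), (0, 9), (1, 16)]:
    print(f"U25%({E},{N}) = {upper_error_limit(E, N):.3f}")
```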
Classification by Decision Tree

Four decision tree algorithms are provided by Clementine:

CHAID, CART, C4.5, C5.0

They are all top-down decision tree generation algorithms
Tree Pruning in C4.5
Should this node be deleted (pruned)?
A Formula for Estimating the Error Rate at the Node


N is the number of examples
E is the number of errors
f = E/N is the observed error rate
z is the number of standard deviations corresponding to the confidence c; for c = 25%, z = 0.69
e is the estimated error rate
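The formula itself did not survive in this copy of the slides. The upper-bound estimate used by C4.5, and consistent with the worked numbers later in this document (it reproduces the child-node estimates 0.47, 0.72, 0.47 and their weighted combination 0.51), is:

$$e = \frac{f + \dfrac{z^2}{2N} + z\sqrt{\dfrac{f}{N} - \dfrac{f^2}{N} + \dfrac{z^2}{4N^2}}}{1 + \dfrac{z^2}{N}}$$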
Example: choosing a cut point for the continuous attribute Age (class label: Gender)

Age:    27  30  35  38  40  41  42  43  45  55
Gender:  M   M   M   F   F   F   M   M   M   F

Candidate cut points shown include Age <= 28.5 / > 28.5, <= 32.5 / > 32.5, <= 36.5 / > 36.5, and <= 40.5 / > 40.5 (midpoints between adjacent Age values).

Overall class distribution: M:6, F:4, so I(6,4) = 0.97.

Best split: Age <= 36.5 (M:3, F:0) vs. Age > 36.5 (M:3, F:4)
Entropy = 0.3*I(3,0) + 0.7*I(3,4) = 0.69
Information Gain = 0.97 - 0.69 = 0.28
Information Value = I(3,7) = 0.88
Gain Ratio = 0.28 / 0.88 = 0.32
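A minimal sketch of this search, assuming the ten (Age, Gender) records shown above: it enumerates all midpoint cut points and reports the one with the largest Gain Ratio.

```python
import math

# Toy data from the slide: (Age, Gender)
data = [(27, 'M'), (30, 'M'), (35, 'M'), (38, 'F'), (40, 'F'),
        (41, 'F'), (42, 'M'), (43, 'M'), (45, 'M'), (55, 'F')]

def entropy(labels):
    """I(p, n): entropy of a class distribution given as a list of labels."""
    total = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        ent -= p * math.log2(p)
    return ent

def gain_ratio(data, cut):
    """Information gain divided by split information for a binary split at `cut`."""
    labels = [y for _, y in data]
    left = [y for x, y in data if x <= cut]
    right = [y for x, y in data if x > cut]
    n = len(labels)
    split_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(labels) - split_entropy
    split_info = entropy(['L'] * len(left) + ['R'] * len(right))  # I(|left|, |right|)
    return gain / split_info if split_info > 0 else 0.0

# Candidate cut points: midpoints between adjacent distinct Age values
ages = sorted({x for x, _ in data})
cuts = [(a + b) / 2 for a, b in zip(ages, ages[1:])]

best = max(cuts, key=lambda c: gain_ratio(data, c))
print(best, round(gain_ratio(data, best), 2))  # expect the best cut at 36.5 with gain ratio ~0.32
```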

It cannot handle continuous attributes

It cannot handle missing attribute values

It does not prune the tree to handle noise
Attribute Selection in C4.5
How to Use a Tree

Directly


Test the attribute values of the unknown sample against the tree; a path is traced from the root to a leaf, which holds the class label. Indirectly, the decision tree is converted to classification rules, and one rule is created for each path from the root to a leaf.
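A small sketch of both uses. The nested-dict representation and the tree below are illustrative assumptions, not the deck's own data structure; the tree is one consistent with the Eye/Hair/Oriental training table that appears later in this document.

```python
# Hypothetical nested-dict decision tree: internal nodes map an attribute name
# to {value: subtree}; leaves are class labels.
tree = {"Eye": {"Black": "Yes",
                "Blue": "No",
                "Brown": {"Hair": {"Gold": "No", "Black": "Yes", "White": "Yes"}}}}

def classify(tree, sample):
    """Directly: trace a path from the root to a leaf holding the label."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[sample[attr]]
    return tree

def to_rules(tree, conditions=()):
    """Indirectly: one IF-THEN rule per root-to-leaf path."""
    if not isinstance(tree, dict):
        return [(conditions, tree)]
    attr, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules.extend(to_rules(subtree, conditions + ((attr, value),)))
    return rules

print(classify(tree, {"Eye": "Brown", "Hair": "Gold", "Height": "Short"}))  # -> "No"
for conds, label in to_rules(tree):
    print("IF " + " AND ".join(f"{a}={v}" for a, v in conds) + f" THEN Oriental={label}")
```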


Tree Pruning (Avoid Overfitting Problem)

Training data may contain noise.

No.  Eye    Hair   Height  Oriental
1    Black  Black  Short   Yes
2    Black  White  Tall    Yes
3    Black  White  Short   Yes
4    Black  Black  Tall    Yes
5    Brown  Black  Tall    Yes
6    Brown  White  Short   Yes
7    Blue   Gold   Tall    No
8    Blue   Gold   Short   No
9    Blue   White  Tall    No
10   Blue   Black  Short   No
11   Brown  Gold   Short   No
Testing Phase
Decision Tree

Testing (Classification)

Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
Two approaches:
1. Pruning technique: bottom-up (C5/CART)
2. Bonsai technique: top-down (CHAID)
Decision Tree Generation Algorithm: ID3
Entropy
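In the I(p, n) notation used on these slides, where a node contains p examples of one class and n of the other, entropy is defined as:

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

For example, I(6,4) = 0.97 and I(3,4) = 0.99, as used in the continuous-attribute example above.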
Decision Tree Algorithm: ID3
Decision Tree
李御玺 (Yue-Shi Lee), Department of Computer Science and Information Engineering, Ming Chuan University
leeys@.tw
Decision Tree
Learning Phase
Decision Tree

Learning


The target attribute is credit_rating.
Training data are analyzed by a decision tree algorithm.
The classifier is represented in the form of classification rules.



Watch the game and home team wins and out with friends then beer
Watch the game and home team wins and sitting at home then diet soda
Watch the game and home team loses and out with friends then beer
Watch the game and home team loses and sitting at home then milk
Watch the game and out with friends then beer
Watch the game and home team wins and sitting at home then diet soda
Watch the game and home team loses and sitting at home then milk

Optimization for these rules


Decision Tree Generation Algorithm: ID3

Prefers attributes with many values

All attributes are assumed to be categorical (discretized)

These measures are also called goodness functions and are used to select the attribute to split on at a tree node during the tree generation phase
Tree Pruning (Avoid Overfitting Problem)
Decision Tree Algorithm: ID3
I(2,3) = 0.971, I(3,2) = 0.971, I(4,0) = 0
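For example, the first of these values works out as:

$$I(2,3) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.971$$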
Decision Tree Algorithm: ID3
Information Gain
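With I(p, n) as above, the standard ID3 quantities for a categorical attribute A whose i-th value covers p_i positive and n_i negative examples are:

$$E(A) = \sum_i \frac{p_i + n_i}{p + n}\, I(p_i, n_i), \qquad \mathrm{Gain}(A) = I(p, n) - E(A)$$

ID3 splits on the attribute with the largest Gain(A).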
Decision Tree Algorithm: ID3
Decision Tree Algorithm: ID3
This formula can be used to estimate the true error rate at the node.
Prune or Retain the Subtree
Node itself: E = 5, N = 14, f = 0.36, e = 0.46
Its three children: E = 2, N = 6, f = 0.33, e = 0.47; E = 1, N = 2, f = 0.5, e = 0.72; E = 2, N = 6, f = 0.33, e = 0.47
Weighted estimate for the expanded subtree: 0.47*6/14 + 0.72*2/14 + 0.47*6/14 = 0.51
Since the estimated error rate without expanding this node (0.46) is lower than the rate after expansion (0.51), the final decision is not to expand: the subtree is pruned.
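A minimal sketch of this check, assuming the C4.5 error-estimate formula given earlier with z = 0.69 (c = 25%). It reproduces the children's estimates (0.47, 0.72, 0.47) and their weighted combination (0.51); the parent node comes out at about 0.45, slightly below the 0.46 printed on the slide, and in either case below 0.51, so the subtree is pruned.

```python
import math

def estimated_error(E, N, z=0.69):
    """Pessimistic estimate e of the true error rate at a node, given E
    observed errors out of N training examples (z = 0.69 for c = 25%)."""
    f = E / N
    numerator = f + z * z / (2 * N) + z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N))
    return numerator / (1 + z * z / N)

# Parent node: E = 5 errors out of N = 14 examples
parent = estimated_error(5, 14)

# Its three children, each weighted by the number of examples it covers
children = [(2, 6), (1, 2), (2, 6)]
subtree = sum(estimated_error(E, N) * N for E, N in children) / 14

# Prune (replace the subtree by a leaf) when the single-node estimate is not worse
print(f"parent e = {parent:.2f}, subtree e = {subtree:.2f}")
print("prune subtree" if parent <= subtree else "keep subtree")
```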

Indirectly

Generating Classification Rules