
Cluster Analysis (PPT Courseware)

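$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is ordinal: compute the ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, then treat $z_{if}$ as interval-scaled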
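The combined formula can be illustrated with a minimal Python sketch. The function name `mixed_dissimilarity`, the `kinds` and `spans` arguments, and the sample records are all hypothetical. Two assumptions go beyond the slide text: a numeric difference is normalized by the attribute's range (max minus min), and the indicator $\delta_{ij}^{(f)}$ is taken to be 0 when either value is missing and 1 otherwise.

```python
def mixed_dissimilarity(xi, xj, kinds, spans):
    """Combine per-attribute dissimilarities: sum(delta * d) / sum(delta).

    kinds[f] is "nominal", "numeric", or "ordinal".
    spans[f] is (max - min) for a numeric attribute, the number of levels
    M_f for an ordinal attribute (values are ranks 1..M_f), and is ignored
    for a nominal attribute.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, kind, span in zip(xi, xj, kinds, spans):
        if a is None or b is None:
            continue  # assumption: a missing value gives delta = 0
        if kind == "nominal":
            d = 0.0 if a == b else 1.0
        elif kind == "numeric":
            d = abs(a - b) / span  # assumption: range-normalized distance
        elif kind == "ordinal":
            # Map ranks to [0, 1], then treat them as interval-scaled.
            d = abs((a - 1) / (span - 1) - (b - 1) / (span - 1))
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        numerator += d
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no comparable attributes")
    return numerator / denominator


# Hypothetical records: (nominal code, numeric value, ordinal rank out of 3)
kinds = ("nominal", "numeric", "ordinal")
spans = (None, 42, 3)
print(round(mixed_dissimilarity(("A", 45, 3), ("B", 22, 1), kinds, spans), 3))  # 0.849
```

Skipping attributes with missing values keeps the result in [0, 1] without forcing an imputation step.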
      x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0
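The Euclidean distances between the four example points x1 = (1, 2), x2 = (3, 5), x3 = (2, 0), and x4 = (4, 5) can be recomputed with a short sketch. The helper names `euclidean` and `dissimilarity_matrix` are illustrative, and rounding to two decimals is an assumption made for display.

```python
import math

# The four example points (attribute1, attribute2).
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def dissimilarity_matrix(pts):
    """Return the lower-triangular dissimilarity matrix as a list of rows."""
    rows = []
    for i, a in enumerate(pts):
        rows.append([round(euclidean(a, b), 2) for b in pts[: i + 1]])
    return rows


matrix = dissimilarity_matrix(list(points.values()))
for name, row in zip(points, matrix):
    print(name, row)
```

Only the lower triangle is stored because the distance is symmetric and the diagonal is always 0.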
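Section 2: Measures of Similarity
1. Measuring similarity between samples
2. Measuring similarity between variables
Similarity Measures for Samples with Nominal Variables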
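Example: each student record has six attributes: gender (male or female); foreign language (English, Japanese, or Russian); major (statistics, accounting, or finance); occupation (teacher or non-teacher); residence (on campus or off campus); education (bachelor's degree or below).
Consider two students:
X1 = (male, English, statistics, non-teacher, off campus, bachelor's degree)′
X2 = (female, English, finance, teacher, off campus, below bachelor's degree)′
A variable on which the two samples take the same value is called matched; otherwise it is unmatched. Let m1 be the number of matched variables and m2 the number of unmatched variables. The distance between the two samples is then defined as the fraction of variables that are unmatched.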
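Chapter 5: Cluster Analysis
Section 1: Introduction
Section 2: Measures of Similarity
Section 3: Hierarchical Clustering
Section 4: K-Means Clustering
Section 5: K-Medoids Clustering
Section 6: R codes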
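Section 1: Introduction
"Birds of a feather flock together."
Unsupervised classification
Cluster analysis studies how to classify samples (or variables) quantitatively
Q-type clustering: classifies samples
R-type clustering: classifies variables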
Replace each $x_{if}$ by its rank $r_{if}$
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
Mixed-Type Attributes
A database may contain all attribute types: nominal, symmetric binary, asymmetric binary, numeric, ordinal
A weighted formula can be used to combine their effects
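Data Matrix and Dissimilarity Matrix
Data matrix: n data points with p dimensions
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$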
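Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
$r_{if} \in \{1, \ldots, M_f\}$
Map the range of each variable onto [0,1] by replacing the i-th object in the f-th variable with its normalized rank $z_{if}$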
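The rank mapping can be sketched in a few lines of Python, using the standard normalization z = (r - 1) / (M - 1) for ranks r in 1..M. The function name `normalized_ranks` and the level names are hypothetical.

```python
def normalized_ranks(values, ordered_levels):
    """Map ordinal values to z = (r - 1) / (M - 1), where r is the rank in 1..M."""
    m = len(ordered_levels)
    if m < 2:
        raise ValueError("need at least two ordered levels")
    rank = {level: r for r, level in enumerate(ordered_levels, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


# Hypothetical ordinal attribute with M = 3 levels, lowest first.
levels = ["fair", "good", "excellent"]
print(normalized_ranks(["excellent", "fair", "good", "excellent"], levels))
# [1.0, 0.0, 0.5, 1.0]
```

After this mapping the values can be compared with any distance intended for interval-scaled data.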
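Gender is a symmetric attribute
The remaining attributes are asymmetric binary
Let the values Y and P be 1, and the value N be 0
$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$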
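The three distances can be checked with a small sketch. Here q counts attributes that are 1 for both objects, r counts those that are 1 for the first and 0 for the second, s the reverse, and t those that are 0 for both. The asymmetric distance is (r + s) / (q + r + s). The function names are illustrative, and gender is left out because it is the symmetric attribute.

```python
def to_bits(record):
    """Code Y and P as 1 and N as 0."""
    return [1 if v in ("Y", "P") else 0 for v in record]


def contingency(a, b):
    """Return the counts (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for u, v in zip(a, b) if u == 1 and v == 1)
    r = sum(1 for u, v in zip(a, b) if u == 1 and v == 0)
    s = sum(1 for u, v in zip(a, b) if u == 0 and v == 1)
    t = sum(1 for u, v in zip(a, b) if u == 0 and v == 0)
    return q, r, s, t


def asymmetric_binary_distance(a, b):
    """(r + s) / (q + r + s); undefined when both vectors are all zeros."""
    q, r, s, _ = contingency(a, b)
    return (r + s) / (q + r + s)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 for each patient.
jack = to_bits("YNPNNN")
mary = to_bits("YNPNPN")
jim = to_bits("YPNNNN")

print(round(asymmetric_binary_distance(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_distance(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_distance(jim, mary), 2))   # 0.75
```

Ignoring t reflects the asymmetric case, where two negative results carry little information about similarity.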
f is numeric: use the normalized distance
For ordinal f, the normalized rank is
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
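Standardizing Numeric Data
Z-score:
x: raw score to be standardized, μ: population mean, σ: standard deviation
"-" when the raw score is below the mean, "+" when it is above
An alternative: calculate the mean absolute deviation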
Dissimilarity matrix: n data points, but registers only the distance
A triangular matrix
$$\begin{bmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & \vdots \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
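$$z = \frac{x - \mu}{\sigma}$$
The distance between the raw score and the population mean, in units of the standard deviation
Mean absolute deviation, where $m_f$ is the mean of variable $f$:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$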
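Both standardizations can be compared with a brief sketch. The function names and the sample data are hypothetical. Dividing the centered value by the mean absolute deviation, in the same way the z-score divides by σ, is an assumption about how $s_f$ is meant to be used.

```python
import statistics


def z_scores(xs):
    """Standardize with the population mean and population standard deviation."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def mean_absolute_deviation(xs):
    """s = (1/n) * sum(|x - m|), where m is the mean of xs."""
    m = statistics.fmean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)


def mad_scores(xs):
    """Assumed variant: center by the mean and scale by the mean absolute deviation."""
    m = statistics.fmean(xs)
    s = mean_absolute_deviation(xs)
    return [(x - m) / s for x in xs]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values: mean 5, sigma 2
print(z_scores(data))
print(mean_absolute_deviation(data))  # 1.5
print(mad_scores(data))
```

The mean absolute deviation does not square the deviations, so an outlier inflates it less than it inflates the standard deviation.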
$$d_{12} = \frac{m_2}{m_1 + m_2}$$
In this example m1 = 2 and m2 = 4, so the distance between X1 and X2 is 4/6 = 2/3.
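The matching-based distance for the two student records can be reproduced with a minimal sketch. The function name `matching_distance` and the English attribute labels are illustrative. `Fraction` is used so that the result prints exactly as 2/3.

```python
from fractions import Fraction


def matching_distance(a, b):
    """Return m2 / (m1 + m2), where m2 counts the positions at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    m2 = sum(1 for u, v in zip(a, b) if u != v)
    return Fraction(m2, len(a))


x1 = ("male", "English", "statistics", "non-teacher", "off campus", "bachelor")
x2 = ("female", "English", "finance", "teacher", "off campus", "below bachelor")
print(matching_distance(x1, x2))  # 2/3
```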
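Proximity Measures for Binary Attributes
Contingency table for binary data

                 Object j
                 1      0      sum
Object i    1    q      r      q+r
            0    s      t      s+t
            sum  q+s    r+t    p

Distance measure for symmetric binary variables:
$$d(i,j) = \frac{r + s}{q + r + s + t}$$
Distance measure for asymmetric binary variables:
$$d(i,j) = \frac{r + s}{q + r + s}$$
Jaccard coefficient (similarity measure for asymmetric binary variables):
$$sim_{Jaccard}(i,j) = \frac{q}{q + r + s}$$

Note: Jaccard coefficient is the same as “coherence”:
$$coherence(i,j) = \frac{sup(i,j)}{sup(i) + sup(j) - sup(i,j)} = \frac{q}{(q + r) + (q + s) - q}$$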
Dissimilarity Between Binary Attributes
Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
Example: Data Matrix and Dissimilarity Matrix
Data Matrix
point   attribute1   attribute2
x1      1            2
x2      3            5
x3      2            0
x4      4            5
Dissimilarity Matrix (with Euclidean Distance)

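Similarity and Dissimilarity
Similarity
  Numerical measure of how alike two data objects are
  Value is higher when objects are more alike
  Often falls in the range [0,1]
Dissimilarity (e.g., distance)
  Numerical measure of how different two data objects are
  Lower when objects are more alike
  Minimum dissimilarity is often 0
  Upper limit varies
Proximity refers to a similarity or dissimilarity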
