
Data Mining, Chapter 3: Summary


The Iris Data Set
Many of the exploratory data techniques are illustrated with the Iris Plant data set, which can be obtained from the UCI Machine Learning Repository (/~mlearn/MLRepository.html). The data set comes from the statistician R. A. Fisher. It has three flower types (classes): Setosa, Virginica, and Versicolour. It has four (non-class) attributes: sepal width and length, and petal width and length.
$$\mathrm{frequency}(v_i) = \frac{\text{number of objects with attribute value } v_i}{m}$$
For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.
Mode: the mode of an attribute is its most frequent value.
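The frequency and mode definitions can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names `frequencies` and `mode` and the toy `species` list are assumptions made up for the example, and ties for the mode are broken in favor of the value seen first.

```python
from collections import Counter


def frequencies(values):
    """Map each distinct value to the fraction of objects that take it."""
    m = len(values)  # m = number of objects
    counts = Counter(values)
    return {v: c / m for v, c in counts.items()}


def mode(values):
    """Return the most frequent value (the first one seen wins ties)."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical toy sample, invented for illustration only.
species = ["setosa", "virginica", "setosa", "versicolour", "setosa", "virginica"]
freq = frequencies(species)
most_frequent = mode(species)
```

The frequencies of all values sum to 1, so they can be read directly as proportions of the data set.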
Tuesday, September 29, 2020
Introduction to Data Mining
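Percentiles
Percentiles are used for ordinal or continuous attributes.
Given an ordinal or continuous attribute $x$ and a number $p$ between 0 and 100, the $p$-th percentile $x_p$ is a value of $x$ such that $p\%$ of the observed values of $x$ are less than $x_p$.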
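There are several conventions for computing a percentile from a finite sample. The sketch below uses the nearest-rank convention, which is an assumption chosen for simplicity and not necessarily the one the slides intend: it returns the smallest observed value such that at least p% of the observations are less than or equal to it. The function name `percentile_nearest_rank` is hypothetical.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the p-th percentile of values using the nearest-rank rule."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # Rank of the smallest value covering at least p% of the observations.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical sample: the integers 1..10 in arbitrary order.
sample = [7, 2, 10, 4, 1, 9, 3, 8, 6, 5]
p25 = percentile_nearest_rank(sample, 25)
p50 = percentile_nearest_rank(sample, 50)
p75 = percentile_nearest_rank(sample, 75)
```

Other conventions interpolate between neighboring values, so results on small samples can differ slightly between tools.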
Most summary statistics can be calculated in a single pass through the data
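As one illustration of a single-pass computation, the sketch below keeps a running count, mean, and sum of squared deviations, and derives the standard deviation at the end. It uses Welford's update rule, which is a technique swapped in for this example and is not named in the slides. The function name `single_pass_stats` is hypothetical, and the population (divide by n) form of the variance is an assumption.

```python
import math


def single_pass_stats(stream):
    """Return (count, mean, population std) after reading each value once."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n == 0:
        raise ValueError("stream must contain at least one value")
    return n, mean, math.sqrt(m2 / n)


# Hypothetical sample, invented for illustration only.
count, mean, std = single_pass_stats(iter([2, 4, 4, 4, 5, 5, 7, 9]))
```

Because the input is consumed as an iterator, the data never has to be held in memory at once. Percentiles are a notable exception: computing them exactly generally requires sorting or storing the values.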
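Frequency and Mode
Frequency and mode are used for discrete attributes.
Frequency:
Given a categorical attribute $x$ that takes values in $\{v_1, \ldots, v_i, \ldots, v_k\}$ and a set of $m$ objects, the frequency of a value $v_i$ is defined as the fraction of the $m$ objects that take that value.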
Key motivations of data exploration include helping to select the right tool for preprocessing or analysis, and making use of humans' abilities to recognize patterns. People can recognize patterns not captured by data analysis tools.
In our discussion of data exploration, we focus on summary statistics, visualization, and Online Analytical Processing (OLAP).
3.1 The Iris Data Set
Data Exploration Techniques
In EDA, as originally defined by Tukey, the focus was on visualization, and clustering and anomaly detection were viewed as exploratory techniques. In data mining, clustering and anomaly detection are major areas of interest and are not thought of as merely exploratory.
Data exploration is related to the area of Exploratory Data Analysis (EDA), created by the statistician John Tukey. Tukey's other contributions include the FFT and the terms "bit" and "software". The seminal book is Exploratory Data Analysis by Tukey. A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook (/div898/handbook/index.htm).
3.2 Summary Statistics
Summary Statistics
Summary statistics are numbers that summarize properties of the data
Summarized properties include frequency, location, and spread. For example, the mean summarizes location and the standard deviation summarizes spread.
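Introduction to Data Mining
By Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Pearson Education Ltd. Translated by 范明 (Fan Ming) et al. Published by 人民邮电出版社 (Posts & Telecom Press).
Chapter 3: Data Exploration
The Iris data set; summary statistics; visualization
*OLAP and multidimensional data analysis
What Is Data Exploration?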
A preliminary exploration of the data to better understand its characteristics.
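The 25th, 50th, and 75th percentiles, denoted Q1, Q2, and Q3, are called the first, second, and third quartiles.
The second quartile Q2 is also called the median. If the number of values n is odd, the median is the middle value of the sorted set; otherwise the median is the average of the two middle values.
Interquartile range (IQR): IQR = Q3 - Q1
Five-number summary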
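The median rule, the IQR, and the five-number summary (commonly given as the minimum, Q1, the median, Q3, and the maximum) can be sketched as follows. The helper names `median`, `five_number_summary`, and `iqr` are hypothetical. The quartiles come from Python's standard `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions and an assumption made for this example.

```python
import statistics


def median(values):
    """Middle value if n is odd, else the average of the two middle values."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def iqr(values):
    """Interquartile range: Q3 - Q1."""
    _, q1, _, q3, _ = five_number_summary(values)
    return q3 - q1


# Hypothetical sample, invented for illustration only.
data = [9, 1, 8, 2, 7, 3, 6, 4, 5]
summary = five_number_summary(data)
spread = iqr(data)
```

The IQR covers the middle half of the data, so a few extreme values affect it far less than they affect the full range (maximum minus minimum).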