
Chapter 2: Data Preprocessing (Course Materials)

Symmetric vs. Skewed Data
– Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
Mining Data Descriptive Characteristics
• Motivation
– To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
– median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
– Data dispersion: analyzed with multiple granularities of precision
– Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
– Folding measures into numerical dimensions
– Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
– Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
– Trimmed mean: chopping extreme values
• Median: a holistic measure
– Middle value if odd number of values, or average of the middle two values otherwise
– Estimated by interpolation (for grouped data)
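As a rough illustration of the central-tendency measures in this part of the slides (mean, weighted arithmetic mean, trimmed mean, median), here is a minimal Python sketch. The function names `weighted_mean` and `trimmed_mean`, the `fraction` parameter, and the sample values are assumptions made for this example; they do not come from the slides.

```python
from statistics import mean, median


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


def trimmed_mean(values, fraction):
    """Mean after chopping `fraction` of the values off each end.

    `fraction` is a hypothetical parameter in [0, 0.5); for example,
    0.1 drops the lowest 10% and the highest 10% of the sorted values.
    """
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values dropped per end
    return mean(ordered[k:len(ordered) - k])


# Made-up sample with one extreme value (110).
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

plain_mean = mean(sample)                # pulled upward by the extreme value
middle = median(sample)                  # average of the two middle values
robust_mean = trimmed_mean(sample, 0.1)  # drops 30 and 110 before averaging
```

On this sample the trimmed mean falls between the median and the plain mean, which is the intended effect of chopping extreme values.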
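Chapter 2: Data Preprocessing
• 2.1 Why preprocess the data?
• 2.2 Descriptive data summarization
• 2.3 Data cleaning
• 2.4 Data integration and transformation
• 2.5 Data reduction
• 2.6 Discretization and concept hierarchy generation
• Summary
2.1 Why Preprocess the Data?
Most current data mining research concentrates on algorithms and neglects data processing. In fact, data preprocessing is critical to data mining: many mature algorithms place requirements on the data sets they process, such as good completeness, low redundancy, and weakly correlated attributes.
Data preprocessing is an important and indispensable step in data mining. For a mining algorithm to discover valid knowledge, it must be given clean, accurate, and concise data. However, the data collected in real application systems is usually "dirty" data.
3. Incompleteness
Because of design flaws in real systems and human factors during use, data records may have missing or uncertain values. Possible causes include:
(1) some attributes are not always available (for example, household income, or customer information in sales transaction data);
(2) some data was considered unnecessary at the time it was entered;
(3) relevant data was not recorded because of a misunderstanding or an equipment malfunction;
(4) data was deleted because it was inconsistent with other recorded content;
(5) historical data or modifications to the data were overlooked.
4. Noisy data
The data contains errors or outliers (values that deviate from what is expected). A blood pressure or height of 0 is an obvious error; such errors arise easily when missing entries are filled with default values. Typical sources are:
(1) faulty data collection equipment;
(2) human or computer errors during data entry;
(3) errors during data transmission.
4.2 Functions of Data Preprocessing
• Data cleaning: removes noise from the data and corrects inconsistencies.
• Data integration: merges multiple data sources into a consistent data store, forming a complete data set such as a data warehouse or a data cube.
• Data transformation: converts data from one format into another (for example, normalization).
• Data reduction: compresses the data through aggregation, removal of redundant features, clustering, and similar methods.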
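Two of the ideas above can be sketched in a few lines of Python: treating an impossible default value (a blood pressure or height of 0) as missing during cleaning, and min-max normalization as one common form of transformation. This is only an illustrative sketch. The function names, the record layout, and the choice of min-max scaling are assumptions for the example, not something the slides prescribe.

```python
def mark_implausible_as_missing(records, fields):
    """Return copies of the records with non-positive values in `fields` set to None.

    A value of 0 for a field such as height or blood pressure is assumed to be
    a default that was filled in for a missing entry.
    """
    cleaned = []
    for record in records:
        fixed = dict(record)  # copy so the input is left untouched
        for field in fields:
            value = fixed.get(field)
            if value is not None and value <= 0:
                fixed[field] = None
        cleaned.append(fixed)
    return cleaned


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale `values` to the range [new_min, new_max]."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("cannot normalize a constant column")
    scale = (new_max - new_min) / (high - low)
    return [new_min + (v - low) * scale for v in values]


# Hypothetical patient records; the second one carries default zeros.
patients = [
    {"id": 1, "height_cm": 172, "systolic_bp": 118},
    {"id": 2, "height_cm": 0, "systolic_bp": 0},
]
cleaned_patients = mark_implausible_as_missing(patients, ["height_cm", "systolic_bp"])
scaled = min_max_normalize([10, 20, 30])
```

Marking the zeros as missing, rather than keeping them, stops them from distorting later summaries such as the mean.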
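– Median for grouped data, estimated by interpolation:
$$\text{median} = L_1 + \left(\frac{n/2 - \left(\sum f\right)_l}{f_{\text{median}}}\right) c$$
where $L_1$ is the lower boundary of the median interval, $n$ is the number of values, $\left(\sum f\right)_l$ is the sum of the frequencies of all intervals below the median interval, $f_{\text{median}}$ is the frequency of the median interval, and $c$ is its width.
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
– $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$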
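The interpolation formula for the grouped median and the definition of the mode can be tried out with the following Python sketch. It assumes equal-width intervals described by their lower boundaries and frequencies. The names `grouped_median` and `modes` and the sample frequencies are made up for the example.

```python
from collections import Counter


def grouped_median(lower_bounds, width, freqs):
    """Interpolated median for grouped data with equal-width intervals.

    lower_bounds[i] is the lower boundary of interval i and freqs[i] its
    frequency. Implements L1 + ((n/2 - (sum f)_l) / f_median) * c.
    """
    n = sum(freqs)
    below = 0  # (sum f)_l: total frequency of the intervals already passed
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and below + f >= n / 2:
            return lower + (n / 2 - below) / f * width
        below += f
    raise ValueError("no data")


def modes(values):
    """Return every value that occurs most frequently, in sorted order."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Intervals [0,10), [10,20), [20,30), [30,40) with made-up frequencies.
estimate = grouped_median([0, 10, 20, 30], 10, [2, 4, 3, 1])
bimodal = modes([1, 2, 2, 3, 3])
```

Here n = 10, the median interval is [10, 20), two values lie below it, and the estimate is 10 + ((5 - 2) / 4) × 10 = 17.5. The second call returns two values, so that data set is bimodal.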
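1. Inconsistency: the same attribute is encoded differently in different sources. For example, gender:
Database A: male = 1, female = 2
Database B: male = '男', female = '女'
Database C: male = 'M', female = 'F'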
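One way to reconcile the three encodings in this example is a per-source lookup table that maps every raw code to a shared vocabulary. The sketch below assumes the source labels "A", "B", "C" and the target labels "male" / "female"; the name `normalize_gender` is hypothetical.

```python
# Code tables taken from the gender example; the structure is an assumption.
GENDER_CODES = {
    "A": {1: "male", 2: "female"},
    "B": {"男": "male", "女": "female"},
    "C": {"M": "male", "F": "female"},
}


def normalize_gender(source, raw_value):
    """Map a source-specific gender code to the shared vocabulary.

    Returns None for an unknown source or code, so that unmapped values
    surface as missing instead of being silently guessed.
    """
    return GENDER_CODES.get(source, {}).get(raw_value)


merged = [
    normalize_gender("A", 2),
    normalize_gender("B", "男"),
    normalize_gender("C", "F"),
]
```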
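2. Duplication
The same real-world entity has two or more physical descriptions in the database. Suppose a weekly magazine has 100,000 subscribers and 0.1% of the records in its mailing list are duplicates, mainly because one name is written in different ways, such as Jon Doe and John Doe. The publisher then prints and mails 100 extra copies every week. If printing and mailing one copy costs 2 yuan, the company wastes more than 10,000 yuan per year (100 × 2 × 52 = 10,400).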
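The cost estimate in this example, and a crude way to flag near-identical names, can be sketched as follows. The 52-week year, the 0.9 similarity threshold, and the function names are assumptions for illustration. `difflib.SequenceMatcher` from the Python standard library is used here as one simple string-similarity measure, not as a recommended deduplication method.

```python
from difflib import SequenceMatcher


def annual_waste(subscribers, duplicate_rate, cost_per_copy, weeks=52):
    """Yearly cost of printing and mailing copies to duplicate records."""
    extra_copies = round(subscribers * duplicate_rate)  # duplicates per issue
    return extra_copies * cost_per_copy * weeks


def similar_names(a, b, threshold=0.9):
    """Flag two names as likely duplicates if their similarity ratio is high."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold


waste = annual_waste(100_000, 0.001, 2)         # 100 copies * 2 yuan * 52 weeks
likely_same = similar_names("Jon Doe", "John Doe")
likely_different = similar_names("Jon Doe", "Jane Roe")
```

A fixed threshold is a blunt instrument: it will miss some spelling variants and merge some distinct people, so flagged pairs would normally be reviewed before records are merged.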
