03-DataPreprocessing-PartI (Data Preprocessing)
Product (id, description, weight, unit)
Order(id, order_number, customer_id, product_id, quantity, price)
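As a rough sketch, the two relations above could be created and joined with Python's built-in sqlite3 module. The column types, the in-memory database, and the sample rows are assumptions for illustration; the slides give only the attribute names. The second table is named orders here because ORDER is a reserved word in SQL.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database holding the two relations."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE product (
            id INTEGER PRIMARY KEY,
            description TEXT,
            weight REAL,
            unit TEXT
        );
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            order_number TEXT,
            customer_id INTEGER,
            product_id INTEGER REFERENCES product(id),
            quantity INTEGER,
            price REAL
        );
    """)
    # Hypothetical sample rows, not taken from the slides.
    conn.execute("INSERT INTO product VALUES (1, 'coffee beans', 0.5, 'kg')")
    conn.execute("INSERT INTO orders VALUES (1, 'A-100', 42, 1, 3, 7.5)")
    conn.commit()
    return conn

def order_lines(conn):
    """Join each order to its product and compute quantity * price."""
    return conn.execute("""
        SELECT o.order_number, p.description, o.quantity * o.price
        FROM orders AS o
        JOIN product AS p ON p.id = o.product_id
        ORDER BY o.id
    """).fetchall()
```

Calling order_lines(build_demo_db()) returns one tuple per order, which shows how product_id links the two relations.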
Data Warehouses
• Data collected from multiple data sources
• Stored under a unified schema
• Usually residing at a single site
• Provide historical information
• Used for reporting
Nominal: Provide enough information to distinguish by name (=, ≠)
Ordinal: Provide enough information to sort (<, >)
Interval: Differences between values are meaningful (+, -)
Ratio: Differences and ratios are meaningful (*, /)
• {low, medium, high}, grades {A, B, C, D, …}
•Interval: Describes the degree of difference between values
• Dates, temperatures in C and F
•Ratio: Both degree of difference and ratio are meaningful
Nominal: Mode, entropy, contingency
Ordinal: Median, percentiles, rank correlation, run tests, sign tests
Interval: Mean, standard deviation, Pearson's correlation, t and F tests
Ratio: Geometric mean, harmonic mean, percent variation
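To make the pairing of attribute types and statistics concrete, here is a minimal sketch using Python's statistics module. All sample values and the grade ordering are made up for illustration, and ordinal_median is a hypothetical helper, not something the slides define.

```python
import statistics

# Made-up sample values, one list per attribute type.
EYE_COLOR = ["brown", "blue", "brown", "green"]   # nominal
GRADES = ["C", "A", "B", "B", "D"]                # ordinal
TEMPS_C = [10.0, 20.0, 30.0]                      # interval
MASSES_KG = [1.0, 2.0, 4.0]                       # ratio

# Assumption: grades ordered from worst to best.
GRADE_ORDER = ["D", "C", "B", "A"]

def ordinal_median(values, order):
    """Median of ordinal values; takes the upper middle for even-length input."""
    ranks = sorted(order.index(v) for v in values)
    return order[ranks[len(ranks) // 2]]

def summarize():
    """Apply one statistic that is meaningful for each attribute type."""
    return {
        "nominal_mode": statistics.mode(EYE_COLOR),
        "ordinal_median": ordinal_median(GRADES, GRADE_ORDER),
        "interval_mean": statistics.mean(TEMPS_C),
        "ratio_geometric_mean": statistics.geometric_mean(MASSES_KG),
    }
```

Only the order of the grades is used for the median, never a numeric distance between them, which is what separates an ordinal attribute from an interval one.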
Data Streams
A sequence of digital signals used for transmitting different kinds of content
• Sensor data: collecting GPS/environment data and sending a reading every tenth of a second
• Image data: satellite data, surveillance cameras
• Web traffic: a node on the Internet receives streams of IP packets
Nominal: Zip code, employee ID numbers, eye color, gender
Ordinal: Hardness of minerals, {good, better, best}, street numbers
Interval: Calendar dates, temps in Celsius and Fahrenheit
Data Preprocessing
Data – Things to consider
•Type of data: determines which tools can be used to analyze the data
•Quality of data:
• Tolerate some level of imperfection
• Improving the quality of the data improves the quality of the results
Examples:
•Web pages visited by a user (object):
• {<Homepage>, <Electronics>, <Cameras and Camcorders>, <Digital Cameras>, …, <Shopping Cart>, <Order Confirmation>}, {…}
•Transactions made by a customer over a period of time:
• {t1, t18, t500, t721}, {t11, t38, t43, t621, t3005}
•Nominal: Differentiates between values based on names
• Gender, eye color, patient ID
•Ordinal: Allows a rank order for sorting data but does not describe the degree of difference
• Gender: 0 denotes male, 1 denotes female
•Asymmetric: if the states are not equally important
• Medical Test: 0 denotes negative, 1 denotes positive
Attribute Properties
Graph Data
Data structure represented by nodes (entities) and edges (relationships)
Examples:
◦ Protein subsequences
◦ Web pages and links
[Figure: example graph with nodes a, b, c, d, e]
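A node-and-edge structure like this is often stored as an adjacency list. The sketch below assumes an undirected graph. The node names a to e come from the slide, but the edges are hypothetical, because the figure's connections are not recoverable from the text.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency list from (node, node) pairs."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph

# Hypothetical edges between the nodes a..e.
EXAMPLE_EDGES = [("a", "b"), ("a", "c"), ("c", "d"), ("d", "e")]
```

With build_graph(EXAMPLE_EDGES), the neighbors of a node (for instance, the pages linked with a given web page) are a single dictionary lookup away.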
Attribute Types
•Distinctness: = and ≠
•Order: <, ≤, ≥, and >
•Addition: + and -
•Multiplication: * and /
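One way to read these four properties is as a cumulative ladder: each attribute type supports its own property plus all the earlier ones. The sketch below encodes that reading. The function name properties_of and the lowercase labels are illustrative choices, not part of the slides.

```python
# Properties and attribute types, listed in the same cumulative order.
PROPERTIES = ["distinctness", "order", "addition", "multiplication"]
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

def properties_of(attribute_type):
    """Return the properties that are meaningful for the given attribute type."""
    level = LEVELS.index(attribute_type)
    return PROPERTIES[: level + 1]
```

For example, properties_of("interval") lists distinctness, order, and addition, but not multiplication, which is why ratios of Celsius temperatures are not meaningful.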
Type    Description    Examples    Operations
Categorical or Qualitative: Nominal
•Preprocessing: modify the data to better fit data mining tools:
• Change length into short, medium, long
• Reduce number of attributes
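Turning a numeric length into short, medium, or long is a simple discretization step. The sketch below shows one way to do it. The unit (centimeters) and both cut-off values are assumptions chosen for illustration; real thresholds depend on the data.

```python
def discretize_length(length_cm, short_max=10.0, medium_max=100.0):
    """Map a numeric length to an ordinal label using two assumed cut-offs."""
    if length_cm <= short_max:
        return "short"
    if length_cm <= medium_max:
        return "medium"
    return "long"

def discretize_all(lengths_cm):
    """Apply the same mapping to a whole column of lengths."""
    return [discretize_length(x) for x in lengths_cm]
```

The result is an ordinal attribute: the labels keep their order, but the exact differences between the original lengths are discarded.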
Data
•Collection of objects or records
             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1   3     0      5     0     2      6     0    …     …        …
Document 2   0     7      0     2     1      0     0    …     …        …
Document 3   0     1      0     0     1      2     2    …     …        …
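A document-term matrix like the one above can be built by counting how often each term occurs in each document. The following is a minimal sketch that assumes whitespace tokenization and lowercasing. The function name and the sample sentences in the usage note are hypothetical.

```python
from collections import Counter

def document_term_matrix(documents):
    """Return (vocabulary, matrix), with one row of term counts per document."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    # Sorted union of all terms seen in any document.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix
```

For instance, document_term_matrix(["team play play", "coach ball"]) yields one column per distinct term and one row per document, with zeros where a term does not occur.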
Numeric or Quantitative: Interval, Ratio
Ratio examples: Temps in Kelvin, monetary quantities, counts, age, mass
Transformations
Categorical or Qualitative
• Nominal: any one-to-one mapping (if all employee numbers are reassigned, it will not make a difference)
• Ordinal: any order-preserving function, e.g. {0.5, 1, 10} => {1, 2, 3}
Numeric or Quantitative
• Interval: e.g. Celsius to/from Fahrenheit
• Ratio: e.g. length can be measured in meters or feet
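The ordinal example {0.5, 1, 10} => {1, 2, 3} can be reproduced by replacing each value with its rank. This is only one of many order-preserving functions, and rank_map is a hypothetical helper name.

```python
def rank_map(values):
    """Map each distinct value to its 1-based rank, preserving the order."""
    ordered = sorted(set(values))
    return {value: rank for rank, value in enumerate(ordered, start=1)}
```

After the mapping, comparisons such as < and > give the same answers as before, but the differences between values (0.5 versus 9 in the original) are no longer preserved, which is acceptable for ordinal data.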
…
1029345   Male     1/24/1957   151   92   62
1029346   Male     5/3/1983    124   80   66
1029347   Female   9/20/1991   110   74   54
…
What kind of Data?
Any data as long as it is meaningful for the target application
•Continuous Attributes:
• Real numbers
• Examples: temperatures, height, weight, …
• Practically, can be measured with limited precision
Asymmetric Attributes
Interval: new = a*old + b
Ratio: new = a*old
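The two formulas above can be checked numerically. The sketch below uses Celsius to Fahrenheit as the interval case (a = 9/5, b = 32) and meters to feet as the ratio case (a ≈ 3.28084). The function names are illustrative.

```python
def c_to_f(celsius):
    """Interval transformation: new = a*old + b."""
    return 9.0 / 5.0 * celsius + 32.0

def m_to_ft(meters):
    """Ratio transformation: new = a*old."""
    return 3.28084 * meters

def ratio(x, y):
    """Plain ratio x / y, used to compare values before and after a transformation."""
    return x / y
```

20 °C is twice 10 °C as a number, but 68 °F is not twice 50 °F, so ratios do not survive the interval transformation, while ratios of differences do. With meters and feet, 2 m is twice 1 m and the same factor of two holds after conversion.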
Discrete and Continuous Attributes
•Discrete Attributes:
• Finite or countably infinite set of values
• Categorical (zip codes, employee IDs) or numeric (counts)
• Often represented as integers
• Special case: binary attributes (yes/no, true/false, 0/1)
◦ Database data
◦ Data warehouse data
◦ Data streams
◦ Sequence data
◦ Graph
◦ Spatial data
◦ Text data
Database data