
Big Data Mining: Methods and Applications

Heterogeneous, Autonomous sources with distributed and decentralized control, and seeks to explore Complex and Evolving relationships among data.
Challenges of Big Data Mining
students’ satisfaction
Université de Sherbrooke, main campus
Université de Sherbrooke, Faculty of Medicine campus
Université de Sherbrooke, Montreal campus
Agenda
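Challenges of big data mining
Problems in ultra-high-dimensional data mining
Outlier detection
Clustering and classification
Clustering algorithms for sequence data
Discovery and application of significant patterns
Statistical models for sequence data
Applications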
images or videos for X-ray examination and CT scan
microarray expression images and sequences for a DNA or genomic-related test
Heterogeneous features: different types of representations for the same individuals
Fraud detection
Fault diagnosis
Intrusion detection
Satellite image analysis
Public health monitoring
Etc.
Outline of the work
Defining a new measure, the weighted holo-entropy
Big Data Mining: Methods and Applications
Shengrui Wang, Université de Sherbrooke
2014-12-06
Université de Sherbrooke, Canada
37,000 students from more than 100 countries
Co-op programs (work/study)
Exceptional human and natural environments
Strong research in healthcare, sciences and
Diverse features: variety of the features involved to represent each single observation
Challenges of Big Data Mining
Main Partner Institutions
CHUS
Problems in High-Dimensional Data Mining: Outlier Detection
Outlier detection and recommendation systems
According to IBM (2012), 2.5 quintillion bytes of data are generated each day
1 quintillion bytes = 10^18 bytes
90 percent of the data in the world today were
Proposing two practical, 1-parameter algorithms for detecting outliers in large-scale categorical datasets
Holo-entropy
Holo-entropy is the sum of the entropy and the total correlation of the random vector.
Entropy describes the uncertainty related to a whole data set.
Total correlation is the sum of mutual information measuring the shared information of a dataset.
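
The definitions above can be made concrete with a small sketch. It assumes a dataset given as a list of equal-length tuples of categorical values and estimates probabilities by relative frequencies. The function names (`attribute_entropy`, `joint_entropy`, `total_correlation`, `holo_entropy`) are illustrative choices, not taken from the slides. Because total correlation equals the sum of the per-attribute entropies minus the joint entropy, the holo-entropy computed this way reduces to the sum of the per-attribute entropies.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def joint_entropy(rows):
    """Entropy of whole objects, each row treated as one joint outcome."""
    return attribute_entropy([tuple(r) for r in rows])


def total_correlation(rows):
    """Sum of per-attribute entropies minus the joint entropy."""
    columns = list(zip(*rows))
    return sum(attribute_entropy(col) for col in columns) - joint_entropy(rows)


def holo_entropy(rows):
    """Entropy plus total correlation, as defined in the text."""
    return joint_entropy(rows) + total_correlation(rows)


# Toy categorical dataset (hypothetical values, two attributes).
rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "x")]
print(joint_entropy(rows), total_correlation(rows), holo_entropy(rows))
```

On real data one would normally compute only the per-attribute entropies, since the joint entropy cancels out of the sum.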
S. Wu and S. Wang, “Parameter-free Outlier Detection for Large-scale Categorical Data”, IEEE Trans. on Knowledge and Data Engineering, 2013
INFORMATION-THEORETIC OUTLIER DETECTION FOR LARGE-SCALE CATEGORICAL DATA
Formulating as an optimization problem
Defining differential holo-entropy
Computing and updating the outlier factor of an object
Providing upper bound on outliers
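
As a rough illustration of scoring an object by the change in holo-entropy, the sketch below takes the score to be the drop in holo-entropy when that object is removed from the dataset. This reading of the outline is an assumption, and the code is not the algorithm of the cited paper: it ignores attribute weighting, recomputes everything from scratch for each object (quadratic cost), and uses hypothetical names (`naive_outlier_factors`, `top_outliers`). It relies on the identity that holo-entropy equals the sum of the per-attribute entropies.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def holo_entropy(rows):
    """Unweighted holo-entropy: sum of the per-attribute entropies."""
    return sum(entropy(col) for col in zip(*rows))


def naive_outlier_factors(rows):
    """Drop in holo-entropy caused by removing each object in turn.

    rows must be a list so that slices can be concatenated.
    """
    base = holo_entropy(rows)
    return [base - holo_entropy(rows[:i] + rows[i + 1:])
            for i in range(len(rows))]


def top_outliers(rows, o):
    """Indices of the o objects with the largest factors (o is user-chosen)."""
    factors = naive_outlier_factors(rows)
    ranked = sorted(range(len(rows)), key=lambda i: factors[i], reverse=True)
    return ranked[:o]


# Five identical objects and one unusual object (hypothetical data).
rows = [("a", "x")] * 5 + [("b", "y")]
print(top_outliers(rows, 1))
```

A practical method would update the score incrementally rather than recompute it, which is what the "computing and updating" step in the outline suggests.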
produced within the past two years
Challenges of Big Data Mining
A single human being in a biomedical world can be represented by using
simple demographic information such as gender, age, family disease history
engineering, and business administration
$185M in research funding per year
7th to 14th place in Maclean's rankings, 235th place in the global Leiden ranking
1st in Canada in terms of invention revenues, and
Social media data mining
Challenges of Big Data Mining
5V: Volume + Variety + Velocity + Variability + Veracity
HACE Theorem (Wu et al., IEEE TKDE, 2014): Big Data starts with large-volume,