
Cloud Computing and Big Data - Meng Xiaofeng

Data Mining Teaching Workshop, Beijing, August 9, 2012
Cloud Computing and Big Data
Cloud computing is like a highway that can support a variety of transportation.
Big data can be seen as one vehicle on that highway.
Cloud computing is the infrastructure, while big data is the object it serves.
Integration
The right metadata is needed: data has to be translated as it flows from one model (platform) to another.
E.g., transferring data from Hadoop to DB2.
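The metadata-driven translation described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Hadoop or DB2 code: the key=value line format, the `SCHEMA` mapping, and the field names are all hypothetical, chosen only to show how schema metadata can drive the conversion of loosely structured records into typed rows.

```python
# Hypothetical schema metadata: source field -> (target column, type converter).
SCHEMA = {
    "uid": ("user_id", int),
    "ts": ("event_time", str),
    "amt": ("amount", float),
}

def parse_record(line):
    """Parse one tab-separated line of key=value pairs into a dict."""
    return dict(pair.split("=", 1) for pair in line.rstrip("\n").split("\t"))

def to_row(record, schema=SCHEMA):
    """Translate a parsed record into a typed row; missing fields become None."""
    row = {}
    for src, (dst, convert) in schema.items():
        value = record.get(src)
        row[dst] = convert(value) if value is not None else None
    return row

# Two made-up input lines; the second one lacks the "amt" field.
lines = [
    "uid=42\tts=2012-08-09T10:00:00\tamt=19.5",
    "uid=7\tts=2012-08-09T10:05:00",
]
rows = [to_row(parse_record(line)) for line in lines]
```

Keeping the mapping in one metadata table means a new source field only requires a new schema entry, not a change to the translation code.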
Analysis
Fundamentally different from traditional statistical analysis on small samples
Real-time analysis
Lack of coordination between database systems
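Socialization of the digital:
Data footprints and their structure are themselves a part of social structures and processes, continually shaping new social orders and relationships.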
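Data Thinking: Computational Social Science
All social explanation, monitoring, prediction, and planning depend on collecting, organizing, and analyzing data footprints.
Computational social science method:
The process and activity of collecting, organizing, and analyzing data footprints, driven by specific social needs and guided by specific social theory, in order to carry out social explanation, monitoring, prediction, and planning.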
Yinxu ruins, Anyang (1300 BC, 3,300 years ago)
Oracle bone pit, more than 17,000 pieces
Big Data Applications
Scientific computing
Stock trading
Web data
Microblog data
...
“Data is widely available; what is scarce is the ability to extract wisdom from it.”
Hal Varian, Google's chief economist
"Fishing in the ocean" vs. "fishing in a pond"
Timeliness
Many situations need the result of the analysis immediately.
Real-time processing can be a challenge with big data, especially in dynamic data environments like financial trading and social media.
Develop partial results in advance, then compute incrementally.
New index structures are required.
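The idea of preparing partial results in advance and then computing incrementally can be illustrated with a running mean. This is only a minimal sketch under simple assumptions: the `RunningMean` class and the sample values are hypothetical, and a real system would keep such partial aggregates per key and per time window.

```python
class RunningMean:
    """Keeps a partial result (count and sum) so the mean can be updated
    without rescanning the data seen so far."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add_batch(self, values):
        """Fold newly arrived values into the stored partial result."""
        for value in values:
            self.count += 1
            self.total += value

    @property
    def mean(self):
        """Current mean, or None if no data has been seen yet."""
        return self.total / self.count if self.count else None

stats = RunningMean()
stats.add_batch([10.0, 12.0, 14.0])  # partial result computed in advance
mean_before = stats.mean
stats.add_batch([20.0])              # incremental update when new data arrives
mean_after = stats.mean
```

Each update costs time proportional to the size of the new batch only, which is what makes an immediate answer possible as data keeps arriving.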
Acquisition
Multiple data sources and huge volumes
Much of this data is of no interest
Data reduction is important
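One possible form of data reduction at acquisition time is to discard uninteresting records first and then keep a fixed-size uniform sample of the rest. The sketch below uses reservoir sampling for the second step; that choice, the `level` field, and the synthetic events are assumptions made for illustration and are not taken from the slides.

```python
import random

def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

# Synthetic events: one in ten is an error, the rest are debug noise.
events = [{"id": i, "level": "ERROR" if i % 10 == 0 else "DEBUG"}
          for i in range(1000)]

# Step 1: filter out data of no interest. Step 2: sample what remains.
interesting = (event for event in events if event["level"] == "ERROR")
rng = random.Random(0)  # fixed seed so that runs are repeatable
sample = reservoir_sample(interesting, 5, rng)
```

Both steps work in a single pass with constant memory, so they can run while the data is still being acquired.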
[Figure: storage hierarchy and parallelism, 1980 vs. 2008 (cloud computing). Layers: logic (CPU), memory (RAM), active storage (disk), archival (tape). Parallelism within a single node is fast and synchronous; parallelism across nodes in a cluster is slow and asynchronous. New hardware: SSD, PCM…]
Extraction & Cleaning
Various data types: structured and unstructured
Extraction is often highly application-dependent
Missing and erroneous information should be cleaned
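Cleaning missing and erroneous information might look like the sketch below. The record layout (`name`, `age`), the rule that a record without a name is unusable, and the valid age range are all hypothetical; as the slide notes, real extraction and cleaning rules depend heavily on the application.

```python
def clean_record(raw):
    """Return a cleaned record, or None if a required field is missing."""
    name = (raw.get("name") or "").strip()
    if not name:
        return None  # required field missing: drop the record
    try:
        age = int(raw.get("age", ""))
    except (TypeError, ValueError):
        age = None  # missing or non-numeric value
    if age is not None and not (0 <= age <= 130):
        age = None  # out-of-range value treated as erroneous
    return {"name": name, "age": age}

raw_records = [
    {"name": " Alice ", "age": "34"},
    {"name": "", "age": "20"},     # missing name
    {"name": "Bob", "age": "-5"},  # erroneous age
    {"name": "Carol"},             # missing age
]
cleaned = [record for record in map(clean_record, raw_records)
           if record is not None]
```

Here an unusable optional value is set to None rather than guessed, so later analysis can tell real values from gaps.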
Big Data, Extremely Large Database (XLDB)
> PB, unstructured data
Using data as a resource to solve problems across many fields
Data Thinking
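Digitization of Society and Socialization of the Digital
Digitization of society: data footprints (data print)
In the digital age, the data footprints that all kinds of people leave behind, intentionally or not, grow ever richer.
Data footprints carry social meaning and embody social structure.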
Outline
1 Introduction to Big Data
2 Cloud Computing and Big Data
3 Challenging Problems
4 Our Work
5 Conclusion
What Can Big Data Do?
Prediction
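What Can Big Data Do?
Wall Street sells stocks based on public sentiment.
Hedge funds analyze companies' product sales from customer reviews on shopping sites.
Banks infer the employment rate from the number of openings on job sites.
Investment firms collect and analyze statements from listed companies, looking for early signs of bankruptcy.
The US Centers for Disease Control and Prevention analyzes the global spread of influenza and other epidemics from web users' searches.
US President Obama's campaign team analyzes voters' preferences among presidential candidates in real time from voters' microblog posts.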
Data, Data and Data!
Difficult to Get the Data
Data is all around you!
Data types are varied.
Most data is held by companies.
It is difficult for researchers to get the data.
Big Data Analytics Tools in Use
[Bar chart, horizontal axis 0% to 40%. Categories, top to bottom: Don't know; We aren't using big data analytics tools; Other; ParAccel Analytic Database; Kognitio WX2; Infobright; Sybase IQ; EMC Greenplum; Teradata EDW; HP Vertica; IBM Netezza; Hadoop/MapReduce; IBM DB2 Smart Analytics System; Microsoft SQL PDW; Oracle Exadata. Bar values in extracted order: 3%, 1%, 1%, 2%, 4%, 8%, 9%, 9%, 10%, 11%, 12%, 18%, 21%, 11%, 35%.]
Batch processing: MapReduce
Stream processing: Storm (Twitter), S4 (Yahoo!)
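For readers unfamiliar with the batch model, the MapReduce pattern can be sketched as a word count in plain Python. This is a single-process illustration of the map, shuffle, and reduce phases only; it does not use the Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())

counts = word_count(["big data", "cloud computing and big data"])
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data across the network; a stream processor would instead update such counts continuously as each record arrives.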
Interpretation
Big data is of limited value if users cannot understand the analysis
Provenance of the result data
Data visualization
Users    Accuracy          Reliability      Data volume    Response
Few      Very high         Low to medium    Tera           Slow
Many     High              Very high        Giga           Fast
Many     Medium to high    Medium           Peta           Fast
Many     Medium to high    Medium           100 Peta       Fast
Big Data Analysis Pipeline
Working together with cloud computing can greatly advance these processes.
Acquisition → Extraction & Cleaning → Integration → Analysis → Interpretation