Introduction to the China Mobile Hadoop Data Mining Platform
Large scale data in China Mobile Communication Corporation (CMCC)
Subscribers: 500 million
Subscribers' CDR (call detail record) data: 5~8 TB/day in CMCC
For a branch company (> 20 million subscribers):
Set the target fields to Key, other fields to Value
Define the target fields (one or all)
MapTasker 2
Set the target fields to Key, other fields to Value
MapTasker n
Voice: 100 million * 1 KB = 100 GB/day
SMS: 100~200 million * 1 KB = 100~200 GB/day
……
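As a quick check of the arithmetic above, the following minimal sketch recomputes the daily volumes. It assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and the 1 KB average record size quoted on the slide; the function name `daily_volume_gb` is hypothetical and used only for illustration.

```python
# Assumption: decimal units, as the slide's round numbers suggest.
KB = 1_000
GB = 1_000_000_000


def daily_volume_gb(records_per_day, record_size_bytes=KB):
    """Return the daily data volume in GB for a given record count."""
    return records_per_day * record_size_bytes / GB


# Voice: 100 million records per day at about 1 KB each.
voice_gb = daily_volume_gb(100_000_000)

# SMS: 100 to 200 million records per day at about 1 KB each.
sms_low_gb = daily_volume_gb(100_000_000)
sms_high_gb = daily_volume_gb(200_000_000)

print(f"Voice: {voice_gb:.0f} GB/day")
print(f"SMS: {sms_low_gb:.0f} to {sms_high_gb:.0f} GB/day")
```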
Network signaling data, for a branch company (> 20 million subscribers)
GPRS signaling data: 48 GB/day
3G signaling data: 300 GB/day
Voice and SMS signaling data, ……
Challenges and limitations of BASS
Hardware investment is large, and expansion is costly.
62% of the investment goes to hardware. Because Unix servers follow different specifications, expanding capacity means buying entirely new Unix servers rather than adding a few servers to the existing ones.
» BC-PDM(phase II)
› Web based GUI
› Provide SaaS mode for users
› Data Transfer Tool
› Provide data upload and download tools for SaaS
› Security
› Multi-tenant and user groups for branches, ACL for data access
Parallel Data Mining Platform in Telecom Industry
-- Big Cloud based Parallel Data Mining Platform Friday, Oct 2, 2009 NYC
Research Institute of China Mobile Communication Corporation Feng Cao
Offline data backup (5 branches) takes a lot of time, online data backup (8 branches) consumes a lot of resources, and file backup (18 branches) is slow to restore.
Internal material. Keep confidential.
Features of BC-PDM (I)
» Targeting general data analysis and data mining platform/tools
BC-PDM(phase I)
Workflow management
GUI: drag-and-drop operation for application modeling design
Job Monitoring
Flow Configuration
BC-PDM Architecture
Data mining App
• Large Scale Data Processing
• Large Scale Data Mining
• Excellent scalability
DE
DT
• Large Scale Storage
• High performance
• High Availability
• Low Price
Set the target fields to Key, other fields to Value
ReduceTasker 1
Reduce records with the same key: read from the value list and write a single record
ReduceTasker m
Reduce records with the same key: read from the value list and write a single record
» BC-PDM(phase II)
› DE (Data Exploration)
› Simple data analysis and preview
› ETL (25 more)
• Simulates SQL operations; supports Join, Group by, Expression, Case When, Update, etc.
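To illustrate how a SQL-style Group by can be expressed in the MapReduce model, here is a minimal in-memory sketch. It is not BC-PDM code: the function names, the `subscriber` and `duration` fields, and the choice of SUM as the aggregate are all assumptions made for the example, and the shuffle phase is simulated with a dictionary rather than a Hadoop cluster.

```python
from collections import defaultdict


def map_group_by(record, key_field, value_field):
    """Map step: emit (grouping key, value to aggregate) for one record."""
    yield record[key_field], record[value_field]


def reduce_sum(key, values):
    """Reduce step: aggregate all values that share the same key."""
    return key, sum(values)


def group_by_sum(records, key_field, value_field):
    """Simulate map, shuffle, and reduce in memory (illustration only)."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in map_group_by(record, key_field, value_field):
            shuffled[key].append(value)  # shuffle: collect values per key
    return dict(reduce_sum(key, values) for key, values in shuffled.items())


# Hypothetical call records; roughly equivalent to
# SELECT subscriber, SUM(duration) FROM calls GROUP BY subscriber
calls = [
    {"subscriber": "A", "duration": 60},
    {"subscriber": "B", "duration": 45},
    {"subscriber": "A", "duration": 30},
]
totals = group_by_sum(calls, "subscriber", "duration")
print(totals)
```

The grouping column becomes the map output key, so the framework's shuffle brings all rows of one group to the same reducer, which then applies the aggregate.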
Managing the IT system is complex.
One Unix server cannot support a BASS. Each branch subsystem has about 3-5 servers, such as an Interface Server and a Display Server.
Features of BC-PDM(II)
» Targeting general data analysis and data mining platform/tools
BC-PDM(phase I)
Visualization
Text, decision tree, pie chart, and histogram
› Data mining Algorithm (4 more)
• Classifier, Sequence Association Analysis
Data mining Algorithm (9 algorithms from 3 categories based on MapReduce)
Clustering, Classifier, Association Analysis
Output Data
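Key technical solution: parallel ETL, redundancy removal
Function
Redundancy removal deletes records that are exactly identical within the full data set: when two or more records are identical, only one of them is kept.
1) Parallelizes redundancy removal on data tables
2) Results are fully identical to those of the serial version
3) Speedup is close to linear; TB-scale data is processed in thousands of seconds
Serial redundancy removal in the database
1) Map splits the data to be processed into blocks, with each block handled by one processing node. The input key of map is the default, the offset of each line of data, and the value is the text of that line, so the lines of each block are read in turn. The map task outputs intermediate <key,value> pairs in which the key is the full text of the line and the value is empty text.
2) Reduce emits the data that share the same key: the key is the whole line and the value is empty, so that identical records are kept only once. The reduce output is stored in the distributed file system.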
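The map and reduce steps described above can be sketched as follows. This is a minimal in-memory simulation, not the BC-PDM implementation: the function names are hypothetical, the line index stands in for the byte offset that Hadoop uses as the default input key, and the sorted list returned at the end stands in for the output written to the distributed file system.

```python
from collections import defaultdict


def map_record(offset, line):
    """Map step: ignore the offset, emit the whole line as key, empty value."""
    yield line, ""


def reduce_record(key, values):
    """Reduce step: emit each distinct key once, however many values arrived."""
    yield key, ""


def remove_redundancy(lines, num_blocks=2):
    """Simulate block-wise map, shuffle, and reduce in memory."""
    # Split the input into blocks, one per (simulated) processing node.
    blocks = [lines[i::num_blocks] for i in range(num_blocks)]

    shuffled = defaultdict(list)
    for block in blocks:
        # Assumption: the line index replaces the real byte offset.
        for offset, line in enumerate(block):
            for key, value in map_record(offset, line):
                shuffled[key].append(value)  # shuffle: group by identical line

    # Keys are processed in sorted order, as in a MapReduce shuffle.
    output = []
    for key in sorted(shuffled):
        for out_key, _ in reduce_record(key, shuffled[key]):
            output.append(out_key)
    return output


# Hypothetical CDR-like lines containing two duplicated records.
records = ["a,1", "b,2", "a,1", "c,3", "b,2"]
unique_records = remove_redundancy(records)
print(unique_records)
```

Because the whole line is the key, identical records always meet at the same reducer regardless of which block they started in, which is why the parallel result matches the serial one.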
Current solution
Commercial database / data warehouse systems
Commercial Data Mining Tools
Network Optimization
Network QoS Analysis
Signaling Data Analysis
......
Data extraction from other systems, data transfer, data gathering, data statistics …
Built on database systems, most operations are performed inside the database, which amounts to ELT (Extract, Load, and Transform) rather than ETL.
ETL (14 different ETL operations from 6 categories based on MapReduce)
Statistics, attribute processing, data sampling, query, data processing, redundant data processing
Case I – MapReduce based ETL
Function - Redundancy Removal
Delete duplicate records in the CDR data and keep a single copy of each.
Input Data
MapTasker 1
Enterprise Miner, Clementine, Intelligent Miner
Service Optimization and Log Processing
Spam Message Filtering ……
Most run on Unix servers, with data stored in storage arrays