
Big Data Collection and Cleaning


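Volume: data at large scale, with the amount of data continuing to grow.
Variety: data comes from many sources, in many types and formats.
Velocity: the speed at which data is created and moved.
Veracity: the pursuit of high-quality data.
Value (low value density): as data volume grows, the meaningful information it contains does not grow in proportion.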
1. Log collection systems (Apache Flume, Scribe)
3. Database collection systems (relational, NoSQL, and other databases)
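As a rough illustration of the first step a log collection system performs before shipping records onward, the sketch below turns raw log lines into structured records. It does not show how Apache Flume or Scribe work internally. The log format, `LOG_PATTERN`, `parse_log_lines`, and the sample lines are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical log format: "<ip> <ISO timestamp> <method> <path> <status>"
LOG_PATTERN = re.compile(
    r"^(?P<ip>\S+) (?P<ts>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})$"
)


def parse_log_lines(lines):
    """Yield one dict per well-formed line; malformed lines are skipped."""
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:
            record = match.groupdict()
            record["status"] = int(record["status"])
            yield record


# Made-up sample input, including one line that does not fit the format.
sample_lines = [
    "10.0.0.1 2019-02-15T08:00:01 GET /index.html 200",
    "10.0.0.2 2019-02-15T08:00:02 POST /login 302",
    "corrupted line without structure",
    "10.0.0.1 2019-02-15T08:00:05 GET /missing 404",
]

records = list(parse_log_lines(sample_lines))
status_counts = Counter(r["status"] for r in records)
```

Skipping malformed lines keeps the sketch short. A real collector would more likely route them to a separate error channel so they can be inspected later.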
Big Data Collection Applications
Skill prerequisites
Database fundamentals (SQL statements); basic Linux operations; Python basics
Environment setup
Database (MySQL); JDK (Java environment); Python
Thanks
[Data Collection and Cleaning]
2019-02-15, 周乐 (Zhou Le)
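1. What is big data
2. Main characteristics of big data
3. The big data processing pipeline
4. The concept of big data collection
5. Big data collection applications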
What Is Big Data
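Taobao recommendations
Recommendations based on your recent browsing and purchasing behavior
Recommendations based on seasonal changes
Continuous inference of your characteristics from the devices you use
Recommendations based on your shopping preferences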
Industry Status and Outlook
2014-03: Big data is written into the Government Work Report for the first time
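The Big Data Processing Pipeline
Data collection: uses multiple kinds of databases (relational, NoSQL) to store data from different sources.
Statistical analysis: classifies and summarizes the data already stored in the large distributed database, which is enough to meet the analysis needs of common scenarios.
Data presentation: analyzes the results of the preceding steps or turns them into reports.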
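To make the statistical analysis and presentation stages more concrete, here is a minimal sketch that groups records by category, computes simple summary statistics, and formats them as a plain-text report. The `orders` records, the field names, and both helper functions are hypothetical. A real deployment would normally run this kind of aggregation inside the distributed store rather than in a single Python process.

```python
from collections import defaultdict

# Hypothetical order records, as they might look after collection and cleaning.
orders = [
    {"category": "books", "amount": 30.0},
    {"category": "books", "amount": 50.0},
    {"category": "food", "amount": 20.0},
    {"category": "electronics", "amount": 100.0},
]


def summarize_by_category(rows):
    """Group rows by category and compute count, total, and mean amount."""
    buckets = defaultdict(lambda: {"count": 0, "total": 0.0})
    for row in rows:
        bucket = buckets[row["category"]]
        bucket["count"] += 1
        bucket["total"] += row["amount"]
    return {
        category: {
            "count": stats["count"],
            "total": stats["total"],
            "mean": stats["total"] / stats["count"],
        }
        for category, stats in buckets.items()
    }


def format_report(summary):
    """Render the summary as aligned text lines, sorted by category name."""
    lines = [f"{'category':<12}{'count':>6}{'total':>10}{'mean':>10}"]
    for category in sorted(summary):
        stats = summary[category]
        lines.append(
            f"{category:<12}{stats['count']:>6}"
            f"{stats['total']:>10.2f}{stats['mean']:>10.2f}"
        )
    return lines


summary = summarize_by_category(orders)
report_lines = format_report(summary)
```

Calling `print("\n".join(report_lines))` shows the report. The same grouping could be written as a SQL `GROUP BY` query once the data sits in a database.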
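2015-08: The State Council issues the "Action Outline for Promoting the Development of Big Data"
2016-03: The Outline of the 13th Five-Year Plan proposes to "implement the national big data strategy"
2017-10: The 19th National Congress calls for advancing the big data strategy and integrating it deeply with the real economy
2018: The 2018 Government Work Report proposes carrying out a big data development initiative and stresses using the internet, big data, and similar tools to improve regulatory effectiveness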
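In 2019, the Ministry of Human Resources and Social Security planned to announce 15 new occupations
1. Big data engineering technicians 2. Cloud computing engineering technicians 3. Artificial intelligence engineering technicians 4. Internet of Things engineering technicians 5....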
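What Is Big Data
Big data refers to data sets that cannot be acquired, managed, and processed within a given time frame using traditional, commonly used software techniques and tools.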
Main Characteristics of Big Data
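ETL stands for Extract-Transform-Load.
Extract -> obtain data from the various data sources
Transform -> convert the source data into target data in the required format
Load -> load the target data into the data warehouse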
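The three ETL steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production pipeline: the CSV text `RAW_CSV` stands in for a real data source, an in-memory SQLite database stands in for the data warehouse, and the `purchases` table and the function names are invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with inconsistent formatting and one bad value.
RAW_CSV = """user_id,city,amount
1, Beijing ,12.50
2,shanghai,7
3,Beijing,not_a_number
"""


def extract(text):
    """Extract: read raw rows from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize city names, convert types, drop unparsable rows."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        cleaned.append((int(row["user_id"]), row["city"].strip().title(), amount))
    return cleaned


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases "
        "(user_id INTEGER, city TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT user_id, city, amount FROM purchases ORDER BY user_id"
).fetchall()
```

Keeping the three steps as separate functions mirrors the Extract, Transform, Load split and lets each step be tested or replaced on its own, for example by swapping SQLite for MySQL.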
Big Data Collection Systems
2. Web data collection systems (the Scrapy framework, Apache Nutch)
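A web data collection system has to pull structured items, such as links, out of fetched pages. The sketch below shows that one step using only Python's standard library. It is not Scrapy or Nutch code and performs no network requests. The `LinkExtractor` class, the sample HTML, and the `example.com` base URL are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


# Made-up page content standing in for a downloaded HTML document.
SAMPLE_HTML = """
<html><body>
  <a href="/news/1.html">First</a>
  <a href="https://example.com/about">About</a>
  <a name="no-href">Anchor only</a>
</body></html>
"""

extractor = LinkExtractor("https://example.com/")
extractor.feed(SAMPLE_HTML)
links = extractor.links
```

A crawler framework adds what this sketch leaves out: downloading pages, scheduling the extracted links for later visits, and respecting crawl-rate limits.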
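1. What is data collection: Data collection means data acquisition. Data sources fall mainly into online data and content data.
2. How data collection differs from big data collection: Traditional data collection has a single source, a relatively small volume, and a single structure, and relies on relational and parallel databases. Big data collection has wide-ranging sources, a huge volume, and rich data types, and relies on distributed databases.
3. Big data collection methods: Big data collection amounts to performing ETL on the data: extracting, transforming, and loading it, with the ultimate aim of uncovering the data's latent value.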
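Data preprocessing: imports the data from the various source databases into a large distributed store (currently mainly HDFS or Hive), with some simple cleaning and preprocessing along the way.
Data mining: runs algorithm-based analysis and computation over the data, which makes prediction possible and serves data analysis needs.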
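The "simple cleaning" done during preprocessing usually means steps such as trimming whitespace, normalizing case, dropping incomplete records, and removing duplicates. A minimal sketch of those steps follows. The record layout with `name` and `email` fields and the `clean_records` function are hypothetical, chosen only to keep the example small.

```python
def clean_records(records):
    """Trim and normalize fields, drop incomplete rows, and de-duplicate by email."""
    seen_emails = set()
    cleaned = []
    for rec in records:
        # Treat missing (None) values like empty strings.
        name = (rec.get("name") or "").strip()
        email = (rec.get("email") or "").strip().lower()
        if not name or not email:
            continue  # incomplete record
        if email in seen_emails:
            continue  # duplicate of an earlier record
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


# Made-up raw input with stray whitespace, a duplicate, and missing values.
raw = [
    {"name": " Alice ", "email": "Alice@Example.com"},
    {"name": "Alice", "email": "alice@example.com "},
    {"name": "", "email": "nobody@example.com"},
    {"name": "Bob", "email": None},
    {"name": "Carol", "email": "carol@example.com"},
]

cleaned = clean_records(raw)
```

Normalizing before de-duplicating matters here: the first two rows only count as duplicates once case and whitespace differences have been removed.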
The Concept of Big Data Collection