
Introduction to the Hadoop Big Data Platform

What is Hadoop
Apache Hadoop is an open source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware.
Origin of the Hadoop name
Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
Doug Cutting named the project after his son's toy elephant.
From moving data to moving computation
Core design principles of Hadoop
•Scalability
•Reliability
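Architecture shift relative to traditional BI
[Figure: architecture diagram. Only its labels survive; they are grouped here by apparent layer.]
•Data sources: traditional files, logs, social and web data, legacy systems, sensors, audio and video data (structured and unstructured)
•Data exchange platform: Kafka, Flume, GoldenGate, SharePlex
•Data storage and compute platform: Hadoop nodes for storage/compute, plus a real-time data processing channel (Spark, Storm)
•Data access layer (export/import): data warehouse, data marts, non-relational (NoSQL) databases, in-memory databases
•Consumers (insights): spreadsheets, visualization tools, data mining, integrated development tools, enterprise application tools, data applications, web apps, mashups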
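Use cases suited to Hadoop
[Figure: quadrant chart of data volume (Data) against compute load (Compute), with real-time requirements as a further label.]
•Small data + small compute: OLTP business systems such as ERP/CRM/EDA
•Big data + small compute: for example full-text search and traditional ETL
•Small data + large compute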
•Hadoop Common
•Hadoop Distributed File System (HDFS)
•Hadoop YARN
•Hadoop MapReduce
HDFS
Hadoop Distributed File System
A distributed, scalable, and portable file system written in Java for the Hadoop framework.
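To make "distributed" concrete, the following is a minimal sketch of how a file could be cut into fixed-size blocks and each block copied to several nodes. The 128 MB block size and replication factor of 3 are assumed here as commonly used defaults. The node names and the round-robin placement are hypothetical stand-ins, since real HDFS placement is rack-aware. The functions are illustrative and not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MB)
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes in bytes of the blocks a file of file_size bytes occupies."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is only a stand-in for the real, rack-aware policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# A hypothetical 300 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Because every block lives on several nodes, losing one node does not lose data, and computation can be scheduled on whichever node already holds a block.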
HDFS
MapReduce
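The MapReduce programming model can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases using the classic word-count task. It does not use the Hadoop API, and the function names are chosen here for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves data across the network. Here everything runs in one process.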
YARN
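Main differences between MapReduce in Hadoop 1.0 and 2.0
YARN
A resource manager that efficiently manages the compute resources in a cluster. YARN can be combined with frameworks other than Hadoop. Besides YARN, Mesos is another resource manager currently on the market.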
Hadoop ZOO
Zoo member 1: Sqoop
Apache Sqoop
•Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
HBASE
•Column-oriented database management system
•Key-value store
•Based on Google Bigtable
•Can hold extremely large data sets
•Dynamic data model
•Not a relational DBMS
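The combination of "key-value store", "column-oriented", and "dynamic data model" can be pictured as a nested map from row key to column to timestamped values. The class below is a hypothetical toy model of that idea. It is not the HBase client API, and its names (`TinyColumnStore`, `put`, `get`, `scan`) are chosen here for illustration.

```python
from collections import defaultdict


class TinyColumnStore:
    """Toy model of a column-oriented key-value table (not an HBase API)."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        """Store one version of a cell; columns need no prior declaration."""
        self.rows[row_key][column].append((timestamp, value))

    def get(self, row_key, column):
        """Return the value with the newest timestamp, or None if absent."""
        if row_key not in self.rows or column not in self.rows[row_key]:
            return None
        versions = self.rows[row_key][column]
        return max(versions, key=lambda v: v[0])[1]

    def scan(self):
        """Yield rows in sorted row-key order with each column's latest value."""
        for row_key in sorted(self.rows):
            yield row_key, {c: self.get(row_key, c) for c in self.rows[row_key]}


table = TinyColumnStore()
table.put("user1", "info:name", "Ann", timestamp=1)
table.put("user1", "info:name", "Anna", timestamp=2)
table.put("user2", "info:email", "bob@example.com", timestamp=1)
print(list(table.scan()))
```

Note that `user1` and `user2` carry different columns. No schema change is needed to add a column to a single row, which is what distinguishes this model from a relational table.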
PIG
•Originally developed at Yahoo in 2006
•High-level programming on top of Hadoop MapReduce
•The language: Pig Latin
•Expresses data analysis problems as data flows
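The "data flow" style means that an analysis is written as a chain of named steps, each transforming the output of the previous one. The sketch below imitates that style in plain Python. It is not Pig Latin; the comments only name the Pig Latin operators each step loosely corresponds to, and the input records are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input records: (user, page, seconds_on_page)
records = [
    ("ann", "home", 12),
    ("bob", "home", 3),
    ("ann", "docs", 40),
    ("bob", "docs", 25),
    ("ann", "home", 7),
]

# Step 1 (similar in spirit to FILTER): keep visits longer than 5 seconds.
long_visits = [r for r in records if r[2] > 5]

# Step 2 (similar in spirit to GROUP ... BY): group visits by page.
# groupby needs its input sorted by the grouping key.
by_page = groupby(sorted(long_visits, key=itemgetter(1)), key=itemgetter(1))

# Step 3 (similar in spirit to FOREACH ... GENERATE): total seconds per page.
totals = {page: sum(r[2] for r in group) for page, group in by_page}
print(totals)
```

Each intermediate name (`long_visits`, `by_page`, `totals`) plays the role of a relation in a data flow, so the script reads as a pipeline, not as one monolithic query.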
Apache Hive
•Data warehouse software that facilitates querying and managing large datasets residing in distributed storage
•SQL-like language
•Facilitates querying and managing large datasets in HDFS
•Provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL
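The idea of projecting structure onto raw data and then querying it declaratively can be illustrated without a Hadoop cluster. The sketch below uses Python's built-in `sqlite3` module and standard SQL as a stand-in. It is not Hive or HiveQL, and the log lines, table name, and column names are hypothetical.

```python
import sqlite3

# Hypothetical raw log lines, as they might sit in a plain text file.
raw_lines = [
    "2024-01-01\tann\t200",
    "2024-01-01\tbob\t404",
    "2024-01-02\tann\t200",
]

# Project a structure (day, username, status) onto the raw lines.
rows = [tuple(line.split("\t")) for line in raw_lines]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (day TEXT, username TEXT, status INTEGER)")
conn.executemany("INSERT INTO access_log VALUES (?, ?, ?)", rows)

# Query the structured view declaratively instead of writing a custom job.
result = conn.execute(
    "SELECT username, COUNT(*) FROM access_log "
    "WHERE status = 200 GROUP BY username ORDER BY username"
).fetchall()
print(result)
conn.close()
```

In Hive the underlying files stay in distributed storage and the query is compiled into cluster jobs. The stand-in above only shows the user-facing idea of a table definition plus a SQL-like query.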
Oozie
•Workflow scheduler system to manage Apache Hadoop jobs
•Oozie Coordinator jobs (recurrent workflow jobs triggered by time and data availability)
•Supports MapReduce, Pig, Apache Hive, Sqoop, etc.
ZooKeeper
•Provides operational services for a Hadoop cluster
•Centralized service for:
•maintaining configuration information
•naming services
•providing distributed synchronization
•providing group services
Flume
•Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
Kafka
Impala
Spark
Storm
