A Real-Time Data Quality Monitoring Platform Based on Kafka and Spark
Data quality assurance upstream and downstream of Kafka
[Diagram: data flows through multiple Kafka + HLC stages to multiple destinations]
100K QPS, 300 Gb per hour
Data == Money
Lost Data == Lost Money
How it works: a brief introduction
3 audit granularities • file level • batch level • record level
• Based on historical data, define a "new value strangeness"
• At time t, we receive a new value
• Add it to the history; for each item i in the history:
  s[i] = strangeness function of (value[i], history)
  (the construction continues with the p-value formula below)
Batch Processing
Rapidly growing real-time data
1.3 million
EVENTS PER SECOND INGRESS AT PEAK
~1 trillion
EVENTS PER DAY PROCESSED AT PEAK
3.5 petabytes
PROCESSED PER DAY
100 thousand
UNIQUE DEVICES AND MACHINES
• Monitor the completeness and latency of streaming data
• Cover multi-producer, multi-stage, multi-destination data flows in the pipeline
• In near real time
• Provide diagnostics: which DC, machine, or event/file went wrong
• Extremely stable: 99.9% uptime
• Scale out
• Trustworthy audit data
• Let p[t] = (#{i: s[i] > s[t]} + r · #{i: s[i] == s[t]}) / N, where r is uniform in (0,1)
• The uniform r makes sure p is uniform (a code sketch follows)
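
The strangeness/p-value construction above can be turned into an online detector. A minimal Python sketch, assuming a median-distance strangeness function and the standard power martingale; both are illustrative choices that the deck does not specify:

    import random

    def strangeness(value, history):
        # One simple strangeness function (an assumption; the deck
        # leaves it open): distance from the median of the history.
        med = sorted(history)[len(history) // 2]
        return abs(value - med)

    def p_value(history):
        # p[t] = (#{i: s[i] > s[t]} + r * #{i: s[i] == s[t]}) / N,
        # with r uniform in (0,1), so p is uniform under exchangeability.
        s = [strangeness(v, history) for v in history]
        s_t = s[-1]                      # strangeness of the newest value
        r = random.random()
        greater = sum(1 for x in s if x > s_t)
        equal = sum(1 for x in s if x == s_t)
        return (greater + r * equal) / len(s)

    def power_martingale(p_values, eps=0.92):
        # Exchangeability martingale M = prod(eps * p**(eps - 1)).
        # M stays small while the p-values are uniform and grows once
        # the distribution of the stream changes.
        m = 1.0
        for p in p_values:
            m *= eps * p ** (eps - 1)
        return m

In use, each new observation is appended to the history, a p-value is computed, and an alarm fires once the martingale exceeds a threshold C; by Ville's inequality the false-alarm probability is at most 1/C while the stream stays exchangeable.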
Anomaly detection algorithm 2
Anomaly detection algorithm 3
Design overview
Design goals of the data monitoring system
A Real-Time Data Quality Monitoring Platform Based on Kafka and Spark
Technical innovation, transforming the future
What problem are we solving?
Data flow
Devices
Services
Interactive analytics
Applications
Kafka as data bus
Scalable pub/sub for NRT data streams
Stream Processing
1,300
PRODUCTION KAFKA BROKERS
1 Sec
99th PERCENTILE LATENCY
The audit records are really metadata about the data itself; they can be used for all kinds of traffic anomaly detection and monitoring on the data flows.
Anomaly detection algorithm 1
The Holt-Winters algorithm
• Used to train the model and make forecasts
• Robustness improvements:
  • Use the Median Absolute Deviation (MAD) for better estimates
  • Handle missing data points and noise (e.g., via data smoothing)
• Automatically capture trend and seasonality
• Let users label data and give feedback to better handle trend changes (a code sketch follows this list)
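
A minimal numpy sketch of the idea, assuming additive trend and seasonality and at least two full periods of history. Function names and parameter values are illustrative, not the production implementation:

    import numpy as np

    def holt_winters_additive(y, period, alpha=0.3, beta=0.05, gamma=0.1):
        # One-step-ahead forecasts from additive Holt-Winters
        # (level + trend + seasonality). Assumes y is a 1-D numpy
        # array with at least two full periods of history.
        level = y[:period].mean()
        trend = (y[period:2 * period].mean() - y[:period].mean()) / period
        season = list(y[:period] - level)
        preds = []
        for t, value in enumerate(y):
            preds.append(level + trend + season[t % period])
            prev_level = level
            level = alpha * (value - season[t % period]) + (1 - alpha) * (level + trend)
            trend = beta * (level - prev_level) + (1 - beta) * trend
            season[t % period] = gamma * (value - level) + (1 - gamma) * season[t % period]
        return np.array(preds)

    def mad_anomalies(y, preds, threshold=3.5):
        # Score residuals with the Median Absolute Deviation (MAD),
        # which is far less sensitive to outliers than the standard
        # deviation; 0.6745 makes the score comparable to a z-score.
        resid = y - preds
        mad = np.median(np.abs(resid - np.median(resid)))
        score = 0.6745 * (resid - np.median(resid)) / (mad + 1e-9)
        return np.abs(score) > threshold

Scoring the forecast residuals with MAD rather than the standard deviation is what gives the robustness the slide mentions: a single spike inflates the standard deviation and masks itself, while the median-based scale barely moves.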
Kibana chart of data latency
Kibana chart of data completeness
3 lines • green: how many records were produced • blue: how many reached destination #1 • a third line: how many reached destination #2
Richer charts built on Power BI
Monitoring a 4-stage real-time data processing pipeline
Code for sending an audit
1. Create a client object
2. Prepare the audit object
3. Lastly, send it:
   client.SendBondObject(audit);
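
The deck's snippet (client.SendBondObject(audit)) is C#/Bond-style. A hypothetical Python equivalent that publishes the same audit record to a Kafka audit topic might look like the sketch below; the broker address, topic name, and field values are placeholders:

    import json
    from datetime import datetime, timezone
    from kafka import KafkaProducer  # pip install kafka-python

    # 1. Create a client object (broker address is a placeholder).
    producer = KafkaProducer(
        bootstrap_servers="kafka-broker:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # 2. Prepare the audit object, following the schema shown in this deck.
    audit = {
        "Action": "Produced",
        "ActionTimeStamp": datetime.now(timezone.utc).isoformat(),
        "Environment": "cluster-01",   # placeholder
        "Machine": "producer-42",      # placeholder
        "StreamID": "ducks",
        "SourceID": "file1",
        "BatchID": "adbecf415263",
        "NumBytes": 2440,
        "NumRecords": 35,
        "DestinationID": "",
    }

    # 3. Lastly, send it ("audit" is a placeholder topic name).
    producer.send("audit", value=audit)
    producer.flush()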
APIs for querying the statistics
Anomaly detection based on the audit data
Metadata
{
  "Action": "Produced or Uploaded",
  "ActionTimeStamp": "action date and time (UTC)",
  "Environment": "environment (cluster) name",
  "Machine": "computer name",
  "StreamID": "type of data (sheep, ducks, etc.)",
  "SourceID": "e.g. file name",
  "BatchID": "a hash of the data in this batch",
  "NumBytes": "size in bytes",
  "NumRecords": "number of records in the batch",
  "DestinationID": "destination ID"
}
GLR (Generalized Likelihood Ratio)
• Floating Threshold GLR: the model adapts dynamically as new input data arrives (a sketch follows)
• Outlier removal for noisy data
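
A sketch of the underlying GLR test for a single mean shift in a Gaussian window. The "floating threshold" function is my reading of the slide; the deck does not spell out the exact rule:

    import numpy as np

    def glr_mean_shift(x, sigma):
        # Generalized Likelihood Ratio statistic for a single mean
        # shift in a Gaussian series with known noise level sigma.
        # Returns the best candidate changepoint and its score.
        n = len(x)
        best_k, best_score = None, 0.0
        for k in range(1, n):
            m1, m2 = x[:k].mean(), x[k:].mean()
            score = k * (n - k) * (m1 - m2) ** 2 / (2.0 * n * sigma ** 2)
            if score > best_score:
                best_k, best_score = k, score
        return best_k, best_score

    def floating_threshold(recent_scores, quantile=0.99):
        # A "floating" alarm threshold re-estimated from recent scores,
        # so the detector adapts as new data arrives (an assumed rule).
        return np.quantile(np.asarray(recent_scores), quantile)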
Anomaly detection algorithm 2
• Online anomaly detection for time series, based on exchangeability martingales (the strangeness / p-value construction sketched earlier)
• Question: has the distribution changed?
[Diagram: a Producer in a Data Center writes File 1 containing Records 1-5 and emits an audit event: Produced, 2440 bytes, 35 records, timestamp, "File 1", BatchID=adbecf415263]
How it works – data and audit flow
Kafka + HLC under audit
Uploaded: 24 bytes, 3 records, timestamp, BatchID, Destination 1
Audit system: Produced, file 1: 53 records; Uploaded, file 1: 3 records
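
The completeness check this flow enables (far fewer records uploaded than produced signals data loss) can be sketched as a simple aggregation over audit records. The production system does this on Spark; this standalone Python version is only illustrative and ignores per-destination breakdowns:

    from collections import defaultdict

    def completeness_report(audits):
        # Aggregate audit records (dicts following the schema above)
        # and compare records produced against records uploaded per file.
        produced = defaultdict(int)
        uploaded = defaultdict(int)
        for a in audits:
            key = (a["StreamID"], a["SourceID"])
            if a["Action"] == "Produced":
                produced[key] += a["NumRecords"]
            else:  # "Uploaded"
                uploaded[key] += a["NumRecords"]
        for key, n_produced in produced.items():
            n_uploaded = uploaded.get(key, 0)
            if n_uploaded < n_produced:
                yield key, n_produced, n_uploaded  # possible data loss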