Big Data and Data Mining Training Notes: Deviation Detection
Summarization and Deviation Detection: What Is New?
Outline
▪ Summarization
▪ KEFIR – Key Findings Reporter
▪ WSARE – What is Strange About Recent Events
▪ GTE is self-insured for medical costs
▪ GTE healthcare costs – $X00,000,000
▪ Task: Analyze employee health care data and generate a report that describes the major problems
▪ Publication: C. Matheus, G. Piatetsky-Shapiro, and D. McNeill, "Selecting and Reporting What is Interesting: The KEFIR Application to Healthcare Data," in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996
The large increase in m1 in group s1 was caused by an increase in m3, which was caused by a rise in m5, primarily in sector s13.
Report Generation
▪ Automatic generation of business-user-oriented reports
▪ KEFIR received GTE’s highest award for technical achievement in 1995
▪ Key business user left GTE in 1996 and the system was no longer used
Recommendations
Hierarchical recommendation rules define appropriate intervention strategies for important measures and study areas.
Example:
If measure = admission rate per 1000 & study_area = Inpatient admissions & percent_change > 0.10
Then Utilization review is needed in the area of admission certification.
Expected Savings: 20%
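The example rule (admission rate per 1000, inpatient admissions, percent change above 0.10) can be written as a small predicate. This is a minimal sketch, not KEFIR's actual rule engine: the `Finding` and `Recommendation` classes and the `recommend` function are hypothetical names, and the hierarchical organization of rules is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    measure: str
    study_area: str
    percent_change: float  # 0.10 means a 10% increase


@dataclass
class Recommendation:
    action: str
    expected_savings: float  # fraction of the deviation expected to be saved


def recommend(finding: Finding) -> Optional[Recommendation]:
    """Apply the single example rule; return None when it does not fire."""
    if (
        finding.measure == "admission rate per 1000"
        and finding.study_area == "Inpatient admissions"
        and finding.percent_change > 0.10
    ):
        return Recommendation(
            action="Utilization review is needed in the area of admission certification.",
            expected_savings=0.20,
        )
    return None
```

A finding with a 15% rise in the admission rate triggers the rule, while a 5% rise does not.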
Explanation
A measure is explained by finding the path of related measures with the highest impact
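One way to read "the path of related measures with the highest impact" is a greedy walk: starting from the measure to be explained, repeatedly follow the related measure with the largest impact. The sketch below makes that assumption; the `related` and `impact` tables and all numbers are hypothetical, and KEFIR's actual search may differ.

```python
def explain(measure, related, impact):
    """Greedily follow the related measure with the largest impact at each step."""
    path = [measure]
    seen = {measure}
    while True:
        candidates = [m for m in related.get(path[-1], []) if m not in seen]
        if not candidates:
            return path
        best = max(candidates, key=lambda m: impact[m])
        path.append(best)
        seen.add(best)


# Hypothetical measure relationships and impacts (in dollars).
related = {"m1": ["m2", "m3"], "m3": ["m4", "m5"]}
impact = {"m2": 10_000, "m3": 80_000, "m4": 5_000, "m5": 60_000}

print(" -> ".join(explain("m1", related, impact)))  # m1 -> m3 -> m5
```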
Interestingness of Deviations
▪ Impact: how much the deviation affects the bottom line
▪ Savings Percentage: how much of the deviation from the norm can be expected to be saved by the action
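A natural way to combine the two quantities is to multiply them, which gives the savings one could expect from acting on a finding. The sketch below assumes that product as the interestingness score; the measure names other than the admission rate and all dollar figures are made up for illustration.

```python
def interestingness(impact, savings_percentage):
    """Expected savings: impact on the bottom line times the recoverable fraction."""
    return impact * savings_percentage


# (measure, impact in dollars, savings percentage) -- hypothetical values.
findings = [
    ("average length of stay", 800_000.0, 0.05),
    ("admission rate per 1000", 500_000.0, 0.20),
    ("payments per case", 300_000.0, 0.10),
]

# Rank findings so the most interesting one comes first.
ranked = sorted(findings, key=lambda f: interestingness(f[1], f[2]), reverse=True)
for measure, impact, pct in ranked:
    print(f"{measure}: {interestingness(impact, pct):,.0f}")
```

Under this scoring, a finding with a large impact but little recoverable savings can rank below a smaller but more actionable one.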
▪ Convert findings to a user-friendly report with text and graphics
KEFIR Search Space
Drill-Down Example
What Change Is Important?
Deviation Detection
▪ Drill down through the search space
▪ Generate a finding for each measure:
  ▪ deviation from previous period
  ▪ deviation from norm
  ▪ deviation projected for next period, if no action
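The three deviation types listed above can be computed for one measure as follows. This is a sketch under stated assumptions: deviations are expressed as relative changes, and the next-period projection simply continues the last period-to-period change, which is an illustrative choice rather than KEFIR's documented method. The function name `deviations` and the sample values are hypothetical.

```python
def deviations(previous, current, norm):
    """Return the three deviations for one measure as relative changes."""
    # Assumption: if no action is taken, the last change repeats next period.
    projected = current + (current - previous)
    return {
        "from_previous": (current - previous) / previous,
        "from_norm": (current - norm) / norm,
        "projected_from_norm": (projected - norm) / norm,
    }


# Hypothetical measure: 100 last period, 120 this period, norm of 110.
finding = deviations(previous=100.0, current=120.0, norm=110.0)
for name, value in finding.items():
    print(f"{name}: {value:+.1%}")
```

Running the same function for every measure in every sector of the search space yields one finding per measure, which is the input to the interestingness ranking.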
GTE Key Findings Reporter: KEFIR
▪ KEFIR Approach:
  ▪ Analyze all possible deviations
  ▪ Select interesting findings
  ▪ Augment key findings with:
    ▪ Explanations of plausible causes
    ▪ Recommendations of appropriate actions
▪ Focus on what is actionable!
Problem: Healthcare Costs
▪ Healthcare costs in the US: 1 out of every 7 GDP dollars, and rising
▪ Potential problems: fraud, misuse, …
▪ Understanding where the problems are is the first step to fixing them
▪ Natural language generation with template matching
▪ Graphics
▪ Delivered via browser
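Template-based generation can be sketched as choosing a sentence pattern that matches the shape of a finding and filling its slots. In the minimal example below, the template is selected by the number of causes in the explanation chain; the `TEMPLATES` table and the `render` function are hypothetical and only illustrate the idea.

```python
# Sentence patterns keyed by the number of causes in the explanation chain.
TEMPLATES = {
    1: (
        "The large increase in {measure} in group {group} was caused by "
        "an increase in {c1}, primarily in sector {sector}."
    ),
    2: (
        "The large increase in {measure} in group {group} was caused by "
        "an increase in {c1}, which was caused by a rise in {c2}, "
        "primarily in sector {sector}."
    ),
}


def render(measure, group, causes, sector):
    """Pick the template matching the chain length and fill in its slots."""
    template = TEMPLATES[len(causes)]  # KeyError if no template matches
    slots = {f"c{i}": cause for i, cause in enumerate(causes, start=1)}
    return template.format(measure=measure, group=group, sector=sector, **slots)


print(render("m1", "s1", ["m3", "m5"], "s13"))
```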
Sample KEFIR pages
Overview Inpatient admissions
Status
▪ Prototype implemented at GTE in 1995
Summarization
▪ Concisely summarize what is new, different, or unexpected:
  ▪ with respect to previous values
  ▪ with respect to expected values
  ▪ …