数据分析技术
• Application ‒ Executive compensations (outcome) as a function of company operations performance KPIs, industry, location, and category (drivers). ‒ Predicting income as a function of number of years of education, age and gender (drivers).
Theory
Business questions: • Regression explores how the value of the dependent variable (also referred to as outcome) changes when any one of the independent variables changes (also referred as drivers), while the other independent variables are held fixed. • Regression focuses on the relationship between the outputs and the inputs. It also provides a model that has some explanatory value, in addition to predicting outcomes. The outcome can be continuous or discrete and when it is discrete we are predicting the probability that the outcome will occur. • I want to predict the lifetime value of this customer and understand what drives LTV. What drives the LTV higher or lower? • I want to predict the probability that this loan will default and understand what drives default Two type of methods: • Linear regression
• Business questions:
‒ How do I group these documents by topic? ‒ How do I perform customer segmentation to allow for targeted or special marketing programs.
Method Example • An example of predicting mortgage foreclosure given delinquency rates. The points that are bunched closer to the line (x=y , perfect prediction) are indications of good prediction.
• Input variables can be continuous or discrete.
• Output ‒ A set of coefficients that indicate the relative impact of each drivers
‒ A linear expression for predicting outcome as a function of drivers.
3
Clustering An Overview of Clustering
Clustering is a popular method used to form homogenous groups within a data set based on their internal structure to discover groups such that samples within a group are more similar to each other than samples across groups.
Regression
Linear Regression Logistic Regression
I want to predict the lifetime value of this customer. I want to predict the probability that this loan will default.
representing a cluster.
Hale Waihona Puke Step2 For each record in data, calculate the squared Euclidean distances between it and the means. Assign the record to the cluster whose mean is the nearest to the record.
Method • Input variables can be continuous or discrete. • Output ‒ A set of coefficients that indicate the relative impact of each drivers ‒ A linear expression for predicting the log-odds ratio of outcome as a function of drivers. (Binary classification case) • Application (It is the preferred method for many binary classification problems) ‒ Probability of true/false ‒ Probability to approve/deny ‒ Probability to purchase from a website/no purchase Example • Estimates the probability that a borrower will default. The graph compares the distribution of defaulters(blue) and non defaulters(red) as a function of model’s predicted probability for borrowers scoring > 0.1 and < 0.98.
Features
‒ Not a predictive method, to find similarities or relationships.
• Example: K-means Clustering (used for clustering numerical data)
‒ Input: there must be a distance metric defined over the variable space. (Euclidian distance) ‒ Output: the centers of each discovered cluster, and the assignment of each input datum to a cluster. (Centroid)
“大数据时代” 人力资源管理创新研讨会
德勤 人力资本咨询
数据分析技术
2
An Overview of Analytics Theory and Methods What Kind of Business Problems to be Solved?
This table lists the typical business questions addressed by a category of techniques theory or analytical methods
Copyright © 2013 Deloitte Consulting. All rights reserved.
3
Regression (cont.) Logistic Regression
Logistic regression is used to estimate the probability that an event will occur as a function of other variables. An example is that the probability that a borrower will default as a function of his credit score , income, loan size, and his current debts.
Classification
Decision Trees
Where in the catalog should I place this product? I want to assign labels to objects?
Time Series Analysis
ARMA ARIMA
What is the likely future price of this stock? What will my sales volume be next month?
Methods
• Logistics regression
Copyright © 2013 Deloitte Consulting. All rights reserved.
3
Regression (cont.) Linear Regression