Chapter 9: Regression Analysis
Coefficients:
Least-squares estimates of the parameters a and b
A good line is one that minimizes the sum of squared differences between the points and the line.
From the derivation:
a = ȳ - b·x̄
b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
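The least-squares formulas a = ȳ - b·x̄ and b = Σ(x - x̄)(y - ȳ)/Σ(x - x̄)² can be computed directly. A minimal NumPy sketch; the arrays x and y are hypothetical example data, not the lecture's dataset:

```python
import numpy as np

# Hypothetical example data: x = predictor, y = outcome
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# b = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# a = ȳ - b·x̄
a = y.mean() - b * x.mean()

print(a, b)  # for this data, a ≈ 0.09 and b ≈ 1.99
```

The same estimates come out of `np.polyfit(x, y, 1)`, which solves the identical least-squares problem.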
Multiple Regression
R2adj - “adjusted R-square”
R2 is affected by the ratio of the number of independent variables to the sample size (k:n); a ratio of at least 1:10 (ten or more cases per predictor) is generally desirable. When this ratio falls below 1:5, R2 tends to overestimate the actual degree of fit. R2adj takes into account the number of regressors in the model.
[Venn diagram: r2 as the overlap between the variation in X and the variation in Y]
Simple Regression
R2 - “Goodness of fit”
For simple regression, R2 is the square of the correlation coefficient
Reflects variance accounted for in data by the best-fit line
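The claim that, for simple regression, R2 equals the squared correlation coefficient is easy to verify numerically. A small Python sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 4.0, 6.0, 5.0])

# Pearson correlation r between x and y
r = np.corrcoef(x, y)[0, 1]

# Fit the best-fit line, then compute R2 = SSR / SSt
b, a = np.polyfit(x, y, 1)           # slope, intercept
y_hat = a + b * x
ss_total = np.sum((y - y.mean()) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)
r_squared = ss_reg / ss_total

print(r_squared, r ** 2)  # the two values agree
```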
Chapter 9: Multiple Regression Analysis
Department of Psychology, College of Education, Zhejiang Normal University
徐长江 (Xu Changjiang) xucj@
Outline
Basic principles of regression analysis
Simple regression analysis
Multiple regression analysis
Methods of multiple regression analysis
Implementing multiple regression analysis
Purpose of regression analysis
To find the quantitative dependence between variables and express it as a functional relationship.
Example: Height vs Weight
Takes values between 0 (0%) and 1 (100%). Frequently expressed as a percentage rather than a decimal.
Simple Regression
Low values of R2
How well does a model explain the variation in the dependent variable?
Effectiveness vs Efficiency
Effectiveness: maximises R2
[Two scatterplots: Drug A (dose in mg), good fit, high R2, high variance explained; Drug B (dose in mg), moderate fit, lower R2, less variance explained]
Example
The dataset t2_1.sav contains per-capita annual food expenditure and per-capita annual income for households in each region of China. With food expenditure as the dependent variable and per-capita annual income as the independent variable, build a regression equation.
Calculated as:
R2adj = 1 - (1 - R2)(n - 1)/(n - k - 1)
where:
n = number of data points
k = number of regressors
Note that R2adj will always be smaller than R2.
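The adjusted R2 formula R2adj = 1 - (1 - R2)(n - 1)/(n - k - 1) can be sketched as a small Python function; the example values R2 = 0.80, n = 50, k = 3 are illustrative only:

```python
def adjusted_r_squared(r2, n, k):
    """R2adj = 1 - (1 - R2)(n - 1)/(n - k - 1),
    where n = number of data points, k = number of regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical example: R2 = 0.80 with n = 50 observations, k = 3 regressors
r2_adj = adjusted_r_squared(0.80, 50, 3)
print(r2_adj)  # ≈ 0.787, always smaller than R2 itself
```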
Hypotheses
H0: β1 = 0, H1: β1 ≠ 0
If H0 holds, y and x cannot be considered to have a linear relationship.
Three test methods: the F test, the t test, and the r test.
Analysis of variance for the simple linear regression equation
ŷ = a + bx
(y - ȳ) = (ŷ - ȳ) + (y - ŷ)
Summing over the data: Σ(y - ȳ)² = Σ(ŷ - ȳ)² + Σ(y - ŷ)²
Σ(ŷ - ȳ)² is the part of the total sum of squares (total variation) explained by the linear relationship between x and y; it is denoted SSR.
Σ(y - ŷ)² is the sum of squared deviations from the regression line, the quantity the least-squares method minimizes when fitting the regression equation; it is called the error (or residual) sum of squares, denoted SSe.
Testing for Significance: F Test
ŷ = a + bx
where:
a = intercept (constant)
b = slope of the best-fit line
Regression coefficient
R2 = 0 (0%): randomly scattered points, no apparent relationship between X and Y. Implies that a best-fit line will be a very poor description of the data.
r² = Σ(ŷ - ȳ)² / Σ(y - ȳ)²
That is, the square of the correlation coefficient equals the proportion of the total sum of squares accounted for by the regression sum of squares. This proportion of variation shared by the two variables is called the coefficient of determination (R square). It represents the predictive power of X for Y, i.e. the proportion of variance in Y explained by the independent variable, and reflects the goodness of fit of the linear regression model formed by the independent and dependent variables. Whether this value, and hence the regression and its predictive power, is statistically significant must be judged with an F test.
Source       SS                df           MS
Regression   SSR = Σ(ŷ - ȳ)²   dfR = 1      MSR = SSR / dfR
Error        SSe = Σ(y - ŷ)²   dfe = N - 2  MSe = SSe / dfe
Total        SSt = Σ(y - ȳ)²   dft = N - 1
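The ANOVA decomposition SSt = SSR + SSe can be checked numerically. A Python sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0, 7.0])
n = len(y)

b, a = np.polyfit(x, y, 1)   # least-squares slope and intercept
y_hat = a + b * x

ss_reg = np.sum((y_hat - y.mean()) ** 2)   # SSR, dfR = 1
ss_err = np.sum((y - y_hat) ** 2)          # SSe, dfe = N - 2
ss_tot = np.sum((y - y.mean()) ** 2)       # SSt, dft = N - 1

# Mean squares from the ANOVA table
ms_reg = ss_reg / 1
ms_err = ss_err / (n - 2)

print(ss_tot, ss_reg + ss_err)  # the decomposition holds: SSt = SSR + SSe
```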
Testing for Significance: t Test
Hypotheses
H0: β1 = 0, H1: β1 ≠ 0
Multiple Regression
R2 - “Goodness of fit”
For multiple regression, R2 will get larger every time another independent variable (regressor or predictor) is added to the model. A new regressor may provide only a tiny improvement in the amount of variance explained, so we need to establish the value of each additional regressor in predicting the DV.
[Two scatterplots of Symptom Index against dose: Drug A (dose in mg), very good fit; Drug B (dose in mg), moderate fit]
Testing the validity of the regression equation
For any set of data (xi, yi) (i = 1, 2, …, n), a linear function can be fitted by the least-squares method. But is there really a relationship between y and x close to a linear function? A hypothesis test is still needed.
Simple Regression
R2 - “Goodness of fit”
[Scatterplots: Symptom Index against dose]
Strong positive correlation between height and weight. The correlation shows how the relationship works, but it does not by itself let us predict one variable from the other.
Graph One: Relationship between Height and Weight
Multiple Regression
Establish equation for the best-fit line: y = b1x1 + b2x2 + b3x3 + a
Where:
b1 = regression coefficient for variable x1
b2 = regression coefficient for variable x2
b3 = regression coefficient for variable x3
a = constant
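Fitting the best-fit line y = b1x1 + b2x2 + b3x3 + a amounts to solving a least-squares problem with several predictors at once. A NumPy sketch on simulated data; the true coefficients 1.5, -2.0, 0.5 and constant 4.0 are made up for illustration:

```python
import numpy as np

# Simulate hypothetical data: 30 cases, three predictors x1, x2, x3
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + 4.0 \
    + rng.normal(scale=0.1, size=30)   # small random error

# Append a column of ones so the constant a is estimated alongside b1..b3
design = np.column_stack([X, np.ones(len(y))])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
b1, b2, b3, a = coef

# Predicted values from the fitted equation
y_hat = design @ coef
print(b1, b2, b3, a)  # close to the simulated values 1.5, -2.0, 0.5, 4.0
```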
Simple Regression
High values of R2
R2 = 1 (100%): points lie directly on the line, a perfect relationship between X and Y. Implies that a best-fit line will be a very good description of the data.
Hypotheses
H0: β1 = 0, H1: β1 ≠ 0
Test statistic: F = MSR/MSE, where MSR = SSR / (number of independent variables) and MSE = SSE/(n - 2)
Rejection rule: reject H0 if F > Fα, where Fα is the critical value of the F distribution with 1 numerator degree of freedom and n - 2 denominator degrees of freedom.
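The F test (F = MSR/MSE compared against the F distribution with 1 and n - 2 degrees of freedom) can be sketched in Python, using SciPy for the critical value; the data here are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical, strongly linear data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])
n = len(y)

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
msr = ssr / 1          # one independent variable
mse = sse / (n - 2)
F = msr / mse

# Reject H0: β1 = 0 if F exceeds the F(1, n-2) critical value at α = .05
f_crit = stats.f.ppf(0.95, 1, n - 2)
p_value = stats.f.sf(F, 1, n - 2)
reject_h0 = F > f_crit
print(F, f_crit, reject_h0)
```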
Test statistic
ANOVA table for the regression equation