Harbin Institute of Technology (HIT): Mathematical Statistics Slides
Chapter 1 Summarizing Data
Methods Based on the Cumulative Distribution Function
Histograms, Density Curves and Stem-and-Leaf Plots
Measures of Location
Measures of Dispersion
Denote the ordered batch of numbers by $x_{(1)} < x_{(2)} < \cdots < x_{(n)}$; then the ecdf can be expressed as
$$F_n(x) = \begin{cases} 0, & x < x_{(1)} \\ k/n, & x_{(k)} \le x < x_{(k+1)} \\ 1, & x \ge x_{(n)} \end{cases}$$
Stem-and-Leaf Plots
Example beeswax.sas
Measures of Location
The Arithmetic Mean
For a batch of numbers $x_1, x_2, \ldots, x_n$, the most commonly used measure of location is the arithmetic mean, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
$$f_h(x) = \frac{1}{n}\sum_{i=1}^{n} w_h(x - x_i)$$
where $h$ is a chosen bandwidth.
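A minimal Python sketch of a normal-kernel density estimate, assuming the estimate is the average of normal densities with standard deviation h centered at each observation. The sample values, the bandwidth h = 2.0, and the function names are illustrative choices, not taken from the slides.

```python
import math


def normal_kernel(u, h):
    """Normal density with mean 0 and standard deviation h, evaluated at u."""
    return math.exp(-u * u / (2.0 * h * h)) / (h * math.sqrt(2.0 * math.pi))


def kernel_density(x, sample, h):
    """Average of the normal densities centered at each observation."""
    return sum(normal_kernel(x - xi, h) for xi in sample) / len(sample)


sample = [1, 14, 10, 9, 11, 9]  # illustrative values only
h = 2.0                          # bandwidth, chosen arbitrarily here

# Riemann sum over a grid from -10.0 to 30.0 (step 0.5) to check that the
# estimate integrates to approximately one.
grid = [i * 0.5 for i in range(-20, 61)]
area = sum(kernel_density(x, sample, h) for x in grid) * 0.5
print(round(area, 3))
```

A larger h gives a smoother curve and a smaller h a rougher one; the area under the estimate stays close to one in either case because each kernel is itself a density.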
Example
Beeswax: Solutions > Analysis > Interactive Data Analysis (find beeswax.sas in Work) > Analyze > Distribution > Output > Density Estimate > Normal (kernel density)
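$$w_h(x) = \frac{1}{h}\,w\!\left(\frac{x}{h}\right) = \frac{1}{h}\cdot\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x}{h}\right)^2} = \frac{1}{\sqrt{2\pi}\,h}\,e^{-\frac{x^2}{2h^2}}$$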
Let $x_1, x_2, \ldots, x_n$ be a sample from a probability density function $f$. Then $w_h(x - x_i)$ is the normal density with mean $x_i$ and standard deviation $h$. The kernel probability density estimate of $f$ is the average of these $n$ densities, each weighted by $1/n$.
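The hazard function $h(t)$ is the instantaneous rate of mortality of an individual alive at time $t$. The log of the empirical survival function is defined as
$$\log S_n(t) = \begin{cases} 0, & t < t_{(1)} \\ \log\!\left(1 - \dfrac{k}{n+1}\right), & t_{(k)} \le t < t_{(k+1)} \\ \log\!\left(1 - \dfrac{n}{n+1}\right), & t \ge t_{(n)} \end{cases}$$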
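The piecewise definition can be sketched in a few lines of Python. The failure times below are hypothetical, and the function name is introduced only for illustration. Dividing by n + 1 rather than n keeps the argument of the logarithm positive even beyond the largest observation.

```python
import math


def log_empirical_survival(t, times):
    """log S_n(t) with S_n(t) = 1 - k/(n+1), where k counts observations <= t."""
    n = len(times)
    k = sum(1 for v in times if v <= t)
    return math.log(1.0 - k / (n + 1))


times = [2.0, 3.5, 5.0, 8.0]  # hypothetical failure times
for t in [1.0, 2.0, 4.0, 8.0, 10.0]:
    print(t, log_empirical_survival(t, times))
```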
The Survival Function
If T denotes time until failure or death with cdf F , the survival function is defined as
$$S(t) = P(T > t) = 1 - F(t)$$
which is simply the probability that the lifetime will be longer than $t$. Its sample counterpart, the empirical survival function, replaces $F$ with the ecdf $F_n$.
Comparing Two Samples Using a Q-Q Plot
Are the samples $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$ from the same distribution? The empirical $k/(n+1)$ quantile of the $x$'s is $x_{(k)}$; the empirical $k/(n+1)$ quantile of the $y$'s is $y_{(k)}$. The points $(x_{(k)}, y_{(k)})$ in the plane would lie approximately on a straight line (the line $y = x$) if the samples come from the same distribution.
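The pairing of order statistics can be sketched as follows. This is a minimal illustration that assumes equal sample sizes; the two samples are hypothetical, and `qq_points` is a name introduced here, not something from the slides.

```python
def qq_points(xs, ys):
    """Pair the k-th smallest x with the k-th smallest y (equal sample sizes)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    return list(zip(sorted(xs), sorted(ys)))


xs = [3.1, 1.2, 2.5, 4.0, 2.9]  # hypothetical sample
ys = [2.2, 5.0, 3.9, 4.1, 3.5]  # hypothetical sample: the xs shifted by 1
points = qq_points(xs, ys)
n = len(xs)
for k, (x_k, y_k) in enumerate(points, start=1):
    # each pair corresponds to the empirical k/(n+1) quantile of both samples
    print(f"p = {k}/{n + 1}: ({x_k}, {y_k})")
```

Because the second sample here is the first shifted by 1, the points fall on a line parallel to y = x. Samples from the same distribution would scatter around y = x itself.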
Example
SAS data set: beeswax.sas
Solutions > Analysis > Interactive Data Analysis (find beeswax.sas in Work) > Analyze > Distribution > Output > Normal Q-Q plot
Class intervals for the beeswax data: 62.7 ≤ x < 63; 63 ≤ x < 63.3; 63.3 ≤ x < 63.6; 63.6 ≤ x < 63.9; 63.9 ≤ x < 64.2; 64.2 ≤ x < 64.5
Density Curves—Kernel Probability Density Estimation
Let $w(x)$ be the standard normal density. The rescaled version of $w(x)$, $w_h(x) = \frac{1}{h}\,w\!\left(\frac{x}{h}\right)$, is the normal density with mean 0 and standard deviation $h$.
$$F_n(x) = \begin{cases} 0, & x < x_{(1)} \\ k/(n+1), & x_{(k)} \le x < x_{(k+1)} \\ n/(n+1), & x \ge x_{(n)} \end{cases}$$
Properties of the Empirical Cumulative Distribution Function
Theorem 1
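density = (frequency ÷ 59) ÷ 0.3, for the six class intervals in order from 62.7 ≤ x < 63 to 64.2 ≤ x < 64.5:
0.0565 = (1/59) ÷ 0.3
0.452 = (8/59) ÷ 0.3
1.3559 = (24/59) ÷ 0.3
0.8475 = (15/59) ÷ 0.3
0.339 = (6/59) ÷ 0.3
0.2825 = (5/59) ÷ 0.3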
$$E\bigl(F_n(x)\bigr) = F(x)$$
$$\mathrm{Var}\bigl(F_n(x)\bigr) = \frac{1}{n}\,F(x)\bigl(1 - F(x)\bigr)$$
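The mean and variance of the ecdf follow from a binomial argument, sketched here under the assumption (not stated on the slide) that the observations are independent draws from F, so that each one falls at or below x with probability F(x):

```latex
\[
n F_n(x) = \#\{x_i \le x\} \sim \mathrm{Binomial}\bigl(n, F(x)\bigr)
\]
\[
E\bigl(F_n(x)\bigr) = \frac{n F(x)}{n} = F(x),
\qquad
\mathrm{Var}\bigl(F_n(x)\bigr) = \frac{n F(x)\bigl(1 - F(x)\bigr)}{n^2}
  = \frac{1}{n}\,F(x)\bigl(1 - F(x)\bigr).
\]
```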
Theorem 2
$$P\Bigl(\lim_{n \to \infty} \max_{x} \bigl|F_n(x) - F(x)\bigr| = 0\Bigr) = 1$$
That is, $F_n(x)$ converges to $F(x)$ simultaneously for all $x$ (uniformly in $x$) with probability one.
Examples
Plot the ecdf of this batch of numbers: 1, 14, 10, 9, 11, 9
SAS data set: beeswax.sas
Solutions > Analysis > Interactive Data Analysis (find beeswax.sas in Work) > Analyze > Distribution > Output > Cumulative Distribution Function > Empirical
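For the small batch 1, 14, 10, 9, 11, 9, the ecdf can also be computed directly from its definition as the proportion of values not exceeding x. The following Python sketch is one way to do this; the helper name `ecdf` is introduced here for illustration.

```python
def ecdf(batch):
    """Return the function x -> (number of values <= x) / n."""
    data = sorted(batch)
    n = len(data)

    def F_n(x):
        # count the observations that do not exceed x
        return sum(1 for v in data if v <= x) / n

    return F_n


batch = [1, 14, 10, 9, 11, 9]
F_n = ecdf(batch)
for x in sorted(set(batch)):
    # the ecdf is a step function; these are its heights at the jump points
    print(x, F_n(x))
```

Since 9 occurs twice, the step at 9 has height 2/6 rather than 1/6, and the ecdf there rises from 1/6 to 3/6.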
Quantile-Quantile Plots
The $p$th quantile of the distribution $F$ is the value $x_p$ such that $F(x_p) = p$, or $x_p = F^{-1}(p)$.
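[Figure: graph of the cdf $F(x)$ against $x$; the level $p$ on the vertical axis (which runs up to 1) corresponds to the $p$th quantile $x_p$ on the horizontal axis.]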
The empirical quantile of data
The Empirical Cumulative Distribution Function(ecdf)
Suppose that $x_1, x_2, \ldots, x_n$ is a batch of numbers. The ecdf is defined as
$$F_n(x) = \frac{1}{n}\,\#\{x_i \le x\}$$
For the given sample $x_1, x_2, \ldots, x_n$, take the ecdf in the form $F_n(x) = k/(n+1)$ for $x_{(k)} \le x < x_{(k+1)}$, so that $F_n(x_{(k)}) = k/(n+1)$. Thus the $k/(n+1)$ quantile of the data is assigned to $x_{(k)}$.
$$S_n(t) = 1 - F_n(t)$$
where $F_n(t)$ is the ecdf of the random variable $T$.
The Hazard Function
The hazard function is defined as
$$h(t) = \frac{f(t)}{1 - F(t)} = \frac{F'(t)}{1 - F(t)} = -\frac{d}{dt}\log S(t)$$
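As a worked illustration, consider an exponential lifetime with rate λ > 0. This distribution is chosen here only as an example and does not appear on the slide. Both forms of the hazard function give the same constant value:

```latex
\[
F(t) = 1 - e^{-\lambda t}, \qquad f(t) = \lambda e^{-\lambda t}, \qquad
S(t) = e^{-\lambda t} \qquad (t \ge 0)
\]
\[
h(t) = \frac{f(t)}{1 - F(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda,
\qquad
-\frac{d}{dt}\log S(t) = -\frac{d}{dt}\bigl(-\lambda t\bigr) = \lambda .
\]
```

A constant hazard means the instantaneous rate of failure does not depend on age.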
Histograms
Example: beeswax.sas
frequency =
1 for 62.7 ≤ x < 63
8 for 63 ≤ x < 63.3
24 for 63.3 ≤ x < 63.6
15 for 63.6 ≤ x < 63.9
6 for 63.9 ≤ x < 64.2
5 for 64.2 ≤ x < 64.5
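The histogram heights on the density scale can be obtained from these frequencies by dividing each relative frequency by the bin width, so that the bars have total area one. A short Python sketch using the frequencies and class boundaries above (the variable names are illustrative):

```python
edges = [62.7, 63.0, 63.3, 63.6, 63.9, 64.2, 64.5]  # class boundaries
freq = [1, 8, 24, 15, 6, 5]                           # counts per class
n = sum(freq)                                         # total number of observations

densities = []
for i, count in enumerate(freq):
    width = edges[i + 1] - edges[i]
    # height = relative frequency divided by bin width
    densities.append(count / n / width)

# total area of the bars: sum of height times width
area = sum(d * (edges[i + 1] - edges[i]) for i, d in enumerate(densities))
print([round(d, 4) for d in densities])
```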
Summarizing Data
Comparing Two Samples
The Analysis of Variance
Linear Least Squares
What should we learn?
Mathematics (Statistics)