
Newton's Method and the Gauss-Newton Method

NEWTON'S METHOD

    x^{k+1} = x^k − α^k (∇²f(x^k))^{-1} ∇f(x^k),

assuming that the Newton direction is defined and is a direction of descent.

• Pure form of Newton's method (stepsize α^k = 1):

    x^{k+1} = x^k − (∇²f(x^k))^{-1} ∇f(x^k)

  − Very fast when it converges (how fast?)
  − May not converge (or, worse, may not even be defined) when started far from a nonsingular local minimum
  − Issue: how to modify the method so that it converges globally, while maintaining the fast convergence rate (a code sketch of the basic iteration follows this list)
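As a concrete illustration of the iteration above, here is a minimal Python/NumPy sketch of Newton's method with a constant stepsize; the test function and all parameter values are assumptions made only for this example, not part of the original notes.

```python
import numpy as np

def newton(grad, hess, x0, alpha=1.0, tol=1e-10, max_iter=50):
    """Newton iteration x^{k+1} = x^k - alpha * (hess f(x^k))^{-1} grad f(x^k).

    With alpha = 1.0 this is the pure form of the method; a smaller or
    line-searched alpha is one of the globalization devices discussed below.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(hess(x), g)  # Newton direction (assumed to exist)
        x = x + alpha * d
    return x

# Illustrative test problem: f(x) = x_0^4 + x_1^2, minimized at the origin.
grad = lambda x: np.array([4.0 * x[0]**3, 2.0 * x[1]])
hess = lambda x: np.array([[12.0 * x[0]**2, 0.0], [0.0, 2.0]])
print(newton(grad, hess, x0=[1.0, 1.0]))
```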
CONVERGENCE BEHAVIOR OF PURE FORM

• Example: solve g(x) = 0 with g(x) = e^x − 1, starting at x^0 = −1.

    k    x^k         g(x^k)
    0   −1.00000    −0.63212
    1    0.71828     1.05091
    2    0.20587     0.22859
    3    0.01981     0.02000
    4    0.00019     0.00019
    5    0.00000     0.00000

  [Figure: the Newton iterates x^0, x^1, x^2, x^3 plotted on the graph of g(x).]

• Near the solution x* the convergence is superlinear:

    lim_{k→∞} ||x^{k+1} − x*|| / ||x^k − x*|| = 0,

  i.e., ||x^{k+1} − x*|| = o(||x^k − x*||)  (see p. 64).
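The table can be reproduced with a few lines of Python: the sketch below runs the pure Newton iteration x^{k+1} = x^k − g(x^k)/g'(x^k) for solving g(x) = 0 with g(x) = e^x − 1, starting at x^0 = −1.

```python
import math

# Pure Newton iteration for g(x) = e^x - 1, g'(x) = e^x, started at x^0 = -1.
x = -1.0
print(f"k=0  x={x:.5f}  g(x)={math.exp(x) - 1.0:.5f}")
for k in range(1, 6):
    x = x - (math.exp(x) - 1.0) / math.exp(x)  # x^{k+1} = x^k - g(x^k)/g'(x^k)
    print(f"k={k}  x={x:.5f}  g(x)={math.exp(x) - 1.0:.5f}")
```

The printed iterates match the table above and show the superlinear shrinkage of the error.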
MODIFICATIONS FOR GLOBAL CONVERGENCE

• Use a stepsize.
• Modify the Newton direction when:
  − the Hessian is not positive definite;
  − the Hessian is nearly singular (needed to improve performance).
• Use

    d^k = −(∇²f(x^k) + Δ^k)^{-1} ∇f(x^k)

  whenever the Newton direction does not exist or is not a descent direction. Here Δ^k is a diagonal matrix chosen so that ∇²f(x^k) + Δ^k is positive definite (a simple version of this device is sketched in the code below).
  − Modified Cholesky factorization
  − Trust region methods
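One simple way to realize the diagonal modification Δ^k is to add a multiple of the identity and increase it until a Cholesky factorization succeeds. The sketch below is a simplified stand-in for the modified Cholesky factorization mentioned above; the shift schedule and the test matrix are arbitrary choices for illustration.

```python
import numpy as np

def modified_newton_direction(H, g, tau0=1e-3, growth=10.0, max_tries=20):
    """Return d = -(H + Delta)^{-1} g with Delta = tau * I, where tau is
    increased until H + tau*I is positive definite (its Cholesky
    factorization succeeds).  This guarantees a descent direction but is
    cruder than a true modified Cholesky factorization."""
    n = len(g)
    tau = 0.0
    for _ in range(max_tries):
        try:
            L = np.linalg.cholesky(H + tau * np.eye(n))
            y = np.linalg.solve(L, -g)        # forward solve  L y = -g
            return np.linalg.solve(L.T, y)    # backward solve L' d = y
        except np.linalg.LinAlgError:
            tau = tau0 if tau == 0.0 else growth * tau
    raise RuntimeError("could not make the Hessian positive definite")

# Indefinite Hessian: the unmodified Newton direction would not be a descent direction.
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
d = modified_newton_direction(H, g)
print(d, g @ d)  # g'd < 0, so d is a descent direction
```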
LEAST-SQUARES PROBLEMS

    minimize    f(x) = (1/2)||g(x)||² = (1/2) Σ_{i=1}^{m} ||g_i(x)||²
    subject to  x ∈ ℝ^n,

where g = (g_1, ..., g_m) and each g_i : ℝ^n → ℝ^{r_i}.

• Many applications:
  − Model Construction – Curve Fitting
  − Neural Networks
  − Pattern Classification
MODEL CONSTRUCTION – CURVE FITTING

• Given a set of m input-output data pairs (y_i, z_i), i = 1, ..., m, from the physical system.
• Hypothesize an input/output relation z = h(x, y), where h is a known function and

    x ∈ ℝ^n   unknown parameters
    y ∈ ℝ^p   model's input
    z ∈ ℝ^r   model's output

• Problem: find the x that best matches the data, i.e., that minimizes the sum of squared errors

    min_x (1/2) Σ_{i=1}^{m} ||z_i − h(x, y_i)||²

• Example of a linear model: fit the data pairs by a cubic polynomial approximation. Take h(x, y) = x_3 y³ + x_2 y² + x_1 y + x_0, where x = (x_0, x_1, x_2, x_3) is the vector of unknown coefficients of the cubic polynomial. A least-squares fit of this model is sketched in the code below.
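Because h(x, y) is linear in the coefficient vector x, minimizing the sum of squared errors for the cubic model is an ordinary linear least-squares problem. The data below are synthetic and made up only for illustration.

```python
import numpy as np

# Synthetic data pairs (y_i, z_i); the generating polynomial and noise level are arbitrary.
rng = np.random.default_rng(0)
y = np.linspace(-1.0, 1.0, 25)
z = 0.5 * y**3 - 1.0 * y**2 + 2.0 * y + 0.3 + 0.01 * rng.standard_normal(y.size)

# h(x, y) = x_3 y^3 + x_2 y^2 + x_1 y + x_0 is linear in x = (x_0, x_1, x_2, x_3),
# so minimizing (1/2) sum_i (z_i - h(x, y_i))^2 is a linear least-squares problem.
A = np.column_stack([np.ones_like(y), y, y**2, y**3])  # row i: (1, y_i, y_i^2, y_i^3)
x_hat = np.linalg.lstsq(A, z, rcond=None)[0]
print("estimated coefficients (x_0, x_1, x_2, x_3):", x_hat)
```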
NEURAL NETS

• Nonlinear model construction with multilayer perceptrons.
• x is the vector of weights.
• Universal approximation property.
PATTERN CLASSIFICATION

• Objects are presented to us, and we wish to classify them into one of s categories 1, ..., s, based on a vector y of their features.
• Classical maximum posterior probability approach: assume we know

    p(j | y) = P(object with feature vector y is of category j)

  and assign an object with feature vector y to category

    j*(y) = arg max_{j=1,...,s} p(j | y).

• If the p(j | y) are unknown, we can estimate them using functions h_j(x_j, y) parameterized by vectors x_j. Obtain x_j, for j = 1, ..., s, by minimizing

    (1/2) Σ_{i=1}^{m} (z_i^j − h_j(x_j, y_i))²,

  where z_i^j = 1 if y_i is of category j, and 0 otherwise. (A small numerical sketch follows.)
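To make the targets z_i^j concrete, the sketch below builds them from hypothetical category labels and fits one scoring function h_j(x_j, y) per category by least squares; the data, the linear form chosen for h_j, and all problem sizes are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
s, m, p = 3, 200, 2                      # categories, samples, feature dimension
labels = rng.integers(0, s, size=m)      # category of each object (made-up data)
Y = rng.standard_normal((m, p)) + labels[:, None]  # feature vectors y_i

# z_i^j = 1 if y_i is of category j, 0 otherwise.
Z = np.zeros((m, s))
Z[np.arange(m), labels] = 1.0

# One linear model h_j(x_j, y) = x_j' [y, 1] per category, fitted by least squares.
Yb = np.column_stack([Y, np.ones(m)])
X = np.linalg.lstsq(Yb, Z, rcond=None)[0]        # column j holds x_j

# Classify by j*(y) = arg max_j h_j(x_j, y), mimicking the arg-max posterior rule.
pred = np.argmax(Yb @ X, axis=1)
print("training accuracy:", np.mean(pred == labels))
```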
THE GAUSS-NEWTON METHOD

• Idea: linearize g around the current point x^k,

    g̃(x, x^k) = g(x^k) + ∇g(x^k)'(x − x^k),

  and minimize the norm of the linearized function g̃:

    x^{k+1} = arg min_{x ∈ ℝ^n} (1/2)||g̃(x, x^k)||²
            = x^k − (∇g(x^k) ∇g(x^k)')^{-1} ∇g(x^k) g(x^k).

• Derivation:

    ||g̃(x, x^k)||² = ||g(x^k)||² + 2 g(x^k)' ∇g(x^k)'(x − x^k) + (x − x^k)' ∇g(x^k) ∇g(x^k)'(x − x^k)

  is minimized at x^{k+1} when its gradient is zero:

    2 ∇g(x^k) g(x^k) + 2 ∇g(x^k) ∇g(x^k)'(x^{k+1} − x^k) = 0,

  or, if ∇g(x^k) ∇g(x^k)' is invertible,

    x^{k+1} = x^k − (∇g(x^k) ∇g(x^k)')^{-1} ∇g(x^k) g(x^k).

• The direction

    − (∇g(x^k) ∇g(x^k)')^{-1} ∇g(x^k) g(x^k)

  is a descent direction, since ∇g(x^k) g(x^k) = ∇((1/2)||g(x)||²) evaluated at x = x^k, and ∇g(x^k) ∇g(x^k)' > 0.

• If g(x) = ∇f(x), then ∇g(x) = ∇²f(x) and we recover the pure form of Newton's method.

• Incremental version: process the component functions g_j one at a time,

    ψ_i = arg min_{x ∈ ℝ^n} Σ_{j=1}^{i} ||g̃_j(x, ψ_{j−1})||²,    i = 1, ..., m,

  where the g̃_j are the linearized functions

    g̃_j(x, ψ_{j−1}) = g_j(ψ_{j−1}) + ∇g_j(ψ_{j−1})'(x − ψ_{j−1}).
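The closed-form update translates directly into code. The sketch below writes the step in terms of the Jacobian J = ∇g(x)' and computes it with a least-squares solve instead of forming (∇g(x^k) ∇g(x^k)')^{-1} explicitly; the exponential model and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def gauss_newton(g, jac, x0, max_iter=20, tol=1e-10):
    """Gauss-Newton: with J = grad_g(x^k)', the step d solves
    min_d ||g(x^k) + J d||^2, which is the same update as
    x^{k+1} = x^k - (grad_g grad_g')^{-1} grad_g g(x^k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = g(x)
        J = jac(x)
        d = np.linalg.lstsq(J, -r, rcond=None)[0]  # Gauss-Newton step
        x = x + d
        if np.linalg.norm(d) < tol:
            break
    return x

# Illustrative curve-fitting problem: z ~ h(x, y) = x_0 * exp(x_1 * y).
y = np.linspace(0.0, 1.0, 10)
z = 2.0 * np.exp(-1.5 * y)                       # noiseless synthetic data
g = lambda x: x[0] * np.exp(x[1] * y) - z        # residual vector g(x)
jac = lambda x: np.column_stack([np.exp(x[1] * y), x[0] * y * np.exp(x[1] * y)])
print(gauss_newton(g, jac, x0=[1.0, 0.0]))        # should approach (2.0, -1.5)
```

Using a least-squares solve for the step avoids explicitly inverting ∇g(x^k) ∇g(x^k)', which can be badly conditioned.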