Linear Algebra and Algorithms


For a comprehensive treatment of linear algebra the reader is referred to the book [Lang(1987)].

Linear Algebra

Basic terms and definitions

A vector is a one-dimensional array of scalars, i.e. of real or complex numbers. We denote vectors by lowercase bold letters. In linear algebra two forms of vectors are distinguished: row vectors and column vectors. For example

is a row vector and

is a column vector.

An matrix is a two-dimensional array of scalars having rows and columns. The scalar elements of a matrix can be real or complex numbers. We consider only real matrices, i.e. matrices with real elements. Matrices are denoted by uppercase bold letters. For example the matrix

is a matrix. Matrices can be seen as a generalization of vectors, and thus vectors as special cases of matrices in which either the number of rows or the number of columns is one. So the -element row vector is an matrix and the -element column vector is an matrix.

A square matrix has the same number of rows and columns, i.e. it is of type , and is called the order of the square matrix. An example of a square matrix is given below.

A diagonal matrix is a special square matrix, in which only the diagonal elements can differ from . An example of a diagonal matrix is given as

A diagonal matrix can also be given by means of the diag() operation, by listing only the diagonal elements of the matrix as its arguments. For example the above diagonal matrix can be written as

The element in the -th row and -th column of a matrix is referred to as the -th element of that matrix and is denoted by . It is usual to construct a matrix with the help of the group operator . If denotes a defining formula of double-indexed scalars for some range of indices and , then the matrix can be given by

which means that the matrix is composed by grouping the scalars along two dimensions, where describes the row index and the column index. Therefore the -th element of the matrix is set to for all values of and in their given ranges. For example if the double-indexed scalars are defined as

then defining matrix as

leads to

Matrix is called the identity matrix. The identity matrix for any is characterized by having the value in every diagonal position, while all its other elements are . Therefore the identity matrix is a special diagonal matrix. Let stand for the column vector having the value in its -th position, while all its other elements are . In an -dimensional Euclidean space the vector for represents the unit vector in the -th dimension. For example for the unit vectors are given as

Then the identity matrix can also be expressed as a row of the unit vectors as

Elementary matrix operations

An elementary univariate operation on matrices is the transpose operation. The transpose of a matrix is defined by exchanging its -th element with its -th element for each pair of and in their given ranges. The transpose of a matrix is called the transposed matrix and the transpose of a given matrix is denoted by . For example the transpose of the above defined matrix is given by

The transpose of an matrix is an matrix, and thus the transpose operation changes the shape of the matrix for . The transpose of a square matrix remains a square matrix. The transpose of a row matrix is a column matrix and vice versa.

The multiplication of a matrix by a scalar is defined elementwise, i.e. by multiplying each element of the matrix by the given scalar. For example the matrix multiplied by gives

Two matrices can be added if they are of the same type, i.e. both are . Similarly a matrix can be subtracted from another one if they are of the same type. For example if the matrix is given as

then the matrix obtained as the sum is given by

Two matrices can be multiplied if the number of columns of the first matrix equals the number of rows of the second one. The multiplication of the matrix by the matrix gives an times product matrix . Let the -th, -th and -th elements of the matrices , and be , and , respectively. Then the matrix multiplication

is defined by the elements of , and as

Hence the -th element of the product matrix is the (scalar) product of the -th row of the matrix and the -th column of the matrix , both consisting of elements. In order to illustrate matrix multiplication we define the times matrix as

Then the matrix product results in a times matrix , which is given by

Let stand for the column vector having every element set to the value . It is called the unit vector. Multiplying a matrix from the right by gives the row sums of that matrix in column vector form. For example

gives the row sums of the matrix .
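As a minimal illustration in Python with NumPy (the matrices below are hypothetical examples, not the ones used above), matrix multiplication and the row-sum trick can be checked as follows:

import numpy as np

# Hypothetical example matrices; the product element c_ij = sum_k a_ik * b_kj requires
# that the number of columns of A equals the number of rows of B.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # 2 x 3 matrix
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 1.0]])               # 3 x 2 matrix

C = A @ B                                # 2 x 2 product matrix
row_sums = A @ np.ones(A.shape[1])       # multiplication by the all-ones column vector
print(C)
print(row_sums)                          # [ 6. 15.]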

The addition of matrices is commutative and associative, i.e. the following relations hold for any matrices , and :

Matrix multiplication is distributive with respect to matrix addition, and matrix multiplication is associative, i.e. the following relations hold for any matrix , matrices and , and matrix :

However matrix multiplication is in general, apart from some special cases, NOT commutative, i.e. for matrices and

Therefore the multiplication/product of the matrices and is not fully specified on its own. Instead one should speak about multiplying the matrix by from the right (the default) or from the left.

The special classes of matrices for which commutativity holds include

  • multiplication by the identity matrix : ,
  • multiplication of diagonal matrices and with each other: ,
  • multiplication of two powers of the same matrix : and
  • multiplication of two polynomials of the same matrix , and : , which follows from the commutativity of two powers of the same matrix by repeated application of the distributivity of matrix multiplication.

Further useful relations are

The major difference between the elementary matrix operations and their scalar counterparts is that matrix multiplication is not commutative in general. This has far-reaching consequences for matrix theory.

Linear independence of vectors

The expression

is called a linear combination of the vectors , and the scalars are the weights. It is called linear since the operations used, addition and constant multiplication, are linear.

For example each vector pointing to a point in the -dimensional Euclidean space can be given as a linear combination of the vectors , and as

We say that the vectors , and span the -dimensional Euclidean space and hence generate the whole -dimensional space. There are infinitely many other sets of vectors which span the -dimensional Euclidean space; the vectors need not be perpendicular to each other, e.g. , and .

However if we consider the vectors , and , then the linear combinations of , and cover only the vectors in the plane with . In this case we say that the vectors , and generate (or span) only a (-dimensional) subspace of the whole -dimensional Euclidean space. This is because the third vector can be given as a linear combination of the other two as

This phenomenon is described by saying that the vectors , and are not linearly independent of each other. In fact not only can be given as a linear combination of the other two, but any of the three vectors can be given as a linear combination of the other two. The vectors , and are linearly dependent.

However the vectors , and behave differently: none of them can be given as a linear combination of the other two. We say the vectors , and are linearly independent. Among the vectors , and only any two of them are linearly independent of each other.

It can be seen that -dimensional vectors determine the whole -dimensional space if and only if they are linearly independent. If they are linearly dependent and only of them are linearly independent, then they generate (or span) only a -dimensional subspace of the whole -dimensional space.

If the vectors are collected in a matrix, for example each vector put in a different row of the matrix, then the maximum number of linearly independent row vectors of the matrix is called the rank of that matrix. The vectors can also be put in the columns of a matrix, in which case the maximum number of linearly independent column vectors of the matrix is the rank of that matrix. It can be seen that the maximum number of linearly independent rows and the maximum number of linearly independent columns of a matrix are the same, so each matrix has a unique rank. The rank of the matrix is denoted by . Based on the above, the rank of a matrix is the dimension of the subspace generated by its columns or rows as vectors.
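A minimal numerical sketch with NumPy, using hypothetical row vectors (the examples above are not reproduced here), illustrates the rank:

import numpy as np

# Three 3-dimensional vectors placed in the rows of a matrix; the third row is
# the sum of the first two, so only two rows are linearly independent.
M = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])
print(np.linalg.matrix_rank(M))   # 2: the rows span only a 2-dimensional subspace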

It follows that the statements below hold for the rank of a matrix.

  • The rank of an matrix is at most the smaller of and , in other words
  • The rank of the square matrix is at most its order, in other words

Determinant

The determinant is a scalar assigned to a square matrix. It depends on every element of the matrix, hence the determinant is a scalar function of the square matrix. The determinant of the matrix is denoted by (or ) and .

Expression of determinant - Leibniz formula

The determinant is a sum of signed products, where each product is formed by multiplying one element taken from each row of the square matrix, each of them at a different column position, and every such product is included in the sum. In other words the determinant of the matrix is given by

where is a permutation of the column indices and the sign function assigns to a permutation if it can be created by an even number of exchanges of two numbers starting from , and otherwise it gives . The number of products in the sum equals the number of possible permutations, which is . This expression of the determinant is called the Leibniz formula.
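A minimal sketch of the Leibniz formula in Python, summing the signed products over all permutations of the column indices; the 2 x 2 test matrix is a hypothetical example:

import numpy as np
from itertools import permutations

def det_leibniz(A):
    # Determinant via the Leibniz formula: sum of sign(p) * product of A[i, p(i)]
    n = A.shape[0]
    total = 0.0
    for p in permutations(range(n)):
        # sign of the permutation: +1 for an even, -1 for an odd number of inversions
        inversions = sum(1 for i in range(n) for j in range(i + 1, n) if p[i] > p[j])
        sign = -1.0 if inversions % 2 else 1.0
        prod = 1.0
        for row, col in enumerate(p):
            prod *= A[row, col]
        total += sign * prod
    return total

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])                 # hypothetical example matrix
print(det_leibniz(A), np.linalg.det(A))    # both give -2.0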

Determinant of a matrix

Let the matrix be given as

For there are only two permutations, therefore the expression of can be given as

Determinant of a matrix - Sarrus rule

Let the matrix be given as , in other words

For there are permutations; three of them appear with sign and the other three with . The expression of can be given as

This expression can be easily memorized with the help of the rule of Sarrus, see Figure 12. Copying the first two columns of the matrix to the right of it, the products with positive sign are obtained along the diagonals running downwards to the right, while the products with negative sign are the ones running upwards to the right.

[[File:./figs/Sarrus_rule1.png]]

Unfortunately the scheme of Sarrus cannot be generalized to higher dimensions.

Geometric interpretation

The columns of the matrix can be interpreted as vectors in the -dimensional Euclidean space. For these vectors span a parallelogram, and the determinant is exactly the signed value of the area of this parallelogram. This is shown in Figure 13.

[[File:./figs/2D_determinant_as_area_parallelogram.png]]

For the column vectors of , denoted by , and , form a parallelepiped and the determinant gives the signed volume of this parallelepiped, see Figure 14.

[[File:./figs/3D_determinant_as_area_parallelepiped.png]]

This interpretation holds also for higher dimensions. So the column vectors of matrix span a parallelotope in the -dimensional space and the determinant of gives its signed -dimensional volume.

Properties of determinant

Determinant of several special matrices

  • Determinant of identity matrix is .
  • Determinant of diagonal matrix is .

The determinant of matrix has the following properties regarding row and column manipulations.

  1. Multiplying the matrix by a constant results in multiplying the determinant by , in other words
  2. Multiplying any row or any column of by a constant leads to a multiplication of the determinant of by . In other words the determinant of the modified matrix is .
  3. Exchanging two rows or two columns of multiplies the determinant by .
  4. Adding a scalar multiple of another row to a row of , or a scalar multiple of another column to a column of , does not change the value of the determinant.

Further useful properties of the determinant are

Laplace expansion

The minor of the matrix is defined as the determinant of the submatrix of which is obtained by omitting the -th row and the -th column of the matrix , for . The determinant of the matrix can be expressed by expanding it along its -th row. This gives a sum of signed products of each element of the -th row with its corresponding minor. In other words

This is called the Laplace expansion along the -th row of the matrix .

As an example we show the Laplace expansion of the -dimensional square matrix with general notations along its 1st row, which leads to

The Laplace expansion along the -th column of the matrix can be defined in a similar way.
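A minimal recursive sketch of the Laplace expansion along the first row, in Python; the 3 x 3 example matrix is hypothetical:

import numpy as np

def det_laplace(A):
    # Determinant by Laplace expansion along the first row (recursive).
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        # Minor: delete row 0 and column j; the cofactor sign is (-1)^(0+j).
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
        total += (-1) ** j * A[0, j] * det_laplace(minor)
    return total

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])            # hypothetical example matrix
print(det_laplace(A), np.linalg.det(A))    # both give 8.0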

Determinant and linear independence

One of the most important uses of the determinant is its relation to linear independence. The determinant of the matrix is non-zero if and only if the rows (and columns) of the matrix are linearly independent.

This can be seen e.g. with the help of the geometric interpretation of the determinant. If the column vectors of the matrix are linearly independent then they generate the whole -dimensional space, thus the -volume of the parallelotope spanned by them must be non-zero and therefore the determinant is also non-zero. However if the column vectors of the matrix are linearly dependent then they generate only a -dimensional subspace of the whole -dimensional space, which implies that the -volume of the parallelotope spanned by them must be zero and hence the determinant is also zero.

Inverse matrix

We consider the question whether a matrix exists which, multiplied by the square matrix , results in the identity matrix. It would be the analogue of the reciprocal in the set of real (or complex) numbers. The following questions arise immediately.

  • What is the condition for the existence of such a matrix ? Even in the set of real numbers not every number has a reciprocal (namely the number zero).
  • If such matrices exist for multiplying from both the left and the right, are they the same? This question arises due to the general non-commutativity of matrix multiplication.

The adjugate matrix

The adjugate matrix of the matrix is denoted by , and it is defined as the transpose of the matrix of the signed minors. More precisely is given by its elements as

The importance of the adjugate matrix lies in the following relation, which holds for every square matrix :

The inverse matrix

The matrix is an invertible matrix if there exists a matrix which, multiplied by from the left, gives the identity matrix and, multiplied by from the right, also gives the identity matrix. This matrix is called the inverse matrix of and is denoted by . Thus the inverse matrix satisfies the following defining relation:

Dividing the above relation for by leads to the expression of the inverse matrix as

Since the adjugate matrix always exists, the inverse matrix exists if and only if the determinant of is non-zero. A square matrix which is not invertible is also called a singular matrix. Similarly an invertible matrix is also called non-singular. Therefore

  • The square matrix is singular if and only if its determinant is zero.
  • The square matrix is non-singular if and only if its determinant is non-zero.

Inverse of a matrix

Let the matrix be given as

The determinant of can be given as . Due to , all the minors are scalars. Hence the inverse matrix can be expressed as

This can be checked by multiplying by

which is as expected. It is easy to remember the numerator of the inverse of a matrix: the values in the main diagonal (a and d) are exchanged and the values in the secondary diagonal (b and c) are multiplied by .
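The 2 x 2 inverse rule can also be checked numerically; a minimal sketch with a hypothetical matrix [[a, b], [c, d]]:

import numpy as np

# Hypothetical 2 x 2 matrix with non-zero determinant.
a, b, c, d = 4.0, 7.0, 2.0, 6.0
A = np.array([[a, b], [c, d]])

det = a * d - b * c                            # determinant of the 2 x 2 matrix
A_inv = np.array([[d, -b], [-c, a]]) / det     # swap the diagonal, negate the off-diagonal
print(A @ A_inv)                               # identity matrix (up to rounding)
print(np.allclose(A_inv, np.linalg.inv(A)))    # True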

Inverse of a matrix

Let the matrix be given as , in other words

Due to , all the minors are determinants of -dimensional matrices. Hence the inverse matrix can be expressed as

Properties of the inverse matrix

Inverse matrix of several special matrices

  • The inverse matrix of the identity matrix is the identity matrix itself: is .
  • The inverse matrix of a diagonal matrix is also diagonal.

, :
.

Further useful properties of the inverse matrix are

Linear systems of equations

The linear system of equations consisting of unknowns and equations can be given as

This is also called a linear system of equations. By introducing the matrix , the column vector and the column vector as

the system of linear equations can be rewritten in matrix vector form as

Here is the coefficient matrix, is the column vector of unknowns (the unknown column vector) and is a given column vector.
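For the square, non-singular case a minimal numerical sketch in Python; the coefficient matrix and the right-hand side below are hypothetical:

import numpy as np

# Hypothetical system A x = b with a square, non-singular coefficient matrix.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)       # numerically preferable to forming the inverse explicitly
print(x)                        # [2. 3.]
print(np.allclose(A @ x, b))    # True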

Reformulation - linear combination of column vectors of matrix

The above system of linear equations can be rewritten by the help of the group operator on index as

where is a vector due to the application of the group operator . The order of grouping on index and summing can be exchanged on the left hand side of the equation, which gives

Introducing the column vectors composed from the columns of as

the relation can be further rearranged as

Based on this formulation, the problem of solving the above linear system of equations can be seen as finding the weights with which the given vector can be composed as a linear combination of the -dimensional vectors , .

This interpretation makes it possible to establish solvability criteria for the different cases of the linear system of equations, depending on the relation between and as well as on the linear independence of the vectors , and .

Solvability - general case

The following statements follow from the above linear combination interpretation for the solvability of the linear system of equations.

  1. The linear system of equations is solvable if and only if the vector can be composed as a linear combination of the vectors , , in other words if the vectors , and the vector are linearly dependent. This means that adding the vector to the set of vectors , does not increase the number of linearly independent vectors in the set. This is equivalent to the statement that the extended matrix , obtained by adding the column vector as a column to the matrix , has the same rank as the matrix .
  2. If then two cases must be distinguished according to the relation between and .
    • If then the vectors , generate the -dimensional space, in which the weights of the -dimensional vector are unique, therefore the system has a unique solution. Note that in this case must hold, since cannot be greater than .
    • If then the vectors , generate a -dimensional space, in which only components of the -dimensional vector can be composed by linear combination of the vectors , . Therefore in this case unknowns can be freely selected and the system has infinitely many solutions. Note that in this case , so it can also be smaller than .

The different cases of solvability and their criteria are summarized in Table 1.

Table 1. Solvability criteria of the linear system of equations

  • Rank of the extended matrix differs from the rank of the coefficient matrix: no solution in the general case; not possible for a homogeneous system.
  • Both ranks are equal and equal to the number of unknowns: unique solution in the general case; only the trivial solution for a homogeneous system.
  • Both ranks are equal and smaller than the number of unknowns: infinitely many solutions, with the remaining unknowns freely selectable, in the general case; also non-trivial solutions (infinitely many, with free parameters) for a homogeneous system.


Solvability - square matrix

Several further findings can be obtained for the special case of square matrix . If then matrix is non-singular and . Otherwise matrix is singular and

Taking all these into account the solvability conditions for the square matrix can be summarized in a slightly simplified form, which is shown in Table 2.

Table 2. Solvability criteria of the linear system of equations with a square coefficient matrix

  • The coefficient matrix is non-singular (its rank equals its order): unique solution for an inhomogeneous system; only the trivial solution for a homogeneous system.
  • The coefficient matrix is singular and its rank equals the rank of the extended matrix: infinitely many solutions, with the remaining unknowns freely selectable, for an inhomogeneous system; also non-trivial solutions (infinitely many, with free parameters) for a homogeneous system.
  • The coefficient matrix is singular and its rank is smaller than the rank of the extended matrix: no solution for an inhomogeneous system; not possible for a homogeneous system.


Eigenvectors, eigenvalues, spectral decomposition

For a given square matrix finding the scalars and column vectors satisfying the equation

is called the eigenvalue problem of the matrix . The scalars satisfying the equation are called eigenvalues of the matrix and the vectors satisfying the equation are called eigenvectors of the matrix . Transforming an eigenvector by multiplying it by gives a vector parallel to the considered eigenvector. Hence the transformation by does not change the direction of the eigenvectors, and their length is multiplied by the eigenvalue. This explains the names eigenvector and eigenvalue.

Characteristic polynomial

Rearranging the above relation gives

This is a homogeneous system of linear equations, which has a non-trivial solution only if the determinant of the coefficient matrix is zero, in other words, if

Observe that arises in each element of the main diagonal of the matrix . It follows that the permutation giving the product of the main diagonal, and therefore also the determinant of this matrix, is an -th order polynomial of . This polynomial of is called the characteristic polynomial of the matrix . Its roots give the eigenvalues. According to the fundamental theorem of algebra, an -th order polynomial has exactly roots in the set of complex numbers. It follows that there exist complex eigenvalues, implying at most real eigenvalues, some of which can occur multiple times.

Algebraic multiplicity and geometric multiplicity

Let , , denote the eigenvalues of the matrix . The number of times the eigenvalue occurs among the roots of the characteristic polynomial is called its algebraic multiplicity . For each the homogeneous system

determines the eigenvectors belonging to the eigenvalue . Due to the homogeneous character of the system, its solution must have at least one free parameter. This is manifested in the fact that any constant multiple of an eigenvector is also a solution of the system. Hence an eigenvector is determined only up to a constant multiple, i.e. only the direction (in the -dimensional Euclidean space) of the eigenvector is determined, but not its length. The number of linearly independent eigenvectors belonging to the eigenvalue can be one or more, depending on the rank of the matrix . In fact the number of linearly independent eigenvectors belonging to the eigenvalue equals the number of freely selectable parameters, which is , and is called the geometric multiplicity of .

An example of determining eigenvalues and eigenvectors

Matrix is given as

  1. Computing the eigenvalues of the matrix



  2. Determining the eigenvectors of the matrix
    • The eigenvector belonging to





  

    • The eigenvector corresponding to




      or    ,  since the eigenvectors are only determined up to a multiplicative constant.
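Such a computation can also be reproduced numerically; a minimal sketch with NumPy, using a hypothetical symmetric 2 x 2 matrix (the example values above are not repeated here):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                # hypothetical example matrix

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                        # the eigenvalues, e.g. [3. 1.]
for k in range(len(eigenvalues)):
    v = eigenvectors[:, k]                # the k-th eigenvector is the k-th column
    print(np.allclose(A @ v, eigenvalues[k] * v))   # True: A v = lambda v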

Left and right eigenvalues and eigenvectors

So far we considered the form of the eigenvalue problem in which the vector is a column vector and hence is located on the right side of the matrix . For this reason this eigenvalue problem is also called the right eigenvalue problem, and the -s and -s are also called right eigenvalues and right eigenvectors of the matrix , respectively.

There is a similar eigenvalue problem, in which the vector is a row vector and arises on the left hand side of the matrix as

and it is called the left eigenvalue problem. The solution -s and -s are called left eigenvalues and left eigenvectors of the matrix , respectively.

The two kinds of eigenvalue problems can be related by transposing them into each other. For example transposing the right eigenvalue problem gives

This shows the relation between the two kinds of eigenvalue problems, which can be formulated as follows. The right eigenvalues and eigenvectors of the square matrix are the left eigenvalues and eigenvectors of the transpose of the matrix , respectively. It follows that for a symmetric matrix the right and the left eigenvalues and eigenvectors coincide, due to the symmetry .

Properties of eigenvalues

Let and stand for the eigenvalues and the -th eigenvalue of the matrix , respectively. Useful properties of the eigenvalues are given as

Spectral decomposition

The spectral decomposition of square matrix is a factorization into the canonical form as

where is a diagonal matrix. The spectral decomposition is also called eigendecomposition or diagonalisation.

The matrices arising in the factorization can be interpreted as follows. The diagonal elements of the diagonal matrix are the right eigenvalues of matrix . Note that each eigenvalue arises in the diagonal as many times as its algebraic multiplicity. The column vectors of matrix are the right eigenvectors belonging to the eigenvalues, . Thus and satisfies

which can be arranged in matrix form as

Note that the fact that is determined only up to a constant, does not influence the canonical form of the spectral decomposition. This can be seen by replacing by , in matrix form by with , in the canonical form, which leads to

where we utilized the commutativity of the diagonal matrices .
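A minimal numerical sketch of the spectral decomposition with a hypothetical diagonalisable matrix:

import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])                  # hypothetical example matrix (eigenvalues 5 and 2)

lam, V = np.linalg.eig(A)                   # eigenvalues and eigenvectors (columns of V)
Lambda = np.diag(lam)                       # diagonal matrix of the eigenvalues
A_reconstructed = V @ Lambda @ np.linalg.inv(V)
print(np.allclose(A, A_reconstructed))      # True: A equals V Lambda V^{-1}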

An example for spectral decomposition

Continuing the example of determining eigenvalues and eigenvectors of the matrix

now we show the spectral decomposition of matrix .

So far we have determined the right eigenvalues and , and the right eigenvectors and belonging to them.

  1. Composition of the diagonal matrix . The diagonal elements of the matrix are the eigenvalues and . Thus the matrix is given by

  2. Composition of the matrix . The columns of the matrix are the eigenvectors and . Thus the matrix is given as

  3. Determination of the inverse of matrix

      

  4. The spectral decomposition of matrix

Conditions for the existence of spectral decomposition

Note that not every square matrix is diagonalisable. Two useful conditions for matrix to be diagonalisable can be given as

  1. The spectral decomposition of the matrix exists if and only if the number of linearly independent right eigenvectors of equals .
  2. A sufficient condition for the existence of the spectral decomposition of the matrix is that its characteristic polynomial has no repeated roots.

Based on the first condition, the canonical form of the spectral decomposition can be derived. The column vectors of are linearly independent, so is non-singular and hence the inverse matrix exists. Multiplying the matrix form relation by from the right gives

in which cancels, resulting in the canonical form of the spectral decomposition.

Application of spectral decomposition

Spectral decomposition is used in many areas of the natural sciences. Below we list several typical examples of using spectral decomposition.

  1. Calculating a power of the matrix (a numerical sketch is given after this list)
  2. Expressing the inverse of the matrix
    Such an expression of the inverse of the matrix exists if and only if for every .
  3. Expressing a matrix function of the matrix .

Here is defined with the help of the power series of f().

  4. Showing the convergence of
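As a sketch of the first application, a matrix power computed from the spectral decomposition; the matrix and the exponent below are hypothetical:

import numpy as np

# A^k = V diag(lambda)^k V^{-1}, so only the eigenvalues have to be raised to the power k.
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
k = 5

lam, V = np.linalg.eig(A)
A_pow = V @ np.diag(lam ** k) @ np.linalg.inv(V)
print(np.allclose(A_pow, np.linalg.matrix_power(A, k)))   # True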

Matrix norms

Eigenvalues can be considered a set of numbers characterizing the magnitude of a square matrix. Often the eigenvalue with the largest magnitude is of outstanding importance. Besides this, there is a need to characterize the magnitude of a square matrix by a single number, analogously to the absolute value of a real or complex number.

The matrix norm is a measure characterizing the magnitude of a square matrix by one number. The matrix norm is denoted by , e.g. the norm of the square matrix is .

Definition of several matrix norms

In contrast to the uniqueness of the absolute value, several different matrix norms can be defined. Let be the -th element of the square matrix . Then several matrix norms can be given as listed below; a short numerical illustration follows the list.

  1. Max norm
    Note that there is another, similar norm which is also called the max norm.
  2. Max row sum
  3. norm
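A short numerical illustration with NumPy; the matrix is a hypothetical example, and the norms shown (largest absolute element, maximum absolute row sum, Frobenius norm) are common choices corresponding to the list above under the usual definitions:

import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])                # hypothetical example matrix

print(np.max(np.abs(A)))                   # largest absolute element (a max norm)
print(np.linalg.norm(A, np.inf))           # maximum absolute row sum
print(np.linalg.norm(A, 'fro'))            # Frobenius norm: square root of the sum of squared elements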

General properties of matrix norms

Matrix norms are useful tools in theoretical derivations, e.g. to prove the existence of an upper bound. In fact in most cases the concrete form of the matrix norm is not of interest. Instead some properties of the matrix norm are utilized. These important general properties of matrix norms are listed for square matrices and as

Useful matrix norms also have the following additional property

.

All three matrix norms defined above have this additional property. Let and be the -th element of the square matrices and , respectively. Then the additional property can be shown for these matrix norms as follows.

  1. Max norm
  2. Max row sum
  3. norm

Several typical applications of matrix norms are given below.

  • To show that the norm of a matrix expression is bounded from above.
  • To show that , when .
  • To show that is convergent, when .

Application of linear algebra

Solving system of equations

System of linear equations

Let us consider the system of linear equations with unknowns and equations, i.e. with square matrix . This system of linear equations can be written in matrix vector form as

Inhomogeneous system with

Let us consider the case with , which is an important subclass of systems of linear equations. Then there is always a unique solution, since the rank cannot change when extending to , due to . The solution can be given in closed form by multiplying the matrix-vector equation by from the left, which gives the closed-form solution as

Note that the inverse matrix exists in this case, since matrix is non-singular due to .

Homogeneous system with

Recall that the equation

holds for the matrix . If then this leads to

Let denote the -th column vector of , . It follows that any column vector is a solution of the homogeneous equation with . If additionally , then there is only one free parameter. Since any constant multiple of a column vector is also a solution, it follows that

  • , gives every solution of the homogeneous equation with and
  • the columns of are constant multiples of each other if .

System of non-linear equations

Let us consider the following system of non-linear equations with unknowns and equations:

Introducing the notations

this can be rewritten as

Solving this system of non-linear equations is equivalent to the minimization problem

since the square of each element in is non-negative. The gradient of can be expressed as

where is the Jacobian matrix of . Then the minimization problem can be solved numerically by applying the gradient descent algorithm. This can be executed by initializing as

and performing the iterative steps

where is the step size, which can be determined by any of the line search methods (see Section 3.2.3).
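A minimal sketch of this procedure in Python; the two-equation system, its Jacobian, the fixed step size and the iteration count are assumptions made for illustration:

import numpy as np

# Solve f(x) = 0 by minimizing g(x) = ||f(x)||^2 with gradient descent.
def f(x):
    return np.array([x[0] ** 2 + x[1] - 3.0,
                     x[0] + x[1] ** 2 - 5.0])

def jacobian(x):
    return np.array([[2.0 * x[0], 1.0],
                     [1.0, 2.0 * x[1]]])

x = np.array([1.0, 1.0])                   # initial guess
gamma = 0.01                               # fixed step size
for _ in range(10_000):
    grad = 2.0 * jacobian(x).T @ f(x)      # gradient of ||f(x)||^2 is 2 J(x)^T f(x)
    x = x - gamma * grad
print(x, f(x))                             # f(x) should be close to zero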

Numeric optimization

Gradient methods

The gradient methods are numeric methods for unconstrained mathematical optimization. They can be applied to differentiable multivariate functions. They are usually formulated as finding the minimum of a multivariate function. All of them use the gradient of the multivariate function, this explains the name.

The basic idea of gradient methods is to decrease the function value in each step of the algorithm. It is well known that at a given point x of the multivariate function, the infinitesimal change of the function value is largest in the direction of the gradient at that point, . Therefore the steepest descent is in the direction opposite to the gradient. The iterative step of any gradient method moves in a direction which is related to the negative gradient at the current point, . The step size is denoted by . Hence the iterative step of any gradient method can be given as

Note that both the step size and the vector can depend on the iteration step.

Requirements and conditions to decrease the function value in each iteration

In order to decrease the function value in each step, two requirements must be fulfilled.

  • It must be ensured that the vector points in a descent direction.
  • The step size must be small enough to avoid overshooting the nearest local minimum and thus to avoid divergence.

The first requirement is equivalent to requiring that the negative gradient and the vector lie on the same side of the hyperplane perpendicular to the gradient. In this case their scalar product is positive. This leads to the condition ensuring as

The second requirement can be fulfilled by formalising "small enough" through specifying a demanded amount of decrease in the function value. This is done by the Armijo condition, which demands that the decrease in the function value should be at least proportional to the product of the step size and the directional derivative . Hence the requirement can be fulfilled by the Armijo condition, given as

where is the constant of proportionality. In practice it is usually set to a small value, e.g. . Observe that both sides of this inequality are negative.

Determination of the step size

If the step size were set too small then the convergence would become too slow. This leads to a third requirement:
The step size must be large enough to avoid slowing down the convergence. The requirements and together ensure an optimal step size.

The three most important search methods for determining the optimal step size are

  • Exact line search
  • Inexact line search
  • Backtracking line search

The exact line search determines the optimal step size by minimizing the function value after taking the next step. This leads to a one-dimensional minimization problem as

The exact line search is only seldom used in practice due to its high computational effort.

In the inexact line search the step size, starting from an initial value, e.g. , is iteratively decreased until the function value after the step becomes less than the current one, in other words until

The backtracking line search searches for a "small enough" step size by utilizing the Armijo condition. Starting from an initial step size, the step size is iteratively multiplied by a properly selected constant until the Armijo condition is fulfilled.

While exact line search fulfils both requirements and , inexact line search fulfils only , and backtracking line search lies somewhere in between. Because of this and its simplicity, backtracking line search is often used in practice.

The pseudo-code of the gradient method with backtracking line search can be seen in Algorithm 9.


Algorithm 9 Gradient method with backtracking line search
—————————————————————————————
Inputs: - multivariate function
- initial value
- precision value
Outputs:
- found local minimum place
- local minimum
—————————————————————————————
1 Initialisation ,
2 while
3
4    Compute gradient
5    Compute vector
   —– Backtracking line search - begin —
6    Init backtracking line search , set
7    while
8
9    end
   —– Backtracking line search - end —-
10   Update as
11   end
12   return and
—————————————————————————————
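A minimal Python sketch corresponding to Algorithm 9; the test function, the initial point and the constants (initial step size, contraction factor, Armijo constant) are assumptions:

import numpy as np

def f(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2      # hypothetical test function

def grad_f(x):
    return np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])

def gradient_method_backtracking(f, grad_f, x0, eps=1e-6, max_iter=10_000,
                                 gamma0=1.0, rho=0.5, c=1e-4):
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:        # stopping criterion on the gradient norm
            break
        d = -g                             # steepest-descent direction
        gamma = gamma0
        # Backtracking: shrink the step until the Armijo condition is fulfilled.
        while f(x + gamma * d) > f(x) + c * gamma * (g @ d):
            gamma *= rho
        x = x + gamma * d
    return x, f(x)

x_min, f_min = gradient_method_backtracking(f, grad_f, x0=[0.0, 0.0])
print(x_min, f_min)                        # close to [1, -2] and 0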


The gradient descent

The gradient descent algorithm is the special case of the gradient methods obtained by setting to be the steepest descent direction, in other words . Thus gradient descent is a first-order method and its iterative step can be given as

In the context of deep learning the step size is also called the learning rate. The pseudo-code of the gradient descent algorithm is given in Algorithm 10.


Algorithm 10 Gradient descent algorithm
—————————————————————————————
Inputs: - multivariate function
- initial value
- precision value
Outputs:
- found local minimum place
- local minimum
—————————————————————————————
1 Initialisation ,
2 while
3
4    Compute gradient
5    Update learning rate by setting
6    Update as
7    end
8    return and
—————————————————————————————


Convergence to a local minimum can be guaranteed under certain assumptions on the function and specific choices of the dependency of the learning rate on , , and . If the function is convex, then the local minimum is also the global minimum.

The extension of gradient descent, the stochastic gradient descent algorithm and its improved variants are the most commonly used numeric optimization algorithms for training deep neural networks.

Further special cases of gradient methods

Several further algorithms can be obtained as special cases of the gradient methods by choosing the vector in a specific way. These include

  • Diagonal scaled gradient descent and
  • Newton’s method.

The diagonally scaled gradient descent algorithm is obtained by setting
, where is a diagonal matrix. Note that just like the vector , the matrix can also depend on the iteration step.

Newton’s method, also called the Newton–Raphson method, applied to optimization can also be obtained by setting with , where is the Hessian matrix. Since the Hessian matrix contains second-order derivatives of the function , Newton’s method is a second-order optimization method. For the class of convex functions Newton’s method converges to the minimum quadratically fast.

Second-order algorithms

The second-order algorithms use, in their original formulation, second-order information from the Hessian matrix. The following second-order methods are considered:

  • Newton’s method
  • Quasi-Newton methods
  • Non-linear conjugate gradient algorithm

Newton methods

Newton’s method for local optimization is based on the second-order Taylor expansion of the function to be optimized. We apply it to minimize the function . The second-order Taylor expansion of around is given as

The method chooses the next value of the parameter vector as the minimum of the above Taylor expansion. It is well known that a necessary condition for being a minimum is that the gradient of must be . Computing the gradient of with respect to and setting it to gives
from which the next value of the parameter vector can be expressed by setting as
which is the same formula as obtained by setting in the general iteration relation of the gradient methods. This recursive computational rule also utilizes the second-order information in the Hessian matrix. If the function to be minimized is locally quadratic then the method jumps to the minimum in one step. Otherwise it iterates in quadratic steps and hence converges faster than gradient descent. There are two major issues with Newton’s method, listed below (a code sketch of the Newton step follows the list).

  1. Limitation - The convergence of the method is ensured only for locally convex regions. It can move in the wrong direction near a saddle point.
  2. Drawback - The numerical complexity of the method is , which is higher than that of the first-order methods. Here is the number of parameters, i.e. the size of the vector . This is due to the computation of the inverse Hessian in each iteration step.
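A minimal sketch of the Newton iteration for minimization; the separable test function, its gradient and Hessian, and the starting point are assumptions:

import numpy as np

# Minimize f(x) = x0^4 + x0^2 + x1^2 with the Newton step x <- x - H(x)^{-1} grad f(x).
def grad_f(x):
    return np.array([4.0 * x[0] ** 3 + 2.0 * x[0], 2.0 * x[1]])

def hessian_f(x):
    return np.array([[12.0 * x[0] ** 2 + 2.0, 0.0],
                     [0.0, 2.0]])

x = np.array([1.5, -2.0])
for _ in range(20):
    step = np.linalg.solve(hessian_f(x), grad_f(x))   # solve H s = grad instead of inverting H
    x = x - step
print(x)                                              # converges to the minimum at [0, 0]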

Quasi-Newton methods

In many applications, e.g. in machine learning and deep learning training, the number of parameters is usually in the order of thousands or millions. Therefore Newton’s method cannot be applied directly in such applications due to its numerical complexity.

Instead the quasi-Newton methods can be applied, in which the Hessian matrix is approximated by a matrix which needs much less computational effort. The matrix is chosen to satisfy the so-called secant equation

This leads to a significant reduction in the computational complexity.

The most prominent quasi-Newton method is the
Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. It uses the matrix only for determining the search direction, and it does not apply the quadratic step of the original Newton’s method. Instead a line search is performed. If the function is not convex then the step size is determined by finding a point satisfying the Wolfe conditions. The numerical complexity of the BFGS algorithm is . However the algorithm stores the matrix in each iteration step, leading to a memory requirement of . This makes it inappropriate for use in deep learning.

A further improvement of BFGS is the Limited-Memory BFGS (L-BFGS), whose memory requirement is to , depending on the amount of information from previous iterations used when updating the matrix . The method converges faster in a locally convex environment than in a non-convex one.
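In practice these methods are usually called through a library; a minimal sketch with SciPy, where the Rosenbrock test function, its gradient and the starting point are assumptions:

import numpy as np
from scipy.optimize import minimize

def f(x):
    return (1.0 - x[0]) ** 2 + 100.0 * (x[1] - x[0] ** 2) ** 2   # Rosenbrock test function

def grad_f(x):
    return np.array([-2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0] ** 2),
                     200.0 * (x[1] - x[0] ** 2)])

x0 = np.array([-1.2, 1.0])
res_bfgs = minimize(f, x0, jac=grad_f, method='BFGS')        # full BFGS
res_lbfgs = minimize(f, x0, jac=grad_f, method='L-BFGS-B')   # limited-memory variant
print(res_bfgs.x, res_lbfgs.x)                               # both close to [1, 1]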

Non-linear conjugate gradient algorithm

One potential problem of the gradient descent algorithm is that in a narrow valley it takes a zig-zag route, which slows down the convergence. This is because consecutive gradients are orthogonal and therefore part of the progress already achieved in the previous gradient direction is undone.
The conjugate gradient algorithm overcomes this problem by selecting the consecutive directions to be so-called conjugate gradient directions. In the case of an N-dimensional quadratic function it reaches the minimum after exactly steps, since it does not undo the progress made in previous directions. The next direction is determined as a linear combination of the current gradient and the current direction, in which the previous direction is weighted by . Since the next and current directions are conjugate, the next direction can be determined in a straightforward way with the help of the eigenvectors of the Hessian of the multivariate function to be minimized. The conjugate gradient algorithm was originally intended to solve a quadratic minimization problem, which is equivalent to solving a system of linear equations.

The non-linear conjugate gradient algorithm is a generalization of the conjugate gradient algorithm for finding a minimum of any non-linear multivariate function. In this case the algorithm does not stop after steps, so a small modification is required in order to restart the search occasionally in the direction of the unaltered gradient. This is performed by a periodic reset of to . In the case of large , which occurs e.g. in training deep learning models, the computation of the Hessian matrix requires a high computational effort. Therefore computationally more efficient ways were proposed for computing the parameter . Two of the best known formulas for computing it are given as follows:

  • Fletcher-Reeves formula:
  • Polak-Ribiere formula:

The pseudo-code of the algorithm is given in Algorithm 11.


Algorithm 11 Non-linear conjugate gradient algorithm
—————————————————————————————
Inputs: - multivariate function
- initial value
- precision value
Outputs:
- found local minimum place
- local minimum
—————————————————————————————
1 Initialisation ,
2 Initialize conjugate direction
3 while
4
5    Compute gradient
6    Update conjugate direction
7    Perform line search
8    Update as
9 end
10 return and
—————————————————————————————
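A minimal Python sketch of the non-linear conjugate gradient iteration with the Fletcher-Reeves choice of the weighting parameter; the test function, the backtracking constants and the reset period are assumptions:

import numpy as np

def f(x):
    return (x[0] - 1.0) ** 2 + 5.0 * (x[1] + 1.0) ** 2        # hypothetical test function

def grad_f(x):
    return np.array([2.0 * (x[0] - 1.0), 10.0 * (x[1] + 1.0)])

def backtracking(f, x, d, g, gamma=1.0, rho=0.5, c=1e-4):
    # Armijo backtracking line search along the direction d
    while f(x + gamma * d) > f(x) + c * gamma * (g @ d):
        gamma *= rho
    return gamma

x = np.array([0.0, 0.0])
g = grad_f(x)
d = -g                                   # initial direction: the unaltered negative gradient
n = len(x)
for k in range(200):
    if np.linalg.norm(g) < 1e-8:
        break
    if g @ d >= 0:                       # safeguard: fall back to steepest descent if needed
        d = -g
    gamma = backtracking(f, x, d, g)
    x = x + gamma * d
    g_new = grad_f(x)
    beta = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves formula for the weight
    d = -g_new + beta * d
    g = g_new
    if (k + 1) % n == 0:                 # periodic reset to the pure gradient direction
        d = -g
print(x)                                 # close to the minimum at [1, -1]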


The quasi-Newton methods potentially converge much faster than the non-linear conjugate gradient algorithm.