Linear Algebra and Algorithms
For a comprehensive treatment of linear algebra the reader is referred to the book [Lang(1987)].
Linear Algebra
Basic terms and definitions
A vector is a one-dimensional array of scalars, i.e. real or complex numbers. We will denote a vector by a lowercase bold letter, e.g. $\mathbf{a}$. In linear algebra two forms of vectors are distinguished: the row vector and the column vector. For example
$$\mathbf{a} = (\,a_1 \ \ a_2 \ \ \dots \ \ a_n\,)$$
is a row vector and
$$\mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}$$
is a column vector.
An $m \times n$ matrix is a two-dimensional array of scalars having $m$ rows and $n$ columns. The scalar elements of a matrix can be real or complex numbers. We consider only real matrices, i.e. matrices with real elements. Matrices will be denoted by uppercase bold letters. For example the matrix
$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{pmatrix}$$
is an $m \times n$ matrix. Matrices can be seen as a generalization of vectors, and thus vectors as special cases of matrices, where either the number of columns or the number of rows is one. So the $n$-element row vector is a $1 \times n$ matrix and the $m$-element column vector is an $m \times 1$ matrix.
A square matrix has the same number of rows and columns, i.e. it is of type $n \times n$, and $n$ is called the order of the square matrix. An $n \times n$ square matrix is obtained from the general matrix above by setting $m = n$.
A diagonal matrix is a special square matrix, in which only the diagonal elements can differ from $0$. An example for the diagonal matrix $\mathbf{D}$ is given as
$$\mathbf{D} = \begin{pmatrix} d_1 & 0 & \dots & 0 \\ 0 & d_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & d_n \end{pmatrix}$$
A diagonal matrix can also be given by means of the diag() operation, by listing only the diagonal elements of the matrix as its arguments. For example the diagonal matrix $\mathbf{D}$ can be given this way as
$$\mathbf{D} = \mathrm{diag}(d_1, d_2, \dots, d_n)$$
The element in the $i$-th row and $j$-th column of a matrix is referred to as the $(i,j)$-th element of that matrix and is denoted by $a_{ij}$. It is usual to construct a matrix with the help of the group operator $[\;\cdot\;]$. If $a_{ij}$ denotes a defining formula of double-indexed scalars for some range of indices $i$ and $j$, then the matrix $\mathbf{A}$ can be given by
$$\mathbf{A} = [\,a_{ij}\,],$$
which means that the matrix $\mathbf{A}$ is composed by grouping the scalars $a_{ij}$ along two dimensions, where $i$ describes the row index and $j$ the column index. Therefore the $(i,j)$-th element of the matrix $\mathbf{A}$ is set to $a_{ij}$ for every value of $i$ and $j$ in their given ranges. For example, if the double-indexed scalars $z_{ij}$ are defined as
$$z_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$
then defining the matrix $\mathbf{I}$ as
$$\mathbf{I} = [\,z_{ij}\,]$$
leads to
$$\mathbf{I} = \begin{pmatrix} 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{pmatrix}$$
The matrix $\mathbf{I}$ is called the $n \times n$ identity matrix. The $n \times n$ identity matrix for any $n$ is characterized by having the value $1$ in every diagonal position, while all its other elements are $0$. Therefore the identity matrix is a special diagonal matrix. Let $\mathbf{e}_i$ stand for the column vector having the value $1$ in its $i$-th position, while all its other elements are $0$. In an $n$-dimensional Euclidean space the vector $\mathbf{e}_i$ for $i = 1, \dots, n$ represents the unit vector in the $i$-th dimension. For example for $n = 3$ the unit vectors are given as
$$\mathbf{e}_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \qquad \mathbf{e}_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \qquad \mathbf{e}_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$$
Then the $n \times n$ identity matrix can also be expressed as a row vector of the unit vectors as
$$\mathbf{I} = (\,\mathbf{e}_1 \ \ \mathbf{e}_2 \ \ \dots \ \ \mathbf{e}_n\,)$$
Elementary matrix operations
An elementary univariate operation on matrices is the transpose operation. The transpose of a matrix $\mathbf{A}$ is defined by exchanging its $(i,j)$-th element with its $(j,i)$-th element for each pair of $i$ and $j$ in their given ranges. The result is called the transposed matrix, and the transpose of a given matrix $\mathbf{A}$ is denoted by $\mathbf{A}^T$. For example the transpose of the above defined matrix $\mathbf{A}$ is given by
$$\mathbf{A}^T = \begin{pmatrix} a_{11} & a_{21} & \dots & a_{m1} \\ a_{12} & a_{22} & \dots & a_{m2} \\ \vdots & \vdots & & \vdots \\ a_{1n} & a_{2n} & \dots & a_{mn} \end{pmatrix}$$
The transpose of an $m \times n$ matrix is an $n \times m$ matrix, and thus the transpose operation changes the shape of the matrix for $m \neq n$. The transpose of a square matrix remains a square matrix. The transpose of a $1 \times n$ row vector is an $n \times 1$ column vector and vice versa.
The multiplication of a matrix by a scalar is defined elementwise, i.e. by multiplying each element of the matrix by the given scalar. For example the matrix $\mathbf{A} = [\,a_{ij}\,]$ multiplied by the scalar $c$ gives
$$c\,\mathbf{A} = [\,c\,a_{ij}\,]$$
Two matrices can be added if they are of the same type, i.e. both are $m \times n$. Similarly a matrix can be subtracted from another one if they are of the same type. For example if the $m \times n$ matrix $\mathbf{B} = [\,b_{ij}\,]$ is given, then the matrix $\mathbf{C}$ as the sum $\mathbf{C} = \mathbf{A} + \mathbf{B}$ is given by
$$\mathbf{C} = [\,a_{ij} + b_{ij}\,]$$
Two matrices can be multiplied if the number of columns of the first matrix and the number of rows of the second one are the same. The multiplication of the $m \times n$ matrix $\mathbf{A}$ by the $n \times k$ matrix $\mathbf{B}$ gives an $m \times k$ product matrix $\mathbf{C}$. Let the $(i,j)$-th, $(j,l)$-th and $(i,l)$-th elements of the matrices $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ be $a_{ij}$, $b_{jl}$ and $c_{il}$, respectively. Then the matrix multiplication
$$\mathbf{C} = \mathbf{A}\,\mathbf{B}$$
is defined by the elements of $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ as
$$c_{il} = \sum_{j=1}^{n} a_{ij}\,b_{jl}, \qquad i = 1, \dots, m, \quad l = 1, \dots, k$$
Hence the $(i,l)$-th element of the product matrix $\mathbf{C}$ is the (scalar) product of the $i$-th row of the matrix $\mathbf{A}$ and the $l$-th column of the matrix $\mathbf{B}$, both consisting of $n$ elements. A small numerical illustration of the matrix product is given in the code sketch below.
Let $\mathbf{1}$ stand for the column vector having every element set to the value $1$ (the all-ones vector). Multiplying a matrix from the right by $\mathbf{1}$ gives the row sums of that matrix in column-vector form. For example
$$\mathbf{A}\,\mathbf{1} = \Big[\,\sum_{j=1}^{n} a_{ij}\,\Big]$$
gives the row sums of the matrix $\mathbf{A}$.
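As a quick numerical illustration of these elementary operations, here is a minimal NumPy sketch; the matrices are chosen here for demonstration and are not the example matrices of the original figures.

```python
import numpy as np

# Illustrative matrices (chosen for demonstration; any compatible shapes work).
A = np.array([[1., 2., 3.],
              [4., 5., 6.]])        # a 2x3 matrix
B = np.array([[1., 0.],
              [0., 1.],
              [2., 1.]])            # a 3x2 matrix

print(A.T)            # transpose: a 3x2 matrix
print(3 * A)          # multiplication by the scalar 3, elementwise
print(A + 2 * A)      # addition of matrices of the same type
C = A @ B             # matrix product: (2x3)(3x2) -> 2x2
print(C)

ones = np.ones((3, 1))
print(A @ ones)       # row sums of A as a column vector
```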
The addition of matrices is commutative and associative, i.e. the following relations hold for any $m \times n$ matrices $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$:
$$\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}, \qquad (\mathbf{A} + \mathbf{B}) + \mathbf{C} = \mathbf{A} + (\mathbf{B} + \mathbf{C})$$
Matrix multiplication is distributive with respect to matrix addition, and matrix multiplication is associative, i.e. the following relations hold for any $m \times n$ matrix $\mathbf{A}$, $n \times k$ matrices $\mathbf{B}$ and $\mathbf{C}$ and $k \times p$ matrix $\mathbf{D}$:
$$\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}, \qquad (\mathbf{A}\mathbf{B})\mathbf{D} = \mathbf{A}(\mathbf{B}\mathbf{D})$$
However matrix multiplication is in general, except for some special cases, NOT commutative, i.e. for $n \times n$ matrices $\mathbf{A}$ and $\mathbf{B}$ in general
$$\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$$
Therefore the multiplication/product of the matrices $\mathbf{A}$ and $\mathbf{B}$ is not fully specified by naming the two matrices. Instead one should speak about multiplying the matrix $\mathbf{A}$ by $\mathbf{B}$ from the right (by default) or from the left.
The special classes of matrices for which commutativity holds include
- multiplication by the identity matrix $\mathbf{I}$: $\mathbf{A}\mathbf{I} = \mathbf{I}\mathbf{A} = \mathbf{A}$,
- multiplication of diagonal matrices $\mathbf{D}_1$ and $\mathbf{D}_2$ with each other: $\mathbf{D}_1\mathbf{D}_2 = \mathbf{D}_2\mathbf{D}_1$,
- multiplication of two powers of the same matrix $\mathbf{A}$: $\mathbf{A}^{p}\mathbf{A}^{q} = \mathbf{A}^{q}\mathbf{A}^{p}$ and
- multiplication of two polynomials of the same matrix $\mathbf{A}$, $p(\mathbf{A})$ and $q(\mathbf{A})$: $p(\mathbf{A})\,q(\mathbf{A}) = q(\mathbf{A})\,p(\mathbf{A})$, which follows from the commutativity of two powers of the same matrix by repeated application of the distributivity of matrix multiplication.
Further useful relations are
$$(\mathbf{A}^T)^T = \mathbf{A}, \qquad (\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T, \qquad (\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T$$
The major difference of the elementary matrix operations compared to their scalar counterparts is that matrix multiplication is not commutative in general. This has far-reaching consequences for matrix theory.
Linear independence of vectors
The expression
$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \dots + c_k\mathbf{v}_k$$
is called a linear combination of the vectors $\mathbf{v}_1, \dots, \mathbf{v}_k$, and the scalars $c_1, \dots, c_k$ are its weights. It is called linear, since the operations used, vector addition and multiplication by a constant, are linear.
For example each vector $\mathbf{v} = (v_1 \ v_2 \ v_3)^T$ pointing to a point in the $3$-dimensional Euclidean space can be given as a linear combination of the vectors $\mathbf{e}_1$, $\mathbf{e}_2$ and $\mathbf{e}_3$ as
$$\mathbf{v} = v_1\mathbf{e}_1 + v_2\mathbf{e}_2 + v_3\mathbf{e}_3$$
We say that the vectors $\mathbf{e}_1$, $\mathbf{e}_2$ and $\mathbf{e}_3$ span the $3$-dimensional Euclidean space and hence generate the whole $3$-dimensional space. There are infinitely many other vector combinations which span the $3$-dimensional Euclidean space; the vectors do not necessarily have to be perpendicular to each other, like e.g. the vectors $\mathbf{e}_1$, $\mathbf{e}_1 + \mathbf{e}_2$ and $\mathbf{e}_1 + \mathbf{e}_2 + \mathbf{e}_3$.
However, if we consider the vectors $\mathbf{e}_1$, $\mathbf{e}_2$ and $\mathbf{a}_3 = \mathbf{e}_1 + \mathbf{e}_2$, then by the linear combinations of $\mathbf{e}_1$, $\mathbf{e}_2$ and $\mathbf{a}_3$ we can only cover the vectors in the plane with third coordinate equal to $0$. In this case we say that the vectors $\mathbf{e}_1$, $\mathbf{e}_2$ and $\mathbf{a}_3$ generate (or span) only a ($2$-dimensional) subspace of the whole $3$-dimensional Euclidean space. This is because the third vector $\mathbf{a}_3$ can be given as a linear combination of the other two as
$$\mathbf{a}_3 = \mathbf{e}_1 + \mathbf{e}_2$$
This phenomenon is characterised by saying that the vectors $\mathbf{e}_1$, $\mathbf{e}_2$ and $\mathbf{a}_3$ are not linearly independent of each other. In fact not only $\mathbf{a}_3$ can be given as a linear combination of the other two, but any of the three vectors can be given as a linear combination of the other two. The vectors $\mathbf{e}_1$, $\mathbf{e}_2$ and $\mathbf{a}_3$ are linearly dependent.
However, the vectors $\mathbf{e}_1$, $\mathbf{e}_2$ and $\mathbf{e}_3$ behave in another way: none of them can be given as a linear combination of the other two. We say that the vectors $\mathbf{e}_1$, $\mathbf{e}_2$ and $\mathbf{e}_3$ are linearly independent. Among the vectors $\mathbf{e}_1$, $\mathbf{e}_2$ and $\mathbf{a}_3$ only any two of them are linearly independent of each other.
It can be seen that $n$ given $n$-dimensional vectors determine the whole $n$-dimensional space if and only if they are linearly independent. If they are linearly dependent and only $r$ of them are linearly independent, then they generate (or span) only an $r$-dimensional subspace of the whole $n$-dimensional space.
If the vectors are collected in a matrix, for example each vector put in a different row of the matrix, then the maximum number of linearly independent row vectors of the matrix is called the rank of that matrix. The vectors can also be put in the different columns of a matrix, in which case the maximum number of linearly independent column vectors of the matrix is the rank of that matrix. It can be seen that the maximum number of linearly independent rows and the maximum number of linearly independent columns of a matrix are the same, so each matrix has a unique rank. The rank of the matrix $\mathbf{A}$ is denoted by $\mathrm{rank}(\mathbf{A})$. Based on the above, the rank of a matrix is the dimension of the subspace generated by its columns or rows as vectors.
It follows that the statements below hold for the rank of a matrix.
- The rank of an $m \times n$ matrix $\mathbf{A}$ is at most the smaller of $m$ and $n$, in other words $\mathrm{rank}(\mathbf{A}) \leq \min(m, n)$.
- The rank of the $n \times n$ square matrix $\mathbf{A}$ is at most its order, in other words $\mathrm{rank}(\mathbf{A}) \leq n$.
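The rank can also be computed numerically; a short NumPy sketch, using the vectors $\mathbf{e}_1$, $\mathbf{e}_2$ and $\mathbf{e}_1 + \mathbf{e}_2$ from the discussion above:

```python
import numpy as np

# Three vectors in 3-dimensional space, stored as the columns of a matrix.
# a3 = e1 + e2, so the columns are linearly dependent and the rank is 2.
e1 = np.array([1., 0., 0.])
e2 = np.array([0., 1., 0.])
a3 = e1 + e2
A = np.column_stack([e1, e2, a3])

print(np.linalg.matrix_rank(A))   # 2: the columns span only a 2-dimensional subspace

E = np.eye(3)                     # e1, e2, e3 as columns
print(np.linalg.matrix_rank(E))   # 3: linearly independent, they span the whole space
```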
Determinant
The determinant is a scalar assigned to a square matrix. It depends on every element of the matrix, hence the determinant is a scalar function of the square matrix. The determinant of the $n \times n$ matrix $\mathbf{A}$ is denoted by $\det(\mathbf{A})$ (or $|\mathbf{A}|$).
Expression of determinant - Leibniz formula
The determinant is a sum of signed products, where each product is composed as a multiplication of elements taken from each row of the square matrix, each of them at a different column position, and every such product is included in the sum. In other words the determinant of the $n \times n$ matrix $\mathbf{A}$ is given by
$$\det(\mathbf{A}) = \sum_{\sigma} \mathrm{sgn}(\sigma) \prod_{i=1}^{n} a_{i\sigma(i)}$$
where $\sigma$ is a permutation of the column indices $1, 2, \dots, n$ and the sign function $\mathrm{sgn}(\sigma)$ assigns $+1$ to a permutation $\sigma$ if it can be created by an even number of exchanges of two numbers starting from $1, 2, \dots, n$, and otherwise it gives $-1$. The number of products in the sum equals the number of possible permutations, which is $n!$. This expression of the determinant is called the Leibniz formula.
Determinant of a $2 \times 2$ matrix
Let the $2 \times 2$ matrix $\mathbf{A}$ be given as
$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$$
For $n = 2$ there are only two permutations, therefore the expression of $\det(\mathbf{A})$ can be given as
$$\det(\mathbf{A}) = a_{11} a_{22} - a_{12} a_{21}$$
Determinant of a $3 \times 3$ matrix - Sarrus rule
Let the $3 \times 3$ matrix $\mathbf{A}$ be given as $\mathbf{A} = [\,a_{ij}\,]$, in other words
$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}$$
For $n = 3$ there are $3! = 6$ permutations, three of them encountered with sign $+1$ and the other three with $-1$. The expression of $\det(\mathbf{A})$ can be given as
$$\det(\mathbf{A}) = a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32} - a_{13}a_{22}a_{31} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33}$$
This expression can be easily memorized with the help of the Sarrus rule, see Figure 12. Copying the first two columns of the matrix to the right of it, the products with positive sign can be obtained along the diagonals going from top to bottom and to the right, while the products with negative sign are the ones going from bottom to top and to the right.
Unfortunately the scheme of Sarrus cannot be generalized to higher dimensions.
Geometric interpretation
The columns of the $n \times n$ matrix $\mathbf{A}$ can be interpreted as vectors in the $n$-dimensional Euclidean space. For $n = 2$ these vectors span a parallelogram, and the determinant is exactly the signed value of the area of this parallelogram. This is shown in Figure 13.
For $n = 3$ the column vectors of $\mathbf{A}$, denoted by $\mathbf{a}_1$, $\mathbf{a}_2$ and $\mathbf{a}_3$, form a parallelepiped and the determinant gives the signed volume of this parallelepiped, see Figure 14.
This interpretation holds also for higher dimensions. So the column vectors of the $n \times n$ matrix $\mathbf{A}$ span a parallelotope in the $n$-dimensional space and the determinant of $\mathbf{A}$ gives its signed $n$-dimensional volume.
Properties of determinant
Determinant of several special matrices
- The determinant of the identity matrix $\mathbf{I}$ is $\det(\mathbf{I}) = 1$.
- The determinant of the diagonal matrix $\mathbf{D} = \mathrm{diag}(d_1, \dots, d_n)$ is $\det(\mathbf{D}) = d_1 d_2 \cdots d_n$.
The determinant of the $n \times n$ matrix $\mathbf{A}$ has the following properties regarding row and column manipulations.
- Multiplying the matrix $\mathbf{A}$ by a constant $c$ results in a multiplication of the determinant by $c^n$, in other words $\det(c\mathbf{A}) = c^n \det(\mathbf{A})$.
- Multiplying any row or any column of $\mathbf{A}$ by a constant $c$ leads to a multiplication of the determinant of $\mathbf{A}$ by $c$. In other words the determinant of the modified matrix is $c\,\det(\mathbf{A})$.
- Exchanging two rows or two columns of $\mathbf{A}$ leads to a multiplication of the determinant by $-1$.
- Adding a scalar multiple of another row to a row of $\mathbf{A}$, or adding a scalar multiple of another column to a column of $\mathbf{A}$, does not change the value of the determinant.
Further useful properties of the determinant are
$$\det(\mathbf{A}^T) = \det(\mathbf{A}), \qquad \det(\mathbf{A}\mathbf{B}) = \det(\mathbf{A})\,\det(\mathbf{B})$$
Laplace expansion
The minor $M_{ij}$ of the $n \times n$ matrix $\mathbf{A}$ is defined as the determinant of the $(n-1) \times (n-1)$ submatrix of $\mathbf{A}$, which is obtained by omitting the $i$-th row and the $j$-th column of the matrix $\mathbf{A}$, for $i, j = 1, \dots, n$. The determinant of the matrix $\mathbf{A}$ can be expressed by expanding it along its $i$-th row. This gives a sum of signed products of each element of the $i$-th row with its corresponding minor. In other words
$$\det(\mathbf{A}) = \sum_{j=1}^{n} (-1)^{i+j}\, a_{ij}\, M_{ij}$$
This is called the Laplace expansion along the $i$-th row of the matrix $\mathbf{A}$.
As an example we show the Laplace expansion of a $3$-dimensional square matrix $\mathbf{A}$ with general notation along its 1st row, which leads to
$$\det(\mathbf{A}) = a_{11}(a_{22}a_{33} - a_{23}a_{32}) - a_{12}(a_{21}a_{33} - a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31})$$
The Laplace expansion along the $j$-th column of the matrix $\mathbf{A}$ can also be defined in a similar way.
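The Laplace expansion translates directly into a recursive procedure. The following sketch (illustrative only, with factorial complexity) expands along the first row and compares the result with NumPy's determinant routine; the test matrix is chosen here for demonstration.

```python
import numpy as np

def det_laplace(A):
    """Determinant by Laplace expansion along the first row (illustrative, O(n!))."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        # Minor M_{1j}: delete row 0 and column j.
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
        total += (-1) ** j * A[0, j] * det_laplace(minor)
    return total

A = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
print(det_laplace(A))       # 8.0
print(np.linalg.det(A))     # agrees (up to rounding)
```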
Determinant and linear independence
One of the most important uses of the determinant is its relation to linear independence. The determinant of the $n \times n$ matrix $\mathbf{A}$ is non-zero if and only if the rows (and columns) of the matrix $\mathbf{A}$ are linearly independent.
This can be seen e.g. with the help of the geometric interpretation of the determinant. If the column vectors of the matrix $\mathbf{A}$ are linearly independent then they generate the whole $n$-dimensional space and thus the $n$-volume of the parallelotope spanned by them must be non-zero, and therefore also the determinant is non-zero. However, if the column vectors of the matrix $\mathbf{A}$ are linearly dependent, then the column vectors of the matrix $\mathbf{A}$ generate only an $r$-dimensional subspace of the whole $n$-dimensional space with $r < n$, which implies that the $n$-volume of the parallelotope spanned by them must be zero and hence also the determinant is zero.
Inverse matrix
We consider the question whether a matrix exists which, multiplied by the square matrix $\mathbf{A}$, results in the identity matrix. It would be an analogue of the reciprocal number in the set of real (or complex) numbers. Immediately the following questions arise:
- What is the condition for the existence of such a matrix? Not every number has a reciprocal in the set of real numbers either (namely the number zero).
- If such matrices exist for multiplying $\mathbf{A}$ both from the left and from the right, are they the same? This question is due to the general non-commutativity of matrix multiplication.
The adjugate matrix
The adjugate matrix of the $n \times n$ matrix $\mathbf{A}$ is denoted by $\mathrm{adj}(\mathbf{A})$, and it is defined as the transpose of the matrix of the signed minors. More precisely $\mathrm{adj}(\mathbf{A})$ is given by its elements as
$$[\mathrm{adj}(\mathbf{A})]_{ij} = (-1)^{i+j}\, M_{ji}$$
The importance of the adjugate matrix lies in the following relation, which holds for every square matrix $\mathbf{A}$:
$$\mathbf{A}\,\mathrm{adj}(\mathbf{A}) = \mathrm{adj}(\mathbf{A})\,\mathbf{A} = \det(\mathbf{A})\,\mathbf{I}$$
The inverse matrix
The $n \times n$ matrix $\mathbf{A}$ is an invertible matrix if there exists a matrix which, multiplied by $\mathbf{A}$ from the left, gives the identity matrix and, multiplied by $\mathbf{A}$ from the right, also gives the identity matrix. This matrix is called the inverse matrix of $\mathbf{A}$ and is denoted by $\mathbf{A}^{-1}$. Thus the inverse matrix satisfies the following defining relation:
$$\mathbf{A}\,\mathbf{A}^{-1} = \mathbf{A}^{-1}\,\mathbf{A} = \mathbf{I}$$
Dividing the above relation for the adjugate matrix by $\det(\mathbf{A})$ leads to the expression of the inverse matrix as
$$\mathbf{A}^{-1} = \frac{1}{\det(\mathbf{A})}\,\mathrm{adj}(\mathbf{A})$$
Since the adjugate matrix always exists, the inverse matrix $\mathbf{A}^{-1}$ exists if and only if the determinant of $\mathbf{A}$ is non-zero. A square matrix which is not invertible is also called a singular matrix. Similarly, an invertible matrix is also called non-singular. Therefore
- A square matrix is singular if and only if its determinant is zero.
- A square matrix is non-singular if and only if its determinant is non-zero.
Inverse of a $2 \times 2$ matrix
Let the $2 \times 2$ matrix $\mathbf{A}$ be given as
$$\mathbf{A} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$
The determinant of $\mathbf{A}$ can be given as $\det(\mathbf{A}) = ad - bc$. Due to $n = 2$ all the minors are scalars. Hence the inverse matrix $\mathbf{A}^{-1}$ can be expressed as
$$\mathbf{A}^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$
This can be checked by multiplying $\mathbf{A}$ by $\mathbf{A}^{-1}$:
$$\mathbf{A}\,\mathbf{A}^{-1} = \frac{1}{ad - bc} \begin{pmatrix} ad - bc & 0 \\ 0 & ad - bc \end{pmatrix} = \mathbf{I},$$
which is $\mathbf{I}$ as expected. It is easy to remember the numerator of the inverse of a $2 \times 2$ matrix: the values in the main diagonal ($a$ and $d$) are exchanged and the values in the secondary diagonal ($b$ and $c$) are multiplied by $-1$.
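A small sketch checking the $2 \times 2$ adjugate formula against NumPy's general inverse routine (the test matrix is chosen here for demonstration):

```python
import numpy as np

def inv_2x2(A):
    """Inverse of a 2x2 matrix via the adjugate formula A^{-1} = adj(A) / det(A)."""
    a, b = A[0, 0], A[0, 1]
    c, d = A[1, 0], A[1, 1]
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular matrix: determinant is zero")
    adj = np.array([[d, -b],
                    [-c, a]])
    return adj / det

A = np.array([[4., 7.],
              [2., 6.]])
print(inv_2x2(A))
print(np.linalg.inv(A))    # same result
print(A @ inv_2x2(A))      # identity matrix (up to rounding)
```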
Inverse of a $3 \times 3$ matrix
Let the $3 \times 3$ matrix $\mathbf{A}$ be given as $\mathbf{A} = [\,a_{ij}\,]$, in other words
$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}$$
Due to $n = 3$ all the minors are determinants of $2 \times 2$ matrices. Hence the inverse matrix $\mathbf{A}^{-1}$ can be expressed as
$$\mathbf{A}^{-1} = \frac{1}{\det(\mathbf{A})} \begin{pmatrix} M_{11} & -M_{21} & M_{31} \\ -M_{12} & M_{22} & -M_{32} \\ M_{13} & -M_{23} & M_{33} \end{pmatrix}$$
Properties of the inverse matrix
Inverse matrix of several special matrices
- The inverse matrix of the identity matrix is itself: $\mathbf{I}^{-1} = \mathbf{I}$.
- The inverse matrix of a diagonal matrix is also diagonal: for $\mathbf{D} = \mathrm{diag}(d_1, \dots, d_n)$ with $d_i \neq 0$, $i = 1, \dots, n$: $\mathbf{D}^{-1} = \mathrm{diag}(1/d_1, \dots, 1/d_n)$.
Further useful properties of the inverse matrix are
$$(\mathbf{A}^{-1})^{-1} = \mathbf{A}, \qquad (\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}, \qquad (\mathbf{A}^T)^{-1} = (\mathbf{A}^{-1})^T, \qquad \det(\mathbf{A}^{-1}) = \frac{1}{\det(\mathbf{A})}$$
Linear systems of equations
A linear system of equations consisting of $n$ unknowns $x_1, \dots, x_n$ and $m$ equations can be given as
$$\begin{aligned} a_{11} x_1 + a_{12} x_2 + \dots + a_{1n} x_n &= b_1 \\ a_{21} x_1 + a_{22} x_2 + \dots + a_{2n} x_n &= b_2 \\ &\ \,\vdots \\ a_{m1} x_1 + a_{m2} x_2 + \dots + a_{mn} x_n &= b_m \end{aligned}$$
This can also be called an $m \times n$ linear system of equations. By introducing the $m \times n$ matrix $\mathbf{A}$, the $n \times 1$ column vector $\mathbf{x}$ and the $m \times 1$ column vector $\mathbf{b}$ as
$$\mathbf{A} = [\,a_{ij}\,], \qquad \mathbf{x} = [\,x_j\,], \qquad \mathbf{b} = [\,b_i\,],$$
the system of linear equations can be rewritten in matrix-vector form as
$$\mathbf{A}\,\mathbf{x} = \mathbf{b}$$
Here $\mathbf{A}$ is the coefficient matrix, $\mathbf{x}$ is the column vector of unknowns (the unknown column vector) and $\mathbf{b}$ is a given column vector.
Reformulation - linear combination of column vectors of matrix
The above system of linear equations can be rewritten with the help of the group operator $[\;\cdot\;]$ on the index $i$ as
$$\Big[\,\sum_{j=1}^{n} a_{ij} x_j\,\Big] = [\,b_i\,],$$
where each side is a column vector due to the application of the group operator on the index $i$. The order of grouping on the index $i$ and summing can be exchanged on the left-hand side of the equation, which gives
$$\sum_{j=1}^{n} [\,a_{ij}\,]\, x_j = [\,b_i\,]$$
Introducing the column vectors composed from the columns of $\mathbf{A}$ as
$$\mathbf{a}_j = [\,a_{ij}\,], \qquad j = 1, \dots, n,$$
the relation can be further rearranged as
$$\sum_{j=1}^{n} \mathbf{a}_j\, x_j = \mathbf{b}$$
Based on this formulation, the problem of solving the above $m \times n$ linear system of equations can be seen as finding the weights $x_j$, $j = 1, \dots, n$, with which the given vector $\mathbf{b}$ can be composed as a linear combination of the $m$-dimensional vectors $\mathbf{a}_j$, $j = 1, \dots, n$.
This interpretation enables us to establish criteria for the solvability of the different cases of the $m \times n$ linear system of equations, depending on the relation between $m$ and $n$ as well as on the linear independence of the vectors $\mathbf{a}_1, \dots, \mathbf{a}_n$ and $\mathbf{b}$.
Solvability - general case
The following statements follow from the above linear combination interpretation for the solvability of the $m \times n$ linear system of equations.
- The linear system of equations is solvable if and only if the vector $\mathbf{b}$ can be composed as a linear combination of the vectors $\mathbf{a}_j$, $j = 1, \dots, n$, in other words if the vectors $\mathbf{a}_1, \dots, \mathbf{a}_n$ and the vector $\mathbf{b}$ are linearly dependent. This means that adding the vector $\mathbf{b}$ to the set of vectors $\mathbf{a}_j$, $j = 1, \dots, n$ does not increase the number of linearly independent vectors in the set. This is equivalent to the statement that the extended matrix $(\mathbf{A} \mid \mathbf{b})$, obtained by adding the column vector $\mathbf{b}$ as the $(n+1)$-th column to the matrix $\mathbf{A}$, has the same rank as the matrix $\mathbf{A}$.
- If $\mathrm{rank}(\mathbf{A}) = \mathrm{rank}(\mathbf{A} \mid \mathbf{b})$ then two cases must be distinguished according to the relation between $\mathrm{rank}(\mathbf{A})$ and $n$.
- If $\mathrm{rank}(\mathbf{A}) = n$ then the vectors $\mathbf{a}_j$, $j = 1, \dots, n$ are linearly independent, so the weights with which the vector $\mathbf{b}$ is composed from them are unique, therefore the system has a unique solution. Note that in this case $m \geq n$ must hold, since $\mathrm{rank}(\mathbf{A})$ cannot be greater than $m$.
- If $\mathrm{rank}(\mathbf{A}) = r < n$ then the vectors $\mathbf{a}_j$, $j = 1, \dots, n$ generate only an $r$-dimensional subspace, and the vector $\mathbf{b}$ can already be composed as a linear combination of $r$ linearly independent vectors among them. Therefore in this case $n - r$ unknowns can be freely selected and the system has infinitely many solutions. Note that in this case $m$ is only bounded from below by $r$, so it can also be smaller than $n$.
The different cases of solvability and their criteria are summarized in Table 1.
Solvability - square matrix
Several further findings can be obtained for the special case of an $n \times n$ square matrix $\mathbf{A}$. If $\mathrm{rank}(\mathbf{A}) = n$ then the matrix $\mathbf{A}$ is non-singular and $\det(\mathbf{A}) \neq 0$. Otherwise the matrix $\mathbf{A}$ is singular and $\det(\mathbf{A}) = 0$. Taking all these into account, the solvability conditions for the square matrix $\mathbf{A}$ can be summarized in a slightly simplified form, which is shown in Table 2.
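A minimal NumPy sketch for the non-singular square case (the coefficient matrix and right-hand side are chosen here for demonstration):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])
b = np.array([3., 5.])

# Non-singular coefficient matrix -> unique solution.
print(np.linalg.det(A))        # 5.0, non-zero
x = np.linalg.solve(A, b)      # preferred over forming A^{-1} explicitly
print(x)                       # [0.8, 1.4]
print(A @ x)                   # reproduces b
```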
Eigenvectors, eigenvalues, spectral decomposition
For a given $n \times n$ square matrix $\mathbf{A}$, finding the scalars $\lambda$ and non-zero $n \times 1$ column vectors $\mathbf{v}$ satisfying the equation
$$\mathbf{A}\,\mathbf{v} = \lambda\,\mathbf{v}$$
is called the eigenvalue problem of the matrix $\mathbf{A}$. The scalars $\lambda$ satisfying the equation are called the eigenvalues of the matrix $\mathbf{A}$ and the vectors $\mathbf{v}$ satisfying the equation are called the eigenvectors of the matrix $\mathbf{A}$. Transforming an eigenvector by multiplying it by $\mathbf{A}$ gives the vector $\mathbf{A}\mathbf{v}$, which is parallel to the considered eigenvector. Hence the transformation by $\mathbf{A}$ does not change the direction of the eigenvectors; their length is multiplied by the eigenvalue. This explains the names eigenvector and eigenvalue.
Characteristic polynomial
Rearranging the above relation gives
$$(\mathbf{A} - \lambda\,\mathbf{I})\,\mathbf{v} = \mathbf{0}$$
This is a homogeneous system of linear equations, which has a non-trivial solution only if the determinant of the coefficient matrix $\mathbf{A} - \lambda\mathbf{I}$ is zero, in other words, if
$$\det(\mathbf{A} - \lambda\,\mathbf{I}) = 0$$
Observe that $\lambda$ arises in each element of the main diagonal of the matrix $\mathbf{A} - \lambda\mathbf{I}$. It follows that the permutation giving the product of the main diagonal, and therefore also the determinant of this matrix, is an $n$-th order polynomial of $\lambda$. This $n$-th order polynomial of $\lambda$ is called the characteristic polynomial of the matrix $\mathbf{A}$. Its roots give the eigenvalues. According to the fundamental theorem of algebra, an $n$-th order polynomial has exactly $n$ roots in the set of complex numbers. It follows that there exist $n$ complex eigenvalues, implying at most $n$ real eigenvalues, some of which can arise multiple times.
Algebraic multiplicity and geometric multiplicity
Let $\lambda_i$, $i = 1, \dots, k$, denote the distinct eigenvalues of the matrix $\mathbf{A}$. The number of times the eigenvalue $\lambda_i$ arises as a root of the characteristic polynomial is called the algebraic multiplicity of $\lambda_i$. For each $\lambda_i$ the homogeneous system
$$(\mathbf{A} - \lambda_i\,\mathbf{I})\,\mathbf{v} = \mathbf{0}$$
determines the eigenvectors belonging to the eigenvalue $\lambda_i$. Due to the homogeneous character of the system, its solution must have at least one free parameter. This is manifested in the fact that any constant multiple of an eigenvector is also a solution of the system. Hence an eigenvector is determined only up to a multiplicative constant, i.e. only the direction (in the $n$-dimensional Euclidean space) of the eigenvector is determined, but not its length. The number of linearly independent eigenvectors belonging to the eigenvalue $\lambda_i$ can be one or more, depending on the rank of the matrix $\mathbf{A} - \lambda_i\mathbf{I}$. In fact the number of linearly independent eigenvectors belonging to the eigenvalue $\lambda_i$ equals the number of freely selectable parameters, which is $n - \mathrm{rank}(\mathbf{A} - \lambda_i\mathbf{I})$; this is called the geometric multiplicity of $\lambda_i$.
An example for determining eigenvalues and eigenvectors
The matrix $\mathbf{A}$ is given as

- Computing the eigenvalues of the matrix: the roots $\lambda_1$ and $\lambda_2$ of the characteristic polynomial $\det(\mathbf{A} - \lambda\mathbf{I}) = 0$ are determined.
- Determining the eigenvectors of the matrix:
- The eigenvector $\mathbf{v}_1$ belonging to $\lambda_1$ is obtained by solving $(\mathbf{A} - \lambda_1\mathbf{I})\,\mathbf{v}_1 = \mathbf{0}$.
- The eigenvector $\mathbf{v}_2$ corresponding to $\lambda_2$ is obtained by solving $(\mathbf{A} - \lambda_2\mathbf{I})\,\mathbf{v}_2 = \mathbf{0}$; any non-zero constant multiple of $\mathbf{v}_2$ is also a valid choice, since the eigenvectors are only determined up to a multiplicative constant.
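A short NumPy sketch of the eigenvalue problem; the matrix below is an illustrative choice and not the matrix of the worked example above.

```python
import numpy as np

# An illustrative symmetric 2x2 matrix.
A = np.array([[2., 1.],
              [1., 2.]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are the right eigenvectors
print(eigvals)                        # eigenvalues 3 and 1 (the returned order may vary)
for i in range(len(eigvals)):
    v = eigvecs[:, i]
    # A v = lambda v, so the residual below is (numerically) the zero vector.
    print(A @ v - eigvals[i] * v)
```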
Left and right eigenvalues and eigenvectors
So far we considered the form of the eigenvalue problem in which the vector $\mathbf{v}$ is a column vector and hence stands on the right side of the matrix $\mathbf{A}$. For this reason this eigenvalue problem is also called the right eigenvalue problem, and the $\lambda$-s and $\mathbf{v}$-s are also called the right eigenvalues and right eigenvectors of the matrix $\mathbf{A}$, respectively.
There is a similar eigenvalue problem, in which the vector $\mathbf{u}$ is a row vector and arises on the left-hand side of the matrix $\mathbf{A}$ as
$$\mathbf{u}\,\mathbf{A} = \lambda\,\mathbf{u}$$
and it is called the left eigenvalue problem. The solutions $\lambda$ and $\mathbf{u}$ are called the left eigenvalues and left eigenvectors of the matrix $\mathbf{A}$, respectively.
The two kinds of eigenvalue problems can be related by transposing them into each other. For example transposing the right eigenvalue problem gives
$$(\mathbf{A}\,\mathbf{v})^T = \mathbf{v}^T\,\mathbf{A}^T = \lambda\,\mathbf{v}^T$$
This shows the relation between the two kinds of eigenvalue problems, which can be formulated as follows. The right eigenvalues and eigenvectors of the square matrix $\mathbf{A}$ are the left eigenvalues and (transposed) eigenvectors of the transpose of the matrix $\mathbf{A}$, respectively. It follows that for a symmetric matrix $\mathbf{S}$ the right and the left eigenvalues and eigenvectors are the same, respectively, due to $\mathbf{S} = \mathbf{S}^T$.
Properties of eigenvalues
Let $\lambda$ and $\lambda_i$ stand for the eigenvalues and the $i$-th eigenvalue, $i = 1, \dots, n$, of the $n \times n$ matrix $\mathbf{A}$, respectively. Useful properties of the eigenvalues are given as
$$\prod_{i=1}^{n} \lambda_i = \det(\mathbf{A}), \qquad \lambda_i(\mathbf{A}^k) = \lambda_i^k, \qquad \lambda_i(\mathbf{A}^{-1}) = \frac{1}{\lambda_i}\ \ (\text{if } \det(\mathbf{A}) \neq 0), \qquad \lambda_i(\mathbf{A}^T) = \lambda_i(\mathbf{A})$$
Spectral decomposition
The spectral decomposition of the $n \times n$ square matrix $\mathbf{A}$ is a factorization into the canonical form
$$\mathbf{A} = \mathbf{V}\,\boldsymbol{\Lambda}\,\mathbf{V}^{-1},$$
where $\boldsymbol{\Lambda}$ is a diagonal matrix. The spectral decomposition is also called eigendecomposition or diagonalisation.
The matrices arising in the factorization can be interpreted as follows. The diagonal elements of the diagonal matrix $\boldsymbol{\Lambda}$ are the right eigenvalues of the matrix $\mathbf{A}$. Note that each eigenvalue arises in the diagonal as many times as its algebraic multiplicity. The column vectors of the matrix $\mathbf{V}$ are the right eigenvectors $\mathbf{v}_i$ belonging to the eigenvalues $\lambda_i$, $i = 1, \dots, n$. Thus $\lambda_i$ and $\mathbf{v}_i$ satisfy
$$\mathbf{A}\,\mathbf{v}_i = \lambda_i\,\mathbf{v}_i, \qquad i = 1, \dots, n,$$
which can be arranged in matrix form as
$$\mathbf{A}\,\mathbf{V} = \mathbf{V}\,\boldsymbol{\Lambda}$$
Note that the fact that each $\mathbf{v}_i$ is determined only up to a constant does not influence the canonical form of the spectral decomposition. This can be seen by replacing $\mathbf{v}_i$ by $c_i\mathbf{v}_i$, in matrix form $\mathbf{V}$ by $\mathbf{V}\mathbf{C}$ with $\mathbf{C} = \mathrm{diag}(c_1, \dots, c_n)$, in the canonical form, which leads to
$$\mathbf{V}\mathbf{C}\,\boldsymbol{\Lambda}\,(\mathbf{V}\mathbf{C})^{-1} = \mathbf{V}\mathbf{C}\,\boldsymbol{\Lambda}\,\mathbf{C}^{-1}\mathbf{V}^{-1} = \mathbf{V}\,\boldsymbol{\Lambda}\,\mathbf{C}\mathbf{C}^{-1}\mathbf{V}^{-1} = \mathbf{V}\,\boldsymbol{\Lambda}\,\mathbf{V}^{-1} = \mathbf{A},$$
where we utilized the commutativity of the diagonal matrices $\mathbf{C}$ and $\boldsymbol{\Lambda}$.
An example for spectral decomposition
Continuing the above example for determining eigenvalues and eigenvectors of the matrix $\mathbf{A}$, we now show the spectral decomposition of the matrix $\mathbf{A}$.
So far we have determined the right eigenvalues $\lambda_1$ and $\lambda_2$, and the right eigenvectors $\mathbf{v}_1$ and $\mathbf{v}_2$ belonging to them.
Composition of the diagonal matrix $\boldsymbol{\Lambda}$: the diagonal elements of the matrix $\boldsymbol{\Lambda}$ are the eigenvalues $\lambda_1$ and $\lambda_2$, thus $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2)$.
Composition of the matrix $\mathbf{V}$: the columns of the matrix $\mathbf{V}$ are the eigenvectors $\mathbf{v}_1$ and $\mathbf{v}_2$, thus $\mathbf{V} = (\,\mathbf{v}_1 \ \ \mathbf{v}_2\,)$.
Determination of the inverse of the matrix $\mathbf{V}$: the inverse $\mathbf{V}^{-1}$ can be computed e.g. with the $2 \times 2$ inverse formula given earlier.
The spectral decomposition of the matrix $\mathbf{A}$ is then $\mathbf{A} = \mathbf{V}\,\boldsymbol{\Lambda}\,\mathbf{V}^{-1}$.
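A minimal NumPy sketch of the spectral decomposition and of computing a matrix power through it (the matrix is an illustrative choice):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])             # illustrative diagonalisable matrix

eigvals, V = np.linalg.eig(A)        # right eigenvalues and eigenvectors
Lam = np.diag(eigvals)               # diagonal matrix Lambda
V_inv = np.linalg.inv(V)

# Canonical form A = V Lambda V^{-1}.
print(V @ Lam @ V_inv)               # reproduces A (up to rounding)

# A power of A computed through the decomposition: A^5 = V Lambda^5 V^{-1}.
print(V @ np.diag(eigvals ** 5) @ V_inv)
print(np.linalg.matrix_power(A, 5))  # same result
```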

Conditions for the existence of spectral decomposition
Note that not every square matrix is diagonalisable. Two useful conditions for the $n \times n$ matrix $\mathbf{A}$ to be diagonalisable can be given as
- The spectral decomposition of the matrix $\mathbf{A}$ exists if and only if the number of linearly independent right eigenvectors of $\mathbf{A}$ equals $n$.
- A sufficient condition for the existence of the spectral decomposition of the matrix $\mathbf{A}$ is that its characteristic polynomial has no repeated roots.
Based on the first condition, the canonical form of the spectral decomposition can be derived. The column vectors of $\mathbf{V}$ are linearly independent, so $\mathbf{V}$ is non-singular and hence the inverse matrix $\mathbf{V}^{-1}$ exists. Multiplying the matrix-form relation $\mathbf{A}\mathbf{V} = \mathbf{V}\boldsymbol{\Lambda}$ by $\mathbf{V}^{-1}$ from the right gives
$$\mathbf{A}\,\mathbf{V}\,\mathbf{V}^{-1} = \mathbf{V}\,\boldsymbol{\Lambda}\,\mathbf{V}^{-1},$$
from which $\mathbf{V}\mathbf{V}^{-1} = \mathbf{I}$ falls out on the left-hand side, resulting in the canonical form of the spectral decomposition $\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}$.
Application of spectral decomposition
Spectral decomposition is used in many areas of the natural sciences. Below we list several typical examples for using spectral decomposition.
- Calculating a power of the matrix $\mathbf{A}$:
$$\mathbf{A}^k = \mathbf{V}\,\boldsymbol{\Lambda}^k\,\mathbf{V}^{-1}, \qquad \boldsymbol{\Lambda}^k = \mathrm{diag}(\lambda_1^k, \dots, \lambda_n^k)$$
- Expressing the inverse of the matrix $\mathbf{A}$:
$$\mathbf{A}^{-1} = \mathbf{V}\,\boldsymbol{\Lambda}^{-1}\,\mathbf{V}^{-1}$$
Such an expression of the inverse of the matrix $\mathbf{A}$ exists if and only if $\lambda_i \neq 0$ for every $i$.
- Expressing a matrix function $f(\mathbf{A})$ of the matrix $\mathbf{A}$. Here $f(\mathbf{A})$ is defined with the help of the power series of $f()$, which leads to
$$f(\mathbf{A}) = \mathbf{V}\,f(\boldsymbol{\Lambda})\,\mathbf{V}^{-1}, \qquad f(\boldsymbol{\Lambda}) = \mathrm{diag}\big(f(\lambda_1), \dots, f(\lambda_n)\big)$$
- Showing the convergence of $\mathbf{A}^k$ as $k \to \infty$.
Matrix norms
Eigenvalues can be considered a set of numbers characterizing the magnitude of a square matrix. Often the eigenvalue with the highest magnitude has outstanding importance. Besides them, there is a need to characterize the magnitude of a square matrix by only one number, analogously to the absolute value of a real or complex number.
A matrix norm is a measure characterizing the magnitude of a square matrix by one number. The matrix norm is denoted by $\|\cdot\|$, e.g. the norm of the square matrix $\mathbf{A}$ is $\|\mathbf{A}\|$.
Definition of several matrix norms
In contrast to the uniqueness of the absolute value, several different matrix norms can be defined. Let $a_{ij}$ be the $(i,j)$-th element of the $n \times n$ square matrix $\mathbf{A}$. Then several matrix norms can be given as
- Max norm: $\|\mathbf{A}\|_{\max} = \max_{i,j} |a_{ij}|$. Note that there is another, similar norm which is also called the max norm.
- Max row sum norm: $\|\mathbf{A}\|_{\infty} = \max_{1 \leq i \leq n} \sum_{j=1}^{n} |a_{ij}|$.
- Frobenius norm: $\|\mathbf{A}\|_{F} = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}^2}$.
General properties of matrix norms
Matrix norms are useful tools, e.g. in theoretical derivations such as proving the existence of an upper bound. In fact, in most cases the concrete form of the matrix norm is not of interest; instead some general properties of the matrix norm are utilized. These important general properties of a matrix norm are listed for $n \times n$ square matrices $\mathbf{A}$ and $\mathbf{B}$ as
$$\|\mathbf{A}\| \geq 0, \qquad \|\mathbf{A}\| = 0 \ \Leftrightarrow\ \mathbf{A} = \mathbf{0}, \qquad \|c\,\mathbf{A}\| = |c|\,\|\mathbf{A}\|, \qquad \|\mathbf{A} + \mathbf{B}\| \leq \|\mathbf{A}\| + \|\mathbf{B}\|$$
Useful matrix norms also have the following additional (submultiplicative) property:
$$\|\mathbf{A}\,\mathbf{B}\| \leq \|\mathbf{A}\|\,\|\mathbf{B}\|$$
All three matrix norms defined above have this additional property. Let $a_{ij}$ and $b_{ij}$ be the $(i,j)$-th elements of the $n \times n$ square matrices $\mathbf{A}$ and $\mathbf{B}$, respectively. Then the additional property can be shown for these matrix norms by bounding the elements of the product matrix $\mathbf{A}\mathbf{B}$.
Several typical applications of matrix norms are given below.
- To show that the norm of a matrix expression is bounded from above.
- To show that $\mathbf{A}^k \to \mathbf{0}$ as $k \to \infty$, when $\|\mathbf{A}\| < 1$.
- To show that the series $\sum_{k} \mathbf{A}^k$ is convergent, when $\|\mathbf{A}\| < 1$.
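A short NumPy sketch of the matrix norms discussed above and of the convergence of $\mathbf{A}^k$ for $\|\mathbf{A}\| < 1$ (the matrices are illustrative choices):

```python
import numpy as np

A = np.array([[1., -2.],
              [3.,  4.]])

print(np.max(np.abs(A)))              # max norm (largest absolute element): 4.0
print(np.linalg.norm(A, ord=np.inf))  # max row sum norm: |3| + |4| = 7.0
print(np.linalg.norm(A, ord='fro'))   # Frobenius norm: sqrt(1 + 4 + 9 + 16)

# A matrix with norm below 1: its powers converge to the zero matrix.
B = np.array([[0.2, 0.1],
              [0.0, 0.3]])
print(np.linalg.norm(B, ord=np.inf))  # 0.3 < 1
print(np.linalg.matrix_power(B, 50))  # essentially the zero matrix
```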
Application of linear algebra
Solving systems of equations
Systems of linear equations
Let us consider the system of linear equations with $n$ unknowns and $n$ equations, i.e. with an $n \times n$ square matrix $\mathbf{A}$. This system of linear equations can be written in matrix-vector form as
$$\mathbf{A}\,\mathbf{x} = \mathbf{b}$$
Inhomogeneous system with $\det(\mathbf{A}) \neq 0$
Let us consider the case with $\mathbf{b} \neq \mathbf{0}$ and $\det(\mathbf{A}) \neq 0$, which is an important subclass of systems of linear equations. Then there is always a unique solution, since the rank cannot change by extending $\mathbf{A}$ to $(\mathbf{A} \mid \mathbf{b})$, due to $\mathrm{rank}(\mathbf{A}) = n$. The solution can be given in closed form by multiplying the matrix-vector equation by $\mathbf{A}^{-1}$ from the left, which gives the closed-form solution as
$$\mathbf{x} = \mathbf{A}^{-1}\,\mathbf{b}$$
Note that the inverse matrix exists in this case, since the matrix $\mathbf{A}$ is non-singular due to $\det(\mathbf{A}) \neq 0$.
Homogeneous system with $\det(\mathbf{A}) = 0$
Recall that the equation
$$\mathbf{A}\,\mathrm{adj}(\mathbf{A}) = \det(\mathbf{A})\,\mathbf{I}$$
holds for the matrix $\mathbf{A}$. If $\det(\mathbf{A}) = 0$ then this leads to
$$\mathbf{A}\,\mathrm{adj}(\mathbf{A}) = \mathbf{0}$$
Let $\mathbf{r}_j$ denote the $j$-th column vector of $\mathrm{adj}(\mathbf{A})$, $j = 1, \dots, n$. It follows that any column vector $\mathbf{r}_j$ is a solution of the homogeneous equation $\mathbf{A}\mathbf{x} = \mathbf{0}$ with $\det(\mathbf{A}) = 0$. If additionally $\mathrm{rank}(\mathbf{A}) = n - 1$, then there is only one free parameter. Since any constant multiple of a column vector $\mathbf{r}_j$ is also a solution, it follows that
- $c\,\mathbf{r}_j$, $c \in \mathbb{R}$ gives every solution of the homogeneous equation $\mathbf{A}\mathbf{x} = \mathbf{0}$ with $\det(\mathbf{A}) = 0$ and $\mathrm{rank}(\mathbf{A}) = n - 1$, and
- the columns of $\mathrm{adj}(\mathbf{A})$ are constant multiples of each other if $\mathrm{rank}(\mathbf{A}) = n - 1$.
System of non-linear equations
Let us consider the following system of non-linear equations with $n$ unknowns and $n$ equations:
$$f_i(x_1, \dots, x_n) = 0, \qquad i = 1, \dots, n$$
Introducing the notations
$$\mathbf{x} = (x_1, \dots, x_n)^T, \qquad \mathbf{f}(\mathbf{x}) = \big(f_1(\mathbf{x}), \dots, f_n(\mathbf{x})\big)^T,$$
this can be rewritten as
$$\mathbf{f}(\mathbf{x}) = \mathbf{0}$$
Solving this system of non-linear equations is equivalent to the minimization problem
$$\min_{\mathbf{x}}\ g(\mathbf{x}) = \min_{\mathbf{x}}\ \mathbf{f}(\mathbf{x})^T\,\mathbf{f}(\mathbf{x}),$$
since the square of each element in $\mathbf{f}(\mathbf{x})$ is non-negative. The gradient of $g(\mathbf{x})$ can be expressed as
$$\nabla g(\mathbf{x}) = 2\,\mathbf{J}(\mathbf{x})^T\,\mathbf{f}(\mathbf{x}),$$
where $\mathbf{J}(\mathbf{x})$ is the Jacobian matrix of $\mathbf{f}$. Then the minimization problem can be solved numerically by applying the gradient descent algorithm. This can be executed by initializing $\mathbf{x}$ as $\mathbf{x}^{(0)}$ and performing the iterative steps
$$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \epsilon\,\nabla g(\mathbf{x}^{(t)}) = \mathbf{x}^{(t)} - 2\,\epsilon\,\mathbf{J}(\mathbf{x}^{(t)})^T\,\mathbf{f}(\mathbf{x}^{(t)}),$$
where $\epsilon$ is the step size and can be determined by any of the line search methods (see in 3.2.3).
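A minimal sketch of this approach, assuming a small illustrative system with solutions at $x_1 = x_2 = \pm\sqrt{2}$ and a fixed small step size instead of a line search; the function names, constants and starting point are chosen here for demonstration.

```python
import numpy as np

# Illustrative system: f1 = x1^2 + x2^2 - 4 = 0, f2 = x1 - x2 = 0.
def f(x):
    return np.array([x[0]**2 + x[1]**2 - 4.0, x[0] - x[1]])

def jacobian(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [1.0,       -1.0]])

x = np.array([2.0, 1.0])      # initial value
eps = 0.01                    # fixed step size (a line search could be used instead)
for _ in range(5000):
    grad = 2.0 * jacobian(x).T @ f(x)   # gradient of g(x) = f(x)^T f(x)
    x = x - eps * grad

print(x)        # approaches (sqrt(2), sqrt(2)) ~ (1.414, 1.414)
print(f(x))     # residuals close to zero
```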
Numeric optimization
Gradient methods
The gradient methods are numerical methods for unconstrained mathematical optimization. They can be applied to differentiable multivariate functions. They are usually formulated as finding the minimum of a multivariate function. All of them use the gradient of the multivariate function, which explains the name.
The basic idea of the gradient methods is to decrease the function value in each step of the algorithm. It is well known that at a given point $\mathbf{x}$ of the multivariate function $f(\mathbf{x})$, the infinitesimal change of the function value is the largest in the direction of the gradient at that point, $\nabla f(\mathbf{x})$. Therefore the steepest descent is in the opposite direction of the gradient. The iterative step of any gradient method takes a step in a direction $\mathbf{d}$ which is related to the negative gradient at the current point, $-\nabla f(\mathbf{x})$. The step size is denoted by $\epsilon$. Hence the iterative step of any gradient method can be given as
$$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \epsilon\,\mathbf{d}^{(t)}$$
Note that both the step size $\epsilon$ and the vector $\mathbf{d}$ can depend on the iteration step.
Requirements and conditions to decrease the function value in each iteration
In order to decrease the function value in each step, two requirements must be met.
(R1) It must be ensured that the vector $\mathbf{d}$ infinitesimally points into a descent direction.
(R2) The step size $\epsilon$ must be small enough to avoid overshooting the nearest local minimum and thus to avoid divergence.
The first requirement is equivalent to requiring that the negative gradient $-\nabla f(\mathbf{x})$ and the vector $\mathbf{d}$ lie on the same side of the hyperplane perpendicular to the gradient. In this case their scalar product is positive. This leads to the condition ensuring (R1) as
$$\nabla f(\mathbf{x})^T\,\mathbf{d} < 0$$
The second requirement can be fulfilled by formalising "small enough" through a demanded amount of decrease in the function value. This is done by the Armijo condition, demanding that the decrease in the function value should be at least proportional to the product of the step size $\epsilon$ and the directional derivative $\nabla f(\mathbf{x})^T\mathbf{d}$. Hence (R2) can be fulfilled by the Armijo condition, which is given as
$$f(\mathbf{x} + \epsilon\,\mathbf{d}) - f(\mathbf{x}) \leq c\,\epsilon\,\nabla f(\mathbf{x})^T\,\mathbf{d},$$
where $c \in (0, 1)$ is the constant of proportionality. In practice $c$ is usually set to small values, like e.g. $c = 10^{-4}$. Observe that on both sides of this inequality there are negative values.
Determination of the step size
If the step size were set too small then the convergence would become too slow. This leads to a third requirement:
(R3) The step size $\epsilon$ must be large enough to avoid slowing down the convergence.
The requirements (R2) and (R3) together ensure an optimal step size.
The three most important line search methods for determining the optimal step size are
- Exact line search
- Inexact line search
- Backtracking line search
The exact line search determines the optimal step size $\epsilon$ by minimizing the function value after taking the next step. This leads to a one-dimensional minimization problem as
$$\epsilon^{*} = \arg\min_{\epsilon \geq 0} f(\mathbf{x} + \epsilon\,\mathbf{d})$$
The exact line search is used only seldom in practice due to its high computational effort.
In the inexact line search the step size, starting from an initial value $\epsilon_0$, is iteratively decreased until the function value after the step becomes less than the current one, in other words until
$$f(\mathbf{x} + \epsilon\,\mathbf{d}) < f(\mathbf{x})$$
The backtracking line search searches for an "enough small" step size by utilizing the Armijo condition. Starting from an initial step size, the step size is iteratively multiplied by a properly selected contraction constant $\tau \in (0, 1)$ until the Armijo condition is fulfilled.
While the exact line search fulfils both requirements (R2) and (R3) and the inexact line search fulfils only (R2), the backtracking line search lies somewhere in between. Because of this and its simplicity, it is often used in practice.
The pseudo-code of an algorithm for the gradient method with backtracking line search can be seen in Algorithm 9.
Algorithm 9 Gradient method - with backtracking line search
—————————————————————————————
Inputs: - multivariate function $f(\mathbf{x})$
- initial value $\mathbf{x}^{(0)}$
- precision value $\delta$
Outputs:
- found local minimum place $\mathbf{x}^{*}$
- local minimum $f(\mathbf{x}^{*})$
—————————————————————————————
1 Initialisation: $t = 0$, $\mathbf{x}^{(t)} = \mathbf{x}^{(0)}$
2 while $\|\nabla f(\mathbf{x}^{(t)})\| > \delta$
3 $t = t + 1$
4 Compute gradient $\nabla f(\mathbf{x}^{(t)})$
5 Compute vector $\mathbf{d}^{(t)}$
—– Backtracking line search - begin —
6 Init backtracking line search: set $\epsilon = \epsilon_0$
7 while $f(\mathbf{x}^{(t)} + \epsilon\,\mathbf{d}^{(t)}) > f(\mathbf{x}^{(t)}) + c\,\epsilon\,\nabla f(\mathbf{x}^{(t)})^T\mathbf{d}^{(t)}$
8 $\epsilon = \tau\,\epsilon$
9 end
—– Backtracking line search - end —-
10 Update $\mathbf{x}$ as $\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \epsilon\,\mathbf{d}^{(t)}$
11 end
12 return $\mathbf{x}^{*} = \mathbf{x}^{(t)}$ and $f(\mathbf{x}^{*})$
—————————————————————————————
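A possible Python rendering of Algorithm 9 is sketched below; the helper names, default constants and the quadratic test function are chosen here for demonstration.

```python
import numpy as np

def gradient_method_backtracking(f, grad, x0, delta=1e-6, eps0=1.0,
                                 tau=0.5, c=1e-4, max_iter=10000):
    """Gradient method with backtracking (Armijo) line search; a sketch of Algorithm 9."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= delta:       # stopping criterion on the gradient norm
            break
        d = -g                               # steepest descent direction
        eps = eps0
        # Backtracking: shrink eps until the Armijo condition holds.
        while f(x + eps * d) > f(x) + c * eps * g @ d:
            eps *= tau
        x = x + eps * d                      # update step
    return x, f(x)

# Usage on an illustrative quadratic bowl.
f = lambda x: (x[0] - 1.0)**2 + 2.0 * (x[1] + 2.0)**2
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)])
x_min, f_min = gradient_method_backtracking(f, grad, x0=[0.0, 0.0])
print(x_min, f_min)   # close to (1, -2) and 0
```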
The gradient descent
The gradient descent algorithm is a special case of the gradient methods obtained by setting $\mathbf{d}$ to the steepest descent direction, in other words $\mathbf{d} = -\nabla f(\mathbf{x})$. Thus gradient descent is a first-order method and its iterative step can be given as
$$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \epsilon\,\nabla f(\mathbf{x}^{(t)})$$
In the context of deep learning the step size $\epsilon$ is also called the learning rate. The pseudo-code of the gradient descent algorithm is given in Algorithm 10.
Algorithm 10 Gradient descent algorithm
—————————————————————————————
Inputs: - multivariate function $f(\mathbf{x})$
- initial value $\mathbf{x}^{(0)}$
- precision value $\delta$
Outputs:
- found local minimum place $\mathbf{x}^{*}$
- local minimum $f(\mathbf{x}^{*})$
—————————————————————————————
1 Initialisation: $t = 0$, $\mathbf{x}^{(t)} = \mathbf{x}^{(0)}$
2 while $\|\nabla f(\mathbf{x}^{(t)})\| > \delta$
3 $t = t + 1$
4 Compute gradient $\nabla f(\mathbf{x}^{(t)})$
5 Update learning rate $\epsilon^{(t)}$
6 Update $\mathbf{x}$ as $\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \epsilon^{(t)}\,\nabla f(\mathbf{x}^{(t)})$
7 end
8 return $\mathbf{x}^{*} = \mathbf{x}^{(t)}$ and $f(\mathbf{x}^{*})$
—————————————————————————————
The convergence to a local minimum can be guaranteed under certain assumptions on the function $f$ and specific choices of the dependency of the learning rate $\epsilon^{(t)}$ on the iteration index $t$. If the function $f$ is convex, then the local minimum is also the global minimum.
The extension of gradient descent, the stochastic gradient descent algorithm, and its improved variants are the most commonly used numerical optimization algorithms for training deep neural networks.
Further special cases of gradient methods
Several further algorithms can be obtained as special cases of the gradient methods by setting the vector $\mathbf{d}$ in a special way. These include
- Diagonally scaled gradient descent and
- Newton's method.
The diagonally scaled gradient descent algorithm is obtained by setting $\mathbf{d} = -\mathbf{D}\,\nabla f(\mathbf{x})$, where $\mathbf{D}$ is a diagonal matrix. Note that just like the vector $\mathbf{d}$, the matrix $\mathbf{D}$ can also depend on the iteration step.
Newton's method, also called the Newton-Raphson method applied to optimization, can also be obtained by setting $\mathbf{d} = -\mathbf{H}^{-1}\,\nabla f(\mathbf{x})$ with step size $\epsilon = 1$, where $\mathbf{H}$ is the Hessian matrix. Since the Hessian matrix includes second-order derivatives of the function $f$, Newton's method is a second-order optimization method. For the class of convex functions Newton's method converges to the minimum quadratically fast.
Second-order algorithms
The second-order algorithms use, in their original formulation, the second-order information contained in the Hessian matrix. The following second-order methods are considered:
- Newton's method
- Quasi-Newton methods
- Non-linear conjugate gradient algorithm
Newton's method
Newton's method for local optimization is based on the second-order Taylor expansion of the function to be optimized. We apply it to minimize the function $f(\mathbf{x})$. The second-order Taylor expansion of $f$ around $\mathbf{x}^{(t)}$ is given as
$$f(\mathbf{x}) \approx \hat{f}(\mathbf{x}) = f(\mathbf{x}^{(t)}) + \nabla f(\mathbf{x}^{(t)})^T(\mathbf{x} - \mathbf{x}^{(t)}) + \frac{1}{2}(\mathbf{x} - \mathbf{x}^{(t)})^T\,\mathbf{H}(\mathbf{x}^{(t)})\,(\mathbf{x} - \mathbf{x}^{(t)})$$
The method chooses the next value of the parameter vector $\mathbf{x}^{(t+1)}$ as the minimum of the above Taylor-expansion form. It is well known that the necessary condition for $\mathbf{x}$ being the minimum is that the gradient of $\hat{f}(\mathbf{x})$ must be $\mathbf{0}$. Computing the gradient of $\hat{f}(\mathbf{x})$ with respect to $\mathbf{x}$ and setting it to $\mathbf{0}$ gives
$$\nabla f(\mathbf{x}^{(t)}) + \mathbf{H}(\mathbf{x}^{(t)})\,(\mathbf{x} - \mathbf{x}^{(t)}) = \mathbf{0},$$
from which the next value of the parameter vector can be expressed by setting $\mathbf{x}^{(t+1)} = \mathbf{x}$ as
$$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \mathbf{H}(\mathbf{x}^{(t)})^{-1}\,\nabla f(\mathbf{x}^{(t)}),$$
which is the same formula as we got by setting $\mathbf{d} = -\mathbf{H}^{-1}\nabla f(\mathbf{x})$ and $\epsilon = 1$ in the general iteration relation of the gradient methods. This recursive computational rule also utilizes the second-order information in the Hessian matrix. If the function to be minimized is locally quadratic then the method jumps in one step to the minimum. Otherwise it iterates in quadratic steps and hence converges faster than gradient descent. There are two major issues with Newton's method:
- Limitation - The convergence of the method is ensured only in locally convex regions. It can move in a wrong direction near a saddle point.
- Drawback - The numerical complexity of the method is $O(N^3)$ per iteration, which is higher than that of the first-order methods. Here $N$ is the number of parameters, i.e. the size of the vector $\mathbf{x}$. This is due to the computation of the inverse Hessian in each iteration step.
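A minimal sketch of Newton's method under the assumptions above (locally convex function, Hessian available); the test problem is a convex quadratic chosen here for demonstration, for which the minimum is reached in a single step.

```python
import numpy as np

def newton_method(grad, hess, x0, delta=1e-8, max_iter=50):
    """Newton's method for minimization; a sketch assuming a locally convex function."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= delta:
            break
        H = hess(x)
        # Newton step: solve H d = -g instead of forming the inverse Hessian explicitly.
        d = np.linalg.solve(H, -g)
        x = x + d
    return x

# Usage on a convex quadratic f = x1^2 + x1*x2 + 2*x2^2 - 3*x1 - 6*x2.
grad = lambda x: np.array([2.0 * x[0] + x[1] - 3.0, x[0] + 4.0 * x[1] - 6.0])
hess = lambda x: np.array([[2.0, 1.0], [1.0, 4.0]])
print(newton_method(grad, hess, x0=[10.0, -10.0]))   # approx (0.857, 1.286)
```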
Quasi-Newton methods
In many applications, like e.g. in machine learning and deep learning training, the number of parameters $N$ is usually in the magnitude of thousands to millions. Therefore Newton's method cannot be applied directly in such applications due to its numerical complexity.
Instead the quasi-Newton methods can be applied, in which the Hessian matrix is approximated by a matrix $\mathbf{B}$, which needs much less computational effort. The matrix $\mathbf{B}$ is chosen to satisfy the so-called secant equation
$$\mathbf{B}^{(t+1)}\,\big(\mathbf{x}^{(t+1)} - \mathbf{x}^{(t)}\big) = \nabla f(\mathbf{x}^{(t+1)}) - \nabla f(\mathbf{x}^{(t)})$$
This leads to a significant reduction in the computational complexity.
The most prominent quasi-Newton method is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. It uses the matrix $\mathbf{B}$ only for determining the search direction, but it does not apply the full Newton step like the original Newton's method. Instead a line search is performed. If the function is not convex then the step size is determined by finding a point satisfying the Wolfe conditions. The numerical complexity of the BFGS algorithm is $O(N^2)$ per iteration. However the algorithm stores the matrix $\mathbf{B}$ in each iteration step, leading to a memory need of $O(N^2)$. This makes it inappropriate for use in deep learning.
A further improvement of BFGS is the Limited-Memory BFGS (L-BFGS), whose memory need lies between $O(N)$ and $O(N^2)$, depending on the amount of information used from previous iterations when updating the matrix $\mathbf{B}$. The method converges faster in a locally convex environment than in a non-convex one.
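In practice a library implementation is typically used; a short sketch with SciPy's L-BFGS-B routine on the Rosenbrock test function (the function and starting point are chosen here for demonstration):

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock test function; L-BFGS only stores a limited history of gradient
# differences, so the memory need stays modest even for large N.
def f(x):
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def grad(x):
    return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                     200.0 * (x[1] - x[0]**2)])

result = minimize(f, x0=np.array([-1.2, 1.0]), jac=grad, method="L-BFGS-B")
print(result.x)   # converges to the minimum at (1, 1)
```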
Non-linear conjugate gradient algorithm
One potential problem of the gradient descent algorithm is that in a narrow valley it takes a zig-zag route, which slows down the convergence. This is because consecutive gradients are orthogonal and therefore some part of the progress already achieved in the previous gradient direction is undone.
The conjugate gradient algorithm overcomes this problem by selecting the consecutive directions to be so-called conjugate directions. In the case of an $N$-dimensional quadratic function it reaches the minimum after exactly $N$ steps, since it does not undo progress made in previous directions. The next direction is determined as a linear combination of the current gradient and the current direction, in which the previous direction is weighted by $\beta$. Since the next and the current directions are conjugate, the weight $\beta$ can be determined in a straightforward way with the help of the Hessian of the multivariate function to be minimized. The conjugate gradient algorithm was originally intended to solve a quadratic minimization problem, which is equivalent to solving a system of linear equations.
The non-linear conjugate gradient algorithm is a generalization of the conjugate gradient algorithm to find a minimum of any non-linear multivariate function. In this case the algorithm does not stop after $N$ steps, so a small modification is required in order to eventually restart the search in the direction of the unaltered gradient. This is performed by a periodic reset of $\beta$ to $0$. In the case of large $N$, which is the case e.g. in training deep learning models, the computation of the Hessian matrix requires a high computational effort. Therefore computationally more efficient ways were proposed to compute the parameter $\beta$. Two of the best-known formulas for computing $\beta$ are given as follows:
- Fletcher-Reeves formula:
$$\beta_{FR}^{(t)} = \frac{\nabla f(\mathbf{x}^{(t)})^T\,\nabla f(\mathbf{x}^{(t)})}{\nabla f(\mathbf{x}^{(t-1)})^T\,\nabla f(\mathbf{x}^{(t-1)})}$$
- Polak-Ribiere formula:
$$\beta_{PR}^{(t)} = \frac{\nabla f(\mathbf{x}^{(t)})^T\,\big(\nabla f(\mathbf{x}^{(t)}) - \nabla f(\mathbf{x}^{(t-1)})\big)}{\nabla f(\mathbf{x}^{(t-1)})^T\,\nabla f(\mathbf{x}^{(t-1)})}$$
The pseudo-code of the algorithm is given in Algorithm 11.
Algorithm 11 Non-linear conjugate gradient algorithm
—————————————————————————————
Inputs: - multivariate function $f(\mathbf{x})$
- initial value $\mathbf{x}^{(0)}$
- precision value $\delta$
Outputs:
- found local minimum place $\mathbf{x}^{*}$
- local minimum $f(\mathbf{x}^{*})$
—————————————————————————————
1 Initialisation: $t = 0$, $\mathbf{x}^{(t)} = \mathbf{x}^{(0)}$
2 Initialize conjugate direction $\mathbf{d}^{(0)} = -\nabla f(\mathbf{x}^{(0)})$
3 while $\|\nabla f(\mathbf{x}^{(t)})\| > \delta$
4 $t = t + 1$
5 Compute gradient $\nabla f(\mathbf{x}^{(t)})$
6 Update conjugate direction $\mathbf{d}^{(t)} = -\nabla f(\mathbf{x}^{(t)}) + \beta^{(t)}\,\mathbf{d}^{(t-1)}$
7 Perform line search for the step size $\epsilon^{(t)}$
8 Update $\mathbf{x}$ as $\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \epsilon^{(t)}\,\mathbf{d}^{(t)}$
9 end
10 return $\mathbf{x}^{*} = \mathbf{x}^{(t)}$ and $f(\mathbf{x}^{*})$
—————————————————————————————
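A possible Python rendering of the non-linear conjugate gradient algorithm with the Fletcher-Reeves weight, a periodic reset of $\beta$ and a backtracking line search is sketched below; the constants and the ill-conditioned quadratic test function are chosen here for demonstration.

```python
import numpy as np

def nonlinear_cg(f, grad, x0, delta=1e-6, max_iter=1000, restart=10):
    """Non-linear conjugate gradient (Fletcher-Reeves) with backtracking line search."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g                                   # initial direction: negative gradient
    for t in range(1, max_iter + 1):
        if np.linalg.norm(g) <= delta:
            break
        if g @ d >= 0:                       # safeguard: fall back to steepest descent
            d = -g
        # Backtracking line search along d (Armijo condition).
        eps, c, tau = 1.0, 1e-4, 0.5
        while f(x + eps * d) > f(x) + c * eps * g @ d:
            eps *= tau
        x = x + eps * d
        g_new = grad(x)
        # Fletcher-Reeves weight, reset periodically to the plain gradient direction.
        beta = 0.0 if t % restart == 0 else (g_new @ g_new) / (g @ g)
        d = -g_new + beta * d
        g = g_new
    return x

# Usage on an ill-conditioned quadratic, where plain gradient descent zig-zags.
f = lambda x: 0.5 * (x[0]**2 + 100.0 * x[1]**2)
grad = lambda x: np.array([x[0], 100.0 * x[1]])
print(nonlinear_cg(f, grad, x0=np.array([1.0, 1.0])))   # converges to the minimum at (0, 0)
```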
The quasi-Newton methods potentially converge much faster than the non-linear conjugate gradient algorithm.