This paper is a sequel to the preceding paper  .
The least squares method in Statistics plays an important role in almost all disciplines, from Natural Science to Social Science. When we want to find properties, tendencies or correlations hidden in huge and complicated data we usually employ this method. See for example  .
On the other hand, Deep Learning is the heart of Artificial Intelligence and will become one of the most important fields in Data Science in the near future. As to Deep Learning see for example  -  .
Deep Learning may be stated as a successive learning method based on the least squares method. Therefore, it is natural and instructive to reconsider the least squares method from the view point of Deep Learning. We carry out thoroughly the calculation of the successive approximation called the gradient descent sequence, in which a parameter called the learning rate plays an important role.
One of the main points is to determine the range of the learning rate, which is a very hard problem  . We showed in  that a difference in methods between Statistics and Deep Learning leads to different results when the learning rate changes.
We generalize the preceding results to the case of the least squares method by polynomial approximation. Our results may give a new insight into both Statistics and Data Science.
2. Least Squares Method
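Let us explain the least squares method by polynomial approximation  . The model function is a polynomial in $x$ of degree $M$ given by
$$f(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j. \tag{1}$$
For $N$ pieces of two-dimensional real data
$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$
we assume that their scatter plot is given like Figure 1.
The coefficients of (1)
$$\mathbf{w} = (w_0, w_1, \ldots, w_M)^{T} \tag{2}$$
must be determined by the data set ($T$ denotes the transposition of a vector or a matrix).
For this set of data the error function is given by
$$E(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N} \{y_i - f(x_i)\}^2 = \frac{1}{2}\sum_{i=1}^{N} \Big\{y_i - \sum_{j=0}^{M} w_j x_i^{\,j}\Big\}^2. \tag{3}$$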
Figure 1. Scatter plot.
The aim of the least squares method is to minimize the error function (3) with respect to $\mathbf{w}$ in (2). Usually the minimum is obtained by solving the simultaneous equations
$$\frac{\partial E}{\partial w_0} = \frac{\partial E}{\partial w_1} = \cdots = \frac{\partial E}{\partial w_M} = 0.$$
However, in this paper another approach based on a quadratic form is given, which is instructive.
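Let us calculate the error function (3). By using the definition of inner product
$$\langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a}^{T}\mathbf{b} = \sum_{i} a_i b_i$$
it is not difficult to see
$$E(\mathbf{w}) = \frac{1}{2}\langle \mathbf{y} - \Phi\mathbf{w},\ \mathbf{y} - \Phi\mathbf{w} \rangle = \frac{1}{2}(\mathbf{y} - \Phi\mathbf{w})^{T}(\mathbf{y} - \Phi\mathbf{w}), \tag{4}$$
where $\mathbf{y} = (y_1, y_2, \ldots, y_N)^{T}$ and $\Phi$ is the $N \times (M+1)$ matrix
$$\Phi = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^M \\ 1 & x_2 & x_2^2 & \cdots & x_2^M \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^M \end{pmatrix}.$$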
Here we make an important assumption.
Assumption $N \geq M+1$ and $\operatorname{rank}\Phi = M+1$ (full rank).
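Let us deform (4). From
$$(\mathbf{y} - \Phi\mathbf{w})^{T}(\mathbf{y} - \Phi\mathbf{w}) = \mathbf{w}^{T}\Phi^{T}\Phi\,\mathbf{w} - \mathbf{w}^{T}\Phi^{T}\mathbf{y} - \mathbf{y}^{T}\Phi\,\mathbf{w} + \mathbf{y}^{T}\mathbf{y}$$
we set for simplicity
$$A = \Phi^{T}\Phi, \qquad \mathbf{b} = \Phi^{T}\mathbf{y}, \qquad c = \mathbf{y}^{T}\mathbf{y}, \qquad \mathbf{x} = \mathbf{w}.$$
Namely, we have a general quadratic form
$$\mathbf{x}^{T}A\mathbf{x} - \mathbf{x}^{T}\mathbf{b} - \mathbf{b}^{T}\mathbf{x} + c. \tag{5}$$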
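On the other hand, the deformation of (5) is well-known.
Formula For a symmetric and invertible matrix $A$ (: $A^{T} = A$) we have
$$\mathbf{x}^{T}A\mathbf{x} - \mathbf{x}^{T}\mathbf{b} - \mathbf{b}^{T}\mathbf{x} + c = (\mathbf{x} - A^{-1}\mathbf{b})^{T}A(\mathbf{x} - A^{-1}\mathbf{b}) - \mathbf{b}^{T}A^{-1}\mathbf{b} + c. \tag{6}$$
The proof is easy. Since $(A^{-1})^{T} = (A^{T})^{-1} = A^{-1}$ we obtain
$$(\mathbf{x} - A^{-1}\mathbf{b})^{T}A(\mathbf{x} - A^{-1}\mathbf{b}) = \mathbf{x}^{T}A\mathbf{x} - \mathbf{x}^{T}\mathbf{b} - \mathbf{b}^{T}\mathbf{x} + \mathbf{b}^{T}A^{-1}\mathbf{b}$$
and this gives (6).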
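Therefore, our case becomes
$$E(\mathbf{w}) = \frac{1}{2}\Big\{ \big(\mathbf{w} - (\Phi^{T}\Phi)^{-1}\Phi^{T}\mathbf{y}\big)^{T}\,\Phi^{T}\Phi\,\big(\mathbf{w} - (\Phi^{T}\Phi)^{-1}\Phi^{T}\mathbf{y}\big) + \mathbf{y}^{T}\big(E_N - \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}\big)\mathbf{y} \Big\} \tag{7}$$
because $\Phi^{T}\Phi$ is symmetric and invertible by the assumption.
If we choose
$$\mathbf{w} = (\Phi^{T}\Phi)^{-1}\Phi^{T}\mathbf{y} \tag{8}$$
then the minimum is given by
$$E_{\min} = \frac{1}{2}\,\mathbf{y}^{T}\big(E_N - \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}\big)\mathbf{y} \tag{9}$$
where $E_N$ is the $N$-dimensional identity matrix.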
Our method is simple and clear (“smart” in our terminology).
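The closed-form minimizer of this section can be checked numerically. The following is a minimal sketch in plain Python, assuming the minimizer is the solution of the normal equations $(\Phi^{T}\Phi)\mathbf{w} = \Phi^{T}\mathbf{y}$, where $\Phi$ has rows $(1, x_i, \ldots, x_i^M)$; the function names and the sample data are hypothetical and serve only as an illustration.

```python
def design_matrix(xs, degree):
    """Rows (1, x, x^2, ..., x^degree); this plays the role of Phi."""
    return [[x ** j for j in range(degree + 1)] for x in xs]


def normal_equations(phi, ys):
    """Return A = Phi^T Phi and b = Phi^T y as plain nested lists."""
    cols = len(phi[0])
    a = [[sum(row[i] * row[j] for row in phi) for j in range(cols)]
         for i in range(cols)]
    b = [sum(row[i] * y for row, y in zip(phi, ys)) for i in range(cols)]
    return a, b


def solve(a, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - tail) / aug[i][i]
    return w


def least_squares(xs, ys, degree):
    """Coefficients (w_0, ..., w_M) minimizing the squared error."""
    a, b = normal_equations(design_matrix(xs, degree), ys)
    return solve(a, b)


# Hypothetical data lying exactly on y = 1 - 2x + 0.5x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 - 2.0 * x + 0.5 * x * x for x in xs]
w = least_squares(xs, ys, 2)
print(w)
```

Because the sample data lie exactly on a quadratic, the fit should reproduce its coefficients up to rounding. For large $M$ the normal equations become ill-conditioned, so this direct solve is meant only for small illustrative cases.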
3. Least Squares Method from Deep Learning
In this section we reconsider the least squares method in Section 2 from the view point of Deep Learning.
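First we arrange the data in Section 2 like
$$\text{Input data}: \{x_1, x_2, \ldots, x_N\}, \qquad \text{Teacher signal}: \{y_1, y_2, \ldots, y_N\}$$
and consider a simple neuron model in  (see Figure 2).
Here we use the polynomial (1) instead of the sigmoid function $\sigma(x)$.
In this case the square error function becomes
$$L(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N} \{y_i - f(x_i)\}^2. \tag{10}$$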
Figure 2. Simple neuron model.
In general we use $L(\mathbf{w})$ instead of $E(\mathbf{w})$ in (3).
Our aim is also to determine the parameters $\mathbf{w}$ in order to minimize $L(\mathbf{w})$. However, the procedure is different from that of the least squares method in Section 2. This is an important and interesting point.
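The parameters are determined successively by the gradient descent method (see for example  ): For $t = 0, 1, 2, \ldots$,
$$\mathbf{w}(t+1) = \mathbf{w}(t) - \varepsilon\,\frac{\partial L}{\partial \mathbf{w}}\big(\mathbf{w}(t)\big), \tag{11}$$
where
$$\frac{\partial L}{\partial \mathbf{w}} = \left(\frac{\partial L}{\partial w_0}, \frac{\partial L}{\partial w_1}, \ldots, \frac{\partial L}{\partial w_M}\right)^{T} \tag{12}$$
and $\varepsilon$ is a small parameter called the learning rate.
The initial value $\mathbf{w}(0)$ is given appropriately. Note that $t$ is discrete time and $T$ is the transposition.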
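Let us calculate (11) explicitly. Since
$$\frac{\partial L}{\partial w_j} = -\sum_{i=1}^{N} \{y_i - f(x_i)\}\,x_i^{\,j} = \big(\Phi^{T}\Phi\,\mathbf{w} - \Phi^{T}\mathbf{y}\big)_j \qquad (0 \leq j \leq M),$$
where $\Phi = (x_i^{\,j})$ is the $N \times (M+1)$ matrix appearing in (4), from (12) we have
$$\mathbf{w}(t+1) = (E - \varepsilon\,\Phi^{T}\Phi)\,\mathbf{w}(t) + \varepsilon\,\Phi^{T}\mathbf{y}, \tag{13}$$
where $E$ is the $(M+1)$-dimensional identity matrix.
This equation is easily solved to be
$$\mathbf{w}(t) = (E - \varepsilon\,\Phi^{T}\Phi)^{t}\,\mathbf{w}(0) + \big\{E - (E - \varepsilon\,\Phi^{T}\Phi)^{t}\big\}(\Phi^{T}\Phi)^{-1}\Phi^{T}\mathbf{y}. \tag{14}$$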
The proof is left to readers.
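For readers who want a hint, one possible route is sketched here. It assumes that the update is the affine recurrence $\mathbf{w}(t+1) = B\,\mathbf{w}(t) + \varepsilon\,\mathbf{b}$ with $A = \Phi^{T}\Phi$, $\mathbf{b} = \Phi^{T}\mathbf{y}$ and $B = E - \varepsilon A$; the shorthand $B$ is introduced only for this sketch.

```latex
\begin{aligned}
\mathbf{w}(t) &= B^{t}\,\mathbf{w}(0) + \varepsilon\,\bigl(E + B + \cdots + B^{t-1}\bigr)\,\mathbf{b}
  && \text{(iterate the recurrence $t$ times)} \\
\bigl(E + B + \cdots + B^{t-1}\bigr)(E - B) &= E - B^{t}, \qquad E - B = \varepsilon A
  && \text{(finite geometric series)} \\
\mathbf{w}(t) &= B^{t}\,\mathbf{w}(0) + \bigl(E - B^{t}\bigr)A^{-1}\mathbf{b}
  && \text{($A$ is invertible and commutes with $B$)}
\end{aligned}
```

Substituting $A = \Phi^{T}\Phi$ and $\mathbf{b} = \Phi^{T}\mathbf{y}$ back into the last line gives the solution in terms of the data.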
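Since this is not a final form let us continue the calculation. From (14) we have
$$\lim_{t\to\infty}(E - \varepsilon\,\Phi^{T}\Phi)^{t} = O \ \Longrightarrow\ \lim_{t\to\infty}\mathbf{w}(t) = (\Phi^{T}\Phi)^{-1}\Phi^{T}\mathbf{y} \tag{15}$$
where $O$ is the $(M+1)$-dimensional zero matrix. (15) is just the equation (8) and it is independent of $\varepsilon$.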
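Let us evaluate (14) further. The matrix $\Phi^{T}\Phi$ is positive definite, so all eigenvalues are positive. This can be shown as follows. Let us consider the eigenvalue equation
$$\Phi^{T}\Phi\,\mathbf{v} = \lambda\,\mathbf{v} \qquad (\mathbf{v} \neq \mathbf{0}).$$
Then we have
$$\lambda\,\langle \mathbf{v}, \mathbf{v} \rangle = \langle \mathbf{v}, \Phi^{T}\Phi\,\mathbf{v} \rangle = \langle \Phi\mathbf{v}, \Phi\mathbf{v} \rangle > 0 \ \Longrightarrow\ \lambda > 0,$$
where $\Phi\mathbf{v} \neq \mathbf{0}$ because $\Phi$ has full rank.
Therefore we can arrange all eigenvalues like
$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{M+1} > 0. \tag{16}$$
Since $\Phi^{T}\Phi$ is symmetric, it is diagonalized as
$$\Phi^{T}\Phi = Q D Q^{T} \tag{17}$$
where $Q$ is an element in $O(M+1)$ ($Q^{T}Q = QQ^{T} = E$) and $D$ is a diagonal matrix
$$D = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_{M+1}).$$
See for example  .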
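By substituting (17) into (14) and using the equation
$$(E - \varepsilon\,QDQ^{T})^{t} = Q\,(E - \varepsilon D)^{t}\,Q^{T}$$
we finally obtain
Theorem I The general solution (14) is written explicitly as
$$\mathbf{w}(t) = Q\,\operatorname{diag}\big((1-\varepsilon\lambda_1)^{t}, \ldots, (1-\varepsilon\lambda_{M+1})^{t}\big)\,Q^{T}\mathbf{w}(0) + Q\,\operatorname{diag}\Big(\frac{1-(1-\varepsilon\lambda_1)^{t}}{\lambda_1}, \ldots, \frac{1-(1-\varepsilon\lambda_{M+1})^{t}}{\lambda_{M+1}}\Big)\,Q^{T}\Phi^{T}\mathbf{y}. \tag{18}$$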
This is our main result.
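As a numerical sanity check, the sketch below iterates the gradient descent update and compares it with the closed form $\mathbf{w}(t) = B^{t}\mathbf{w}(0) + (E - B^{t})\mathbf{w}^{*}$, where $B = E - \varepsilon\,\Phi^{T}\Phi$ and $\mathbf{w}^{*}$ denotes the least squares solution. The sample data, the value of $\varepsilon$ and all function names are assumptions chosen for illustration; the data lie exactly on a line so that $\mathbf{w}^{*}$ is known in advance.

```python
def gram_and_rhs(xs, ys, degree):
    """Return A = Phi^T Phi and b = Phi^T y for the powers 1, x, ..., x^degree."""
    n = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(n)]
    return a, b


def mat_vec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def mat_mul(p, q):
    """Matrix-matrix product for nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def gradient_descent(a, b, w0, eps, steps):
    """Iterate w <- w - eps * (A w - b), the gradient of the square error."""
    w = list(w0)
    for _ in range(steps):
        aw = mat_vec(a, w)
        w = [wi - eps * (awi - bi) for wi, awi, bi in zip(w, aw, b)]
    return w


def closed_form(a, w_star, w0, eps, steps):
    """Evaluate B^t w(0) + (E - B^t) w_star with B = E - eps * A."""
    n = len(a)
    b_mat = [[(1.0 if i == j else 0.0) - eps * a[i][j] for j in range(n)]
             for i in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        power = mat_mul(power, b_mat)
    diff = [w0i - wsi for w0i, wsi in zip(w0, w_star)]
    return [wsi + di for wsi, di in zip(w_star, mat_vec(power, diff))]


# Hypothetical data lying exactly on y = 1 + 2x, so w_star = (1, 2).
xs = [0.0, 0.5, 1.0, 1.5]
w_star = [1.0, 2.0]
ys = [w_star[0] + w_star[1] * x for x in xs]
a, b = gram_and_rhs(xs, ys, 1)
w0 = [0.0, 0.0]
eps = 0.1  # assumed value; small enough for convergence on this data

w_iter = gradient_descent(a, b, w0, eps, 50)
w_formula = closed_form(a, w_star, w0, eps, 50)
w_late = gradient_descent(a, b, w0, eps, 400)
print(w_iter, w_formula, w_late)
```

The iterated and closed-form values should agree up to rounding at every step, and for this choice of $\varepsilon$ the iterates approach $\mathbf{w}^{*}$ as $t$ grows.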
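Next, let us show how to choose the learning rate $\varepsilon$, which is a very important problem in Deep Learning   .
Let us remember
$$\lim_{t\to\infty} r^{t} = 0 \iff |r| < 1.$$
From (16) and (18) the equations
$$\lim_{t\to\infty}(1 - \varepsilon\lambda_j)^{t} = 0 \qquad (1 \leq j \leq M+1) \tag{19}$$
determine the range of $\varepsilon$. Noting
$$|1 - \varepsilon\lambda_j| < 1 \iff 0 < \varepsilon < \frac{2}{\lambda_j}, \qquad \frac{2}{\lambda_1} \leq \frac{2}{\lambda_2} \leq \cdots \leq \frac{2}{\lambda_{M+1}},$$
we obtain
Theorem II The learning rate $\varepsilon$ must satisfy the inequality
$$0 < \varepsilon < \frac{2}{\lambda_1}. \tag{20}$$
Roughly speaking, the greater the value of $\varepsilon$, the faster the gradient descent (11) proceeds, so long as the convergence (19) is guaranteed. Let us note that the choice of the initial value $\mathbf{w}(0)$ is irrelevant when the convergence condition (20) is satisfied.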
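The threshold $2/\lambda_1$ can be observed directly in the smallest case $M = 1$, where the largest eigenvalue of the $2 \times 2$ matrix $\Phi^{T}\Phi$ has a closed form. The sketch below is an illustration under assumed sample data; the factors 0.9 and 1.1 and the helper names are hypothetical choices, not values from the paper.

```python
import math


def largest_eigenvalue_2x2(a):
    """Largest eigenvalue of a symmetric 2x2 matrix [[p, q], [q, r]]."""
    p, q, r = a[0][0], a[0][1], a[1][1]
    return 0.5 * (p + r + math.sqrt((p - r) ** 2 + 4.0 * q * q))


def run_gradient_descent(a, b, eps, steps):
    """Iterate w <- w - eps * (A w - b) from w(0) = (0, 0)."""
    w = [0.0, 0.0]
    for _ in range(steps):
        g0 = a[0][0] * w[0] + a[0][1] * w[1] - b[0]
        g1 = a[1][0] * w[0] + a[1][1] * w[1] - b[1]
        w = [w[0] - eps * g0, w[1] - eps * g1]
    return w


# Hypothetical data lying exactly on y = 1 + 2x, so the minimizer is (1, 2).
xs = [0.0, 0.5, 1.0, 1.5]
ys = [1.0 + 2.0 * x for x in xs]
a = [[float(len(xs)), sum(xs)],
     [sum(xs), sum(x * x for x in xs)]]
b = [sum(ys), sum(x * y for x, y in zip(xs, ys))]

lam1 = largest_eigenvalue_2x2(a)
threshold = 2.0 / lam1


def error(w):
    """Distance (max norm) from the known minimizer (1, 2)."""
    return max(abs(w[0] - 1.0), abs(w[1] - 2.0))


err_below = error(run_gradient_descent(a, b, 0.9 * threshold, 200))
err_above = error(run_gradient_descent(a, b, 1.1 * threshold, 200))
print(threshold, err_below, err_above)
```

With the learning rate slightly below the threshold the error shrinks toward zero, while slightly above it the component along the eigenvector of $\lambda_1$ is multiplied by a factor of modulus greater than one at every step and the error grows.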
Comment For example, if we choose $\varepsilon$ like
$$\frac{2}{\lambda_1} < \varepsilon < \frac{2}{\lambda_2}$$
then we cannot recover (15), which shows a difference in methods between Statistics and Deep Learning.
4. How to Estimate the Learning Rate
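How do we calculate $\lambda_1$? Since $\lambda_1, \ldots, \lambda_{M+1}$ are the eigenvalues of the matrix $\Phi^{T}\Phi$, they satisfy the equation
$$f(\lambda) = 0$$
where $f(\lambda)$ is the characteristic polynomial of $\Phi^{T}\Phi$ given by
$$f(\lambda) = \det\big(\lambda E - \Phi^{T}\Phi\big). \tag{21}$$
This is abstract, so let us deform (21). For simplicity we write the $N \times (M+1)$ matrix $\Phi = (x_i^{\,j})$ in terms of its column vectors as
$$\Phi = (\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_M), \qquad \mathbf{x}_j = (x_1^{\,j}, x_2^{\,j}, \ldots, x_N^{\,j})^{T}.$$
Then it is easy to see
$$\Phi^{T}\Phi = \big(\langle \mathbf{x}_i | \mathbf{x}_j \rangle\big)_{0 \leq i, j \leq M}$$
where the notation $\langle \cdot | \cdot \rangle$ is the (real) inner product of vectors.
For clarity let us write down (21) explicitly.
$$f(\lambda) = \begin{vmatrix} \lambda - \langle \mathbf{x}_0 | \mathbf{x}_0 \rangle & -\langle \mathbf{x}_0 | \mathbf{x}_1 \rangle & \cdots & -\langle \mathbf{x}_0 | \mathbf{x}_M \rangle \\ -\langle \mathbf{x}_1 | \mathbf{x}_0 \rangle & \lambda - \langle \mathbf{x}_1 | \mathbf{x}_1 \rangle & \cdots & -\langle \mathbf{x}_1 | \mathbf{x}_M \rangle \\ \vdots & \vdots & \ddots & \vdots \\ -\langle \mathbf{x}_M | \mathbf{x}_0 \rangle & -\langle \mathbf{x}_M | \mathbf{x}_1 \rangle & \cdots & \lambda - \langle \mathbf{x}_M | \mathbf{x}_M \rangle \end{vmatrix}.$$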
As far as we know there is no viable method to determine exactly the greatest root of $f(\lambda) = 0$ if $M$ is very large$^{1}$. Therefore, let us be satisfied with an approximate value which is both greater than $\lambda_1$ and easy to calculate.
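For this purpose Gerschgorin’s theorem is very useful$^{2}$. Let $A = (a_{ij})$ be an $n \times n$ complex (real in our case) matrix, and we set
$$R_i = \sum_{j \neq i} |a_{ij}|, \qquad D(a_{ii}; R_i) = \{\, z \in \mathbb{C} \ :\ |z - a_{ii}| \leq R_i \,\}$$
for each $i$. This is a closed disc centered at $a_{ii}$ with radius $R_i$, called the Gerschgorin disc.
Theorem (Gerschgorin  ) For any eigenvalue $\lambda$ of $A$ we have
$$\lambda \in \bigcup_{i=1}^{n} D(a_{ii}; R_i).$$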
The proof is simple. See for example  .
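For convenience, the standard argument is sketched here in the notation assumed above: $A = (a_{ij})$ is an $n \times n$ matrix, $R_i = \sum_{j \neq i} |a_{ij}|$, and $\mathbf{v}$ is an eigenvector belonging to $\lambda$.

```latex
\begin{aligned}
& A\mathbf{v} = \lambda\mathbf{v}, \quad \mathbf{v} \neq \mathbf{0}, \quad
  \text{choose } i \text{ with } |v_i| = \max_{k} |v_k| > 0 \\
& (\lambda - a_{ii})\,v_i = \sum_{j \neq i} a_{ij}\,v_j
  \qquad \text{($i$-th row of } A\mathbf{v} = \lambda\mathbf{v}) \\
& |\lambda - a_{ii}| \;\leq\; \sum_{j \neq i} |a_{ij}|\,\frac{|v_j|}{|v_i|}
  \;\leq\; \sum_{j \neq i} |a_{ij}| \;=\; R_i
\end{aligned}
```

Hence $\lambda$ lies in the disc belonging to the index $i$ of the largest component of $\mathbf{v}$, and therefore in the union of all discs.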
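Our case is real: for the matrix $\Phi^{T}\Phi = (\langle \mathbf{x}_i | \mathbf{x}_j \rangle)$ with $0 \leq i, j \leq M$ we have
$$a_{ii} = \langle \mathbf{x}_i | \mathbf{x}_i \rangle \quad \text{and} \quad R_i = \sum_{j \neq i} |\langle \mathbf{x}_i | \mathbf{x}_j \rangle|.$$
Therefore, all eigenvalues $\lambda_j$ satisfy
$$\lambda_j \in \bigcup_{i=0}^{M} \,[\,a_{ii} - R_i,\ a_{ii} + R_i\,]$$
where $[\,a_{ii} - R_i,\ a_{ii} + R_i\,]$ is a closed interval of the real line.
If we define
$$\Lambda = \max_{0 \leq i \leq M} (a_{ii} + R_i) = \max_{0 \leq i \leq M} \sum_{j=0}^{M} |\langle \mathbf{x}_i | \mathbf{x}_j \rangle| \tag{27}$$
then it is easy to see
$$\lambda_1 \leq \Lambda.$$
Thus we arrive at an admissible value of the learning rate which is easily obtained.
Theorem III An admissible value of $\varepsilon$ is
$$0 < \varepsilon < \frac{2}{\Lambda}. \tag{28}$$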
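The bound is cheap to compute: for the matrix of inner products $\langle \mathbf{x}_i | \mathbf{x}_j \rangle = \sum_k x_k^{\,i+j}$ it is the largest row sum of absolute values. The sketch below assumes this reading and, only for comparison, estimates the largest eigenvalue with a plain power iteration; the sample data, the symbol name `bound` and the helper names are hypothetical.

```python
def gram_matrix(xs, degree):
    """Matrix of inner products <x_i|x_j> = sum_k x_k^(i+j), 0 <= i, j <= degree."""
    n = degree + 1
    return [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]


def gershgorin_upper_bound(a):
    """max_i (a_ii + R_i), i.e. the largest absolute row sum (diagonal is >= 0)."""
    return max(sum(abs(v) for v in row) for row in a)


def largest_eigenvalue_estimate(a, iterations=200):
    """Rayleigh quotient after a plain power iteration (comparison only)."""
    v = [1.0] * len(a)
    for _ in range(iterations):
        w = [sum(aij * vj for aij, vj in zip(row, v)) for row in a]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    av = [sum(aij * vj for aij, vj in zip(row, v)) for row in a]
    return sum(vi * avi for vi, avi in zip(v, av))  # v has unit length


# Hypothetical sample points and degree M = 2.
xs = [0.0, 0.5, 1.0, 1.5]
a = gram_matrix(xs, 2)
bound = gershgorin_upper_bound(a)
lam1_estimate = largest_eigenvalue_estimate(a)
eps_admissible = 1.0 / bound  # any value in (0, 2 / bound) would do
print(bound, lam1_estimate, eps_admissible)
```

Computing the bound needs only sums of powers of the data, whereas the eigenvalue estimate requires an iterative procedure; the price is that $2/\Lambda$ may be noticeably smaller than $2/\lambda_1$.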
Let us show an example in the case of $M = 1$ (  ), which is very instructive for non-experts.
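Example In this case it is easy to see
$$\Phi^{T}\Phi = \begin{pmatrix} N & \sum_{i=1}^{N} x_i \\ \sum_{i=1}^{N} x_i & \sum_{i=1}^{N} x_i^2 \end{pmatrix}$$
and we set
$$\alpha = \sum_{i=1}^{N} x_i, \qquad \beta = \sum_{i=1}^{N} x_i^2$$
for simplicity. Moreover, we may assume $\alpha > 0$. Then from (21) we have
$$\lambda_1 = \frac{N + \beta + \sqrt{(N - \beta)^2 + 4\alpha^2}}{2}.$$
On the other hand, from (27) we have
$$\Lambda = \max\{N + \alpha,\ \alpha + \beta\}.$$
Then it is easy to show
$$\lambda_1 \leq \Lambda.$$
Checking this inequality is left to readers. Therefore, from (28) the admissible value becomes
$$0 < \varepsilon < \frac{2}{\max\{N + \alpha,\ \alpha + \beta\}}.$$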
We emphasize once more that $\Lambda$ is easy to evaluate, while calculating $\lambda_1$ is very hard if $M$ is large.
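For the case $M = 1$ the comparison between the largest eigenvalue and the Gerschgorin bound can be checked in a few lines. The sketch assumes the notation $\alpha = \sum_i x_i$, $\beta = \sum_i x_i^2$ and $\Lambda = \max\{N + |\alpha|,\ |\alpha| + \beta\}$ for the $2 \times 2$ matrix with diagonal entries $N$, $\beta$ and off-diagonal entry $\alpha$.

```latex
\begin{aligned}
\lambda_1 &= \frac{N + \beta + \sqrt{(N - \beta)^2 + 4\alpha^2}}{2} \\
\sqrt{(N - \beta)^2 + 4\alpha^2} &\leq |N - \beta| + 2|\alpha| \\
\lambda_1 &\leq \frac{N + \beta + |N - \beta|}{2} + |\alpha|
  = \max\{N, \beta\} + |\alpha|
  = \max\{N + |\alpha|,\ |\alpha| + \beta\} = \Lambda
\end{aligned}
```

Equality holds only when $\alpha = 0$ or $N = \beta$, so in general the bound is strict.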
5. Concluding Remarks
In this paper we have discussed the least squares method by polynomial approximation from the view point of Deep Learning and carried out the calculation of the gradient descent thoroughly. A difference in methods between Statistics and Deep Learning delivers different results when the learning rate is changed. As far as we know, Theorem III is the first result to provide an admissible value of $\varepsilon$.
Deep Learning plays an essential role in Data Science and maybe in almost all fields of Science. Therefore it is desirable for undergraduates to master it in the early stages. To master it they must study Calculus, Linear Algebra and Statistics from Mathematics. My textbook  is recommended.
We wish to thank Ryu Sasaki for useful suggestions and comments.
1 $\Phi^{T}\Phi$ is not a sparse matrix.
2 In my opinion this theorem is not so popular. Why?