The least squares method in Statistics plays an important role in almost all disciplines, from Natural Science to Social Science. When we want to find properties, tendencies or correlations hidden in large and complicated data, we usually employ this method. See for example .
On the other hand, Deep Learning is the heart of Artificial Intelligence and will become a most important field of Data Science in the near future. As to Deep Learning, see for example .
Deep Learning may be stated as a successive learning method based on the least squares method. It is therefore natural to reconsider the least squares method from the viewpoint of Deep Learning, and we carry out a thorough calculation of the resulting successive approximation, called the gradient descent sequence.
When the learning rate is changed, the difference in method between Statistics and Deep Learning gives different results.
Theorems I and II in the text are our main results, and a related problem (exercise) is presented for readers. Our results may give new insight into both Statistics and Data Science.
1. Least Squares Method
First of all, let us explain the least squares method for readers in a very simple setting. For n pieces of two-dimensional real data

$$\{(x_i, y_i) \mid i = 1, 2, \ldots, n\}$$

we assume that their scatter plot looks like Figure 1. Then a model function is linear:

$$y = ax + b. \qquad (1)$$

For this function the error (or loss) function is defined by

$$E(a, b) = \sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2. \qquad (2)$$
The aim of the least squares method is to minimize the error function (2) with respect to $(a, b)$. A little calculation gives

$$E(a, b) = \sum y_i^2 - 2a\sum x_i y_i - 2b\sum y_i + a^2\sum x_i^2 + 2ab\sum x_i + n b^2. \qquad (3)$$

Then the stationarity equations

$$\frac{\partial E}{\partial a} = 0, \qquad \frac{\partial E}{\partial b} = 0$$

give a linear equation for $a$ and $b$:

$$\begin{pmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{pmatrix}\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sum x_i y_i \\ \sum y_i \end{pmatrix} \qquad (4)$$
Figure 1. Scatter plot 1.
and its solution is given by

$$\begin{pmatrix} a \\ b \end{pmatrix} = \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2}\begin{pmatrix} n & -\sum x_i \\ -\sum x_i & \sum x_i^2 \end{pmatrix}\begin{pmatrix} \sum x_i y_i \\ \sum y_i \end{pmatrix}. \qquad (5)$$

Explicitly, we have

$$a = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}. \qquad (6)$$
To check that a and b give the minimum of (2) is a good exercise.
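As a quick sanity check, the closed-form solution (6) can be compared with a standard library fit. This is a minimal sketch assuming numpy is available; the sample data below are illustrative and not taken from the text.

```python
# Closed-form least squares solution (6), checked against np.polyfit.
import numpy as np

def least_squares_line(x, y):
    """Return (a, b) minimizing sum((y_i - (a*x_i + b))^2)."""
    n = len(x)
    sx, sy = x.sum(), y.sum()
    sxx, sxy = (x * x).sum(), (x * y).sum()
    det = n * sxx - sx * sx          # positive by (8) unless all x_i coincide
    a = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
a, b = least_squares_line(x, y)
a_ref, b_ref = np.polyfit(x, y, 1)   # solves the same minimization problem
print(a, b)
```

The two results should agree to machine precision, since both minimize the same quadratic error (2).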
Note We have an inequality

$$n\sum_{i=1}^{n} x_i^2 \ge \left(\sum_{i=1}^{n} x_i\right)^2 \qquad (7)$$

and the equal sign holds if and only if

$$x_1 = x_2 = \cdots = x_n,$$

which follows from the Cauchy–Schwarz inequality. Since $\{x_i\}$ are data, we may assume that $x_i \ne x_j$ for some $i \ne j$. Therefore

$$n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 > 0, \qquad (8)$$

so the matrix in (4) is invertible.
2. Least Squares Method from Deep Learning
In this section we reconsider the least squares method of Section 1 from the viewpoint of Deep Learning.
First we arrange the data in Section 1 as input data $x_1, x_2, \ldots, x_n$ with corresponding teacher data $y_1, y_2, \ldots, y_n$, and consider a simple neuron model (see Figure 2).
Figure 2. Simple neuron model 1.
Here we use the linear function (1) instead of the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$.
In this case the squared error function becomes

$$E(a, b) = \frac{1}{2}\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2. \qquad (9)$$

We usually use $\frac{1}{2}\sum$ instead of $\sum$ in (2), because the factor $\frac{1}{2}$ makes the gradient simpler.
Our aim is also to determine the parameters $(a, b)$ in order to minimize $E(a, b)$. However, the procedure is different from the least squares method in Section 1. This is an important and interesting point.
For later use let us perform a little calculation:

$$\frac{\partial E}{\partial a} = -\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)x_i, \qquad \frac{\partial E}{\partial b} = -\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right).$$
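The partial derivatives above can be verified numerically with central finite differences. A small sketch with made-up data (not from the text):

```python
# Finite-difference check of dE/da and dE/db for E = (1/2) sum (y - ax - b)^2.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def E(a, b):
    return 0.5 * np.sum((y - (a * x + b)) ** 2)

def grad(a, b):
    r = y - (a * x + b)                  # residuals
    return -np.sum(r * x), -np.sum(r)    # (dE/da, dE/db)

a, b, h = 0.3, -0.7, 1e-6
num_da = (E(a + h, b) - E(a - h, b)) / (2 * h)
num_db = (E(a, b + h) - E(a, b - h)) / (2 * h)
print(grad(a, b), (num_da, num_db))
```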
We determine the parameters successively by the gradient descent method, see for example . For $t = 0, 1, 2, \ldots$ we set

$$a_{t+1} = a_t - \epsilon\,\frac{\partial E}{\partial a}(a_t, b_t), \qquad b_{t+1} = b_t - \epsilon\,\frac{\partial E}{\partial b}(a_t, b_t), \qquad (10)$$

where $\epsilon > 0$ is small enough. The initial value $(a_0, b_0)$ is given appropriately. As will be shown shortly in Theorem I, its explicit value is not important.
Comment The parameter $\epsilon$ is called the learning rate, and it is very hard to choose $\epsilon$ properly, as emphasized in . In this paper we provide an estimate (see Theorem II).
Let us write down (10) explicitly:

$$a_{t+1} = \left(1 - \epsilon\sum x_i^2\right)a_t - \epsilon\sum x_i\, b_t + \epsilon\sum x_i y_i,$$
$$b_{t+1} = -\epsilon\sum x_i\, a_t + (1 - \epsilon n)\, b_t + \epsilon\sum y_i. \qquad (11)$$

These are cast in a vector-matrix form:

$$\begin{pmatrix} a_{t+1} \\ b_{t+1} \end{pmatrix} = \left\{E - \epsilon\begin{pmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{pmatrix}\right\}\begin{pmatrix} a_t \\ b_t \end{pmatrix} + \epsilon\begin{pmatrix} \sum x_i y_i \\ \sum y_i \end{pmatrix}.$$

For simplicity, by setting

$$A = \begin{pmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{pmatrix}, \qquad w_t = \begin{pmatrix} a_t \\ b_t \end{pmatrix}, \qquad c = \begin{pmatrix} \sum x_i y_i \\ \sum y_i \end{pmatrix},$$

we have a simple equation

$$w_{t+1} = (E - \epsilon A)\, w_t + \epsilon c, \qquad (12)$$

where $E$ is the unit matrix. Due to (8) the matrix $A$ is invertible ($\det A = n\sum x_i^2 - (\sum x_i)^2 > 0$); that is, we exclude the trivial and uninteresting case $x_1 = x_2 = \cdots = x_n$.
The solution is easy and given by

$$w_t = (E - \epsilon A)^t\left(w_0 - A^{-1}c\right) + A^{-1}c. \qquad (13)$$

Note Let us consider a simple difference equation

$$x_{t+1} = m\, x_t + c$$

for $t = 0, 1, 2, \ldots$ with $m \ne 1$. Then, the solution is given by

$$x_t = m^t x_0 + \left(1 + m + \cdots + m^{t-1}\right)c = m^t\left(x_0 - \frac{c}{1-m}\right) + \frac{c}{1-m}.$$

Equation (13) is the matrix analogue, with $m$ replaced by $E - \epsilon A$ and $\frac{c}{1-m}$ by $(\epsilon A)^{-1}\epsilon c = A^{-1}c$.
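The closed form (13) can be checked against the raw recursion (12) numerically. A sketch with toy data (the data, initial value and step count are illustrative assumptions):

```python
# Check that iterating (12) reproduces the closed form (13):
# w_t = (E - eps*A)^t (w_0 - A^{-1}c) + A^{-1}c.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.5, 4.1, 6.2])
A = np.array([[np.sum(x * x), np.sum(x)], [np.sum(x), len(x)]])
c = np.array([np.sum(x * y), np.sum(y)])
w_star = np.linalg.solve(A, c)       # the fixed point A^{-1} c

eps = 0.01
M = np.eye(2) - eps * A
w = np.array([1.0, 1.0])             # arbitrary initial value w_0
for t in range(200):
    w = M @ w + eps * c              # the recursion (12)

closed = np.linalg.matrix_power(M, 200) @ (np.array([1.0, 1.0]) - w_star) + w_star
print(w, closed)
```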
Comment The solution (13) gives

$$\lim_{t\to\infty} w_t = A^{-1}c \qquad (14)$$

provided $(E - \epsilon A)^t \to O$ as $t \to \infty$, where $O$ is a zero matrix. (14) is just the equation (5).
Let us evaluate (13) further. For this purpose we make some preparations from Linear Algebra . For simplicity we set

$$\alpha = \sum_{i=1}^{n} x_i^2, \qquad \beta = \sum_{i=1}^{n} x_i, \qquad A = \begin{pmatrix} \alpha & \beta \\ \beta & n \end{pmatrix}$$

and want to diagonalize $A$.

The characteristic polynomial of $A$ is

$$f(\lambda) = \lambda^2 - (\alpha + n)\lambda + \left(n\alpha - \beta^2\right) \qquad (15)$$

and the solutions of $f(\lambda) = 0$ are given by

$$\lambda_{\pm} = \frac{(\alpha + n) \pm \sqrt{(\alpha - n)^2 + 4\beta^2}}{2}. \qquad (16)$$

It is easy to see

$$\lambda_+ > \lambda_- > 0 \qquad (17)$$

from (8), since $\lambda_+\lambda_- = n\alpha - \beta^2 > 0$ and $\lambda_+ + \lambda_- = \alpha + n > 0$.
We set the two eigenvectors of matrix $A$, corresponding to $\lambda_+$ and $\lambda_-$, in a matrix form

$$u_{\pm} = \frac{1}{\sqrt{\beta^2 + (\lambda_{\pm} - \alpha)^2}}\begin{pmatrix} \beta \\ \lambda_{\pm} - \alpha \end{pmatrix}.$$

It is easy to see

$$\left(u_+, u_-\right) = 0, \qquad \|u_+\| = \|u_-\| = 1$$

from (16), and we also set

$$Q = \left(u_+ \;\; u_-\right). \qquad (18)$$

Then it is easy to see

$$Q^T Q = Q Q^T = E.$$

Namely, $Q$ is an orthogonal matrix. Then the diagonalization of $A$ becomes

$$A = Q\begin{pmatrix} \lambda_+ & 0 \\ 0 & \lambda_- \end{pmatrix}Q^T. \qquad (19)$$
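The diagonalization (19) is easy to check numerically for a matrix of the form $(\alpha\ \beta;\ \beta\ n)$. A sketch with illustrative values of $\alpha$, $\beta$, $n$, using numpy's symmetric eigensolver:

```python
# Check the eigenvalue formula (16) and the diagonalization (19).
import numpy as np

alpha, beta, n = 14.0, 6.0, 4.0      # alpha = sum x_i^2, beta = sum x_i
A = np.array([[alpha, beta], [beta, n]])

disc = np.sqrt((alpha - n) ** 2 + 4 * beta ** 2)
lam_plus = ((alpha + n) + disc) / 2  # eigenvalues from (16)
lam_minus = ((alpha + n) - disc) / 2

evals, Q = np.linalg.eigh(A)         # eigh returns eigenvalues in ascending order
print(evals, (lam_minus, lam_plus))
print(Q @ np.diag(evals) @ Q.T)      # reproduces A, as in (19)
```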
By substituting (19) into (13) and using

$$\left(Q D Q^T\right)^t = Q D^t Q^T$$

we finally obtain:

Theorem I A general solution to (12) is

$$w_t = Q\begin{pmatrix} (1 - \epsilon\lambda_+)^t & 0 \\ 0 & (1 - \epsilon\lambda_-)^t \end{pmatrix}Q^T\left(w_0 - A^{-1}c\right) + A^{-1}c. \qquad (20)$$

This is our main result.
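The closed form of Theorem I can be compared with the raw gradient descent iteration. A sketch on toy data, with $Q$ and the eigenvalues taken from numpy's symmetric eigensolver (data, $\epsilon$, $t$ and $w_0$ are illustrative assumptions):

```python
# Numerical check of the Theorem I closed form against the recursion (12).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.5, 4.1, 6.2])
A = np.array([[np.sum(x * x), np.sum(x)], [np.sum(x), len(x)]])
c = np.array([np.sum(x * y), np.sum(y)])
w_star = np.linalg.solve(A, c)       # A^{-1} c

evals, Q = np.linalg.eigh(A)         # A = Q diag(evals) Q^T
eps, t, w0 = 0.02, 50, np.array([2.0, -1.0])

# Closed form: w_t = Q diag((1 - eps*lambda)^t) Q^T (w_0 - A^{-1}c) + A^{-1}c
D_t = np.diag((1 - eps * evals) ** t)
w_closed = Q @ D_t @ Q.T @ (w0 - w_star) + w_star

w = w0.copy()
for _ in range(t):                   # the recursion (12)
    w = (np.eye(2) - eps * A) @ w + eps * c
print(w_closed, w)
```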
Lastly, let us show how to choose the learning rate $\epsilon$, which is a very important problem in Deep Learning. Let us remember

$$\lambda_+ > \lambda_- > 0$$

from (17). From Theorem I, the convergence $(E - \epsilon A)^t \to O$ requires

$$\left|1 - \epsilon\lambda_+\right| < 1 \quad \text{and} \quad \left|1 - \epsilon\lambda_-\right| < 1,$$

and these equations determine the range of $\epsilon$. Noting $0 < \epsilon\lambda_{\pm} < 2$, we obtain:

Theorem II The learning rate $\epsilon$ must satisfy the inequality

$$0 < \epsilon < \frac{2}{\lambda_+} = \frac{4}{(\alpha + n) + \sqrt{(\alpha - n)^2 + 4\beta^2}}. \qquad (21)$$

From (21), $\epsilon$ becomes very small when $n$ is large enough. It is easy to see that the second condition, $\epsilon < 2/\lambda_-$, is then automatically satisfied.
Under Theorem II we can recover (14):

$$\lim_{t\to\infty} w_t = A^{-1}c,$$

since (21) implies $|1 - \epsilon\lambda_{\pm}| < 1$ and hence $(1 - \epsilon\lambda_{\pm})^t \to 0$.

Comment For example, if we choose $\epsilon$ outside the range (21), say

$$\epsilon \ge \frac{2}{\lambda_+},$$

then we cannot recover (14), which shows a difference between Statistics and Deep Learning. Let us emphasize that the choice of the initial values is irrelevant when the convergence condition (21) is satisfied.
As a result, how to choose $\epsilon$ properly in Deep Learning becomes a very important problem when the number of data is huge. As far as we know, a result like Theorem II has not been obtained before.
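The bound (21) is easy to illustrate numerically: a learning rate just below $2/\lambda_+$ converges, while one just above it diverges. A sketch on toy data (the data and step counts are illustrative assumptions):

```python
# Illustration of the bound (21): eps < 2/lambda_+ converges, eps > 2/lambda_+ diverges.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.5, 4.1, 6.2])
A = np.array([[np.sum(x * x), np.sum(x)], [np.sum(x), len(x)]])
c = np.array([np.sum(x * y), np.sum(y)])
w_star = np.linalg.solve(A, c)       # the least squares solution A^{-1} c

lam_max = np.linalg.eigvalsh(A)[-1]  # lambda_+

def run(eps, steps):
    w = np.zeros(2)
    for _ in range(steps):
        w = (np.eye(2) - eps * A) @ w + eps * c
    return w

w_good = run(0.9 * (2 / lam_max), 5000)   # inside the bound: converges to A^{-1}c
w_bad = run(1.1 * (2 / lam_max), 200)     # outside the bound: blows up
print(w_good, w_star, np.linalg.norm(w_bad))
```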
3. A Generalization
In this section we present the outline of a simple generalization of the results in Section 2. The actual calculation is left as a problem (exercise) for readers.
For n pieces of three-dimensional real data

$$\{(x_i, y_i, z_i) \mid i = 1, 2, \ldots, n\}$$

we assume that their scatter plot looks like Figure 3. Then a model function is linear:

$$z = ax + by + c,$$

and the error (or loss) function is defined by

$$E(a, b, c) = \sum_{i=1}^{n}\left(z_i - (a x_i + b y_i + c)\right)^2. \qquad (22)$$
Figure 3. Scatter plot 2.
Figure 4. Simple neuron model 2.
The aim of the least squares method is to minimize the error function (22) with respect to $(a, b, c)$.
As we want to treat the least squares method above from the viewpoint of Deep Learning, we again arrange the data as input data $(x_i, y_i)$ with corresponding teacher data $z_i$, and consider another simple neuron model (see Figure 4) whose output for input $(x, y)$ is $z = ax + by + c$.
Then we present
Problem Carry out the calculation corresponding to that of Section 2.
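As a starting point for the exercise, the minimization of (22) reduces to a $3\times 3$ linear system analogous to (4). A sketch (with synthetic data, checked against numpy's least squares solver):

```python
# Least squares for the model z = a*x + b*y + c via the normal equations.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = rng.normal(size=20)
z = 2.0 * x - 1.0 * y + 0.5 + 0.01 * rng.normal(size=20)

# Design matrix for the model z = a*x + b*y + c
X = np.column_stack([x, y, np.ones_like(x)])
# Normal equations: (X^T X) p = X^T z, the 3x3 analogue of (4)
p = np.linalg.solve(X.T @ X, X.T @ z)
p_ref, *_ = np.linalg.lstsq(X, z, rcond=None)
print(p)
```

Writing out $X^T X$ and $X^T z$ in terms of the sums $\sum x_i^2$, $\sum x_i y_i$, etc. gives the explicit $3\times 3$ system asked for in the problem.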
4. Concluding Remarks
In this paper we discussed the least squares method from the viewpoint of Deep Learning and carried out the calculation of the gradient descent thoroughly. The difference in method between Statistics and Deep Learning delivers different results when the learning rate $\epsilon$ is changed. The result of Theorem II is, as far as we know, the first of its kind.
Deep Learning plays an essential role in Data Science and perhaps in almost all fields of Science. Therefore it is desirable for undergraduates to master it as soon as possible. To master it they must study Calculus, Linear Algebra and Statistics from Mathematics. However, we do not know a good and compact textbook leading to Deep Learning.
I am planning to write a comprehensive textbook in the near future  .
We wish to thank Ryu Sasaki for useful suggestions and comments.