Mathematical Reinforcement to the Minibatch of Deep Learning
Author(s) Kazuyuki Fujii*
ABSTRACT
We elucidate a practical method in Deep Learning called the minibatch, which is very useful for avoiding local minima. The mathematical structure of this method is, however, a bit obscure. We emphasize that a certain condition, which is not explicitly stated in ordinary expositions, is essential for the minibatch method. We present a comprehensive description of Deep Learning for non-experts, with the mathematical reinforcement.

1. Introduction

Deep Learning is one of the Machine Learning methods and it has attracted much attention recently. Learning in general is divided into supervised learning and unsupervised learning. In this paper we discuss only supervised learning. For a general introduction to Deep Learning, see for example  . The monographs    are standard textbooks.

*Dedicated to the memory of Professor Ichiro Yokota.

Deep Learning is based on big data, so supervised learning places a heavy load on the computer. In order to alleviate the burden we usually use a practical method called the minibatch (a collection of randomly selected small data from the big data; see Figure 5). Although the method is commonly used, it is not widely understood why it is so effective from the mathematical viewpoint. We point out in Theorem II that a certain condition, which is practically satisfied in ordinary applications, is essential for the effectiveness of the minibatch method.

In this paper we present a short and concise review of Deep Learning for non-experts and provide a mathematical reinforcement to the minibatch from the viewpoint of Linear Algebra. Theorems I and II in the text are our main results, and some related problems are presented.

After reading this paper, readers are encouraged to tackle a remarkable paper  which is definitely a monumental achievement.

2. Simple Neural Network

As a general introduction to Neuron Dynamics see for example   .

For non-experts let us draw a neuron model based on the step function $S\left(x\right)$ . In Figure 1 the set $\left\{{x}_{1},{x}_{2},\cdots ,{x}_{n}\right\}$ is the input data, $z=S\left(y\right)$ is the output data and the set $\left\{{w}_{1},{w}_{2},\cdots ,{w}_{n}\right\}$ is the set of weights of the synaptic connections. The parameter $\theta$ is the threshold of the neuron.

Here, $z=S\left(x\right)$ is a simple step function defined by

$S\left(x\right)=\left\{\begin{array}{l}0\text{ }x<0\hfill \\ 1\text{ }\text{ }x\ge 0\hfill \end{array}$

Since the backpropagation algorithm in a Neural Network System uses derivatives of the activation functions, the simple step function $S\left(x\right)$ , which is not differentiable at the origin, is not suitable. Therefore, we usually use a function called the sigmoid function instead of the step function (see Figure 2). This function is used for a nonlinear compression of data.

Figure 1. Simple neuron model.

Figure 2. Sigmoid function.

Definition The sigmoid function is defined by

$\sigma \left(x\right)=\frac{1}{1+{\text{e}}^{-x}}\text{ }\left(x\in R\right).$ (1)

It is easy to see that the function satisfies

$\sigma \left(-\infty \right)=0,\text{\hspace{0.17em}}\sigma \left(0\right)=\frac{1}{2},\text{\hspace{0.17em}}\sigma \left(\infty \right)=1$

and its derivatives are expressed as polynomials of the original function

$\begin{array}{l}{\sigma }^{\prime }\left(x\right)=\sigma \left(x\right)\left(1-\sigma \left(x\right)\right),\\ {\sigma }^{″}\left(x\right)=\sigma \left(x\right)\left(1-\sigma \left(x\right)\right)\left(1-2\sigma \left(x\right)\right).\end{array}$ (2)

The graph is as follows.

Comment In Mathematics this is a famous function called the standard logistic function, while in Deep Learning this is called the sigmoid function.

A general logistic function $f\left(x\right)$ has three parameters $\left\{L,\lambda ,{x}_{0}\right\}$ : the overall height L, the parameter $\lambda$ specifying the inverse width of the transient region, and the location of the center ${x}_{0}$ :

$f\left(x\right)=\frac{L}{1+{\text{e}}^{-\lambda \left(x-{x}_{0}\right)}}.$

The standard logistic function is characterized by $L=1,\lambda =1,{x}_{0}=0$ . See a lecture note  as to why the function is so important.
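As a quick numerical check of (1) and (2), the following Python sketch (the function names are ours) compares the closed-form derivative $\sigma'=\sigma(1-\sigma)$ with a central finite difference:

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic (sigmoid) function of Equation (1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x: float) -> float:
    """First derivative via the identity (2): sigma' = sigma * (1 - sigma)."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare the identity with a central finite difference at a few points.
h = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
    assert abs(numeric - sigmoid_prime(x)) < 1e-8

print(sigmoid(0.0))  # 0.5, consistent with sigma(0) = 1/2
```

Note that ${\sigma }^{\prime }$ attains its maximum $1/4$ at the origin, so the factors $\sigma \left(1-\sigma \right)$ appearing in the gradient formulas below never exceed $1/4$.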

Let us draw a neuron model based on the sigmoid function; see Figure 3.

The most important thing is to improve the set of weights of the synaptic connections $\left\{{w}_{1},{w}_{2},\cdots ,{w}_{n}\right\}$ by hard learning. For the purpose we prepare m input data and teacher signals. In the following we assume $m<n$ and $\theta =0$ for simplicity. This causes no problem for the present explanation. There is a suitable value of $\theta$ for each specific application.

For $j=1,2,\cdots ,m$ we set

$\text{Input}\text{\hspace{0.17em}}\text{data}:{x}^{\left(j\right)}={\left({x}_{1}^{\left(j\right)},{x}_{2}^{\left(j\right)},\cdots ,{x}_{n}^{\left(j\right)}\right)}^{\text{T}},$

$\text{Teacher}\text{\hspace{0.17em}}\text{signal}:{d}^{\left(j\right)}$

where T denotes the transpose of a vector (or a matrix).

Figure 3. Simple neuron model modified.

Our method is to determine the weights of the synaptic connections which match the output data with the given teacher signals as much as possible. For the purpose we consider the squared error

$E=E\left(w\right)=\frac{1}{2}\underset{j=1}{\overset{m}{\sum }}{\left({d}^{\left(j\right)}-{z}^{\left(j\right)}\right)}^{2},$ (3)

${z}^{\left(j\right)}=\sigma \left({y}^{\left(j\right)}\right)=\sigma \left(\underset{k=1}{\overset{n}{\sum }}\text{ }{w}_{k}{x}_{k}^{\left(j\right)}\right)$ (4)

and determine $w={\left({w}_{1},{w}_{2},\cdots ,{w}_{n}\right)}^{\text{T}}$ so that $E\left(w\right)$ is minimized.

We want to realize this purpose by repeated learning. For that we consider a discrete dynamical system. Namely, we set

$E=E\left(w\left(t\right)\right)=\frac{1}{2}\underset{j=1}{\overset{m}{\sum }}{\left({d}^{\left(j\right)}-{z}^{\left(j\right)}\left(t\right)\right)}^{2};\text{\hspace{0.17em}}{z}^{\left(j\right)}\left(t\right)=\sigma \left({y}^{\left(j\right)}\left(t\right)\right)=\sigma \left(\underset{k=1}{\overset{n}{\sum }}\text{ }{w}_{k}\left(t\right){x}_{k}^{\left(j\right)}\right)$ (5)

for $t=0,1,2,\cdots ,T\text{\hspace{0.17em}}\left(\in N\right)$ .

The initial value $w\left(0\right)$ is given appropriately (not so important).

2.1. Gradient Descent

For a general introduction to Gradient Descent see for example  .

The gradient descent is performed by changing ${w}_{k}\left(t\right)$ to

${w}_{k}\left(t+1\right)={w}_{k}\left(t\right)-ϵ\frac{\partial E\left(w\left(t\right)\right)}{\partial {w}_{k}\left(t\right)}\text{ }\left(0<ϵ<1\right)$ (6)

and we evaluate the error with the renewed weights

$E=E\left(w\left(t+1\right)\right).$

This renewal of the weights is the learning process, which is very popular in Mathematics1. We repeat this procedure for T steps, where T is large enough, like

$E\left(w\left(0\right)\right)⇒E\left(w\left(1\right)\right)⇒\cdots ⇒E\left(w\left(T\right)\right)\equiv E\left(w\right).$

Let us proceed. We set $E=E\left(w\right)$ for simplicity (t is omitted). From (5) a simple calculation gives

$\begin{array}{c}\frac{\partial E}{\partial {w}_{k}}=\underset{j=1}{\overset{m}{\sum }}\left({d}^{\left(j\right)}-{z}^{\left(j\right)}\right)\left(-\frac{\partial {z}^{\left(j\right)}}{\partial {w}_{k}}\right)\\ =-\underset{j=1}{\overset{m}{\sum }}\left({d}^{\left(j\right)}-{z}^{\left(j\right)}\right)\sigma \left(\underset{k=1}{\overset{n}{\sum }}\text{ }{w}_{k}{x}_{k}^{\left(j\right)}\right)\left\{1-\sigma \left(\underset{k=1}{\overset{n}{\sum }}\text{ }{w}_{k}{x}_{k}^{\left(j\right)}\right)\right\}{x}_{k}^{\left(j\right)}\\ =-\underset{j=1}{\overset{m}{\sum }}\left({d}^{\left(j\right)}-{z}^{\left(j\right)}\right){x}_{k}^{\left(j\right)}\sigma \left(\underset{k=1}{\overset{n}{\sum }}\text{ }{w}_{k}{x}_{k}^{\left(j\right)}\right)\left\{1-\sigma \left(\underset{k=1}{\overset{n}{\sum }}\text{ }{w}_{k}{x}_{k}^{\left(j\right)}\right)\right\}\end{array}$

and from (6) we obtain

$\begin{array}{c}{w}_{k}\left(t+1\right)={w}_{k}\left(t\right)-ϵ\frac{\partial E\left(w\left(t\right)\right)}{\partial {w}_{k}\left(t\right)}\\ ={w}_{k}\left(t\right)+ϵ\underset{j=1}{\overset{m}{\sum }}\left({d}^{\left(j\right)}-{z}^{\left(j\right)}\left(t\right)\right){x}_{k}^{\left(j\right)}\sigma \left(\underset{k=1}{\overset{n}{\sum }}\text{ }{w}_{k}\left(t\right){x}_{k}^{\left(j\right)}\right)\left\{1-\sigma \left(\underset{k=1}{\overset{n}{\sum }}\text{ }{w}_{k}\left(t\right){x}_{k}^{\left(j\right)}\right)\right\}.\end{array}$

By substituting this into (5) we obtain

${z}^{\left(j\right)}\left(t+1\right)=\sigma \left(\underset{k=1}{\overset{n}{\sum }}\text{ }{w}_{k}\left(t+1\right){x}_{k}^{\left(j\right)}\right)$

and the square error becomes

$E=E\left(w\left(t+1\right)\right)=\frac{1}{2}\underset{j=1}{\overset{m}{\sum }}{\left({d}^{\left(j\right)}-{z}^{\left(j\right)}\left(t+1\right)\right)}^{2},$

which is complicated enough. The renewal of the weights terminates when all $\partial E\left(w\right)/\partial {w}_{k}=0$ , that is, at a critical point as defined in the next subsection.
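To make the renewal rule (6) concrete, here is a small NumPy sketch of the whole learning loop for the single neuron. The sizes n and m, the learning rate, the iteration count and the random data are our illustrative choices, with $\theta =0$ as in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

n, m = 8, 3                           # n inputs, m < n training pairs (our choice)
X = rng.normal(size=(m, n))           # rows are the input data x^{(j)}
d = rng.uniform(0.2, 0.8, size=m)     # teacher signals d^{(j)}
w = np.zeros(n)                       # initial weights w(0)
eps = 0.5                             # learning rate epsilon in (6)

for t in range(5000):
    z = sigmoid(X @ w)                # outputs z^{(j)} as in (4)
    # dE/dw_k = -sum_j (d^{(j)} - z^{(j)}) sigma'(y^{(j)}) x_k^{(j)}
    grad = -((d - z) * z * (1.0 - z)) @ X
    w = w - eps * grad                # the renewal (6)

E = 0.5 * np.sum((d - sigmoid(X @ w)) ** 2)
print(E)  # close to 0, as Theorem I suggests for independent data
```

Because the m rows of X are almost surely linearly independent (see the next subsection), gradient descent is driven toward the unique global minimum $E=0$.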

2.2. Number of Critical Points

Here we evaluate the number of critical points to be encountered in the process. First of all, let us show the definition of a critical point with a simple example.

Definition Let $f=f\left(x,y,z\right)$ be a smooth function. Then a critical (or stationary) point $\left({x}_{0},{y}_{0},{z}_{0}\right)$ is defined by

$\left\{\begin{array}{l}{f}_{x}\left({x}_{0},{y}_{0},{z}_{0}\right)=0,\hfill \\ {f}_{y}\left({x}_{0},{y}_{0},{z}_{0}\right)=0,\hfill \\ {f}_{z}\left({x}_{0},{y}_{0},{z}_{0}\right)=0.\hfill \end{array}$ (7)

A critical point is not necessarily a point giving local minimum or local maximum.

We make an important assumption with respect to the input data.

Assumption (good data) For $m<n$ , the m input data

$\left\{{x}^{\left(1\right)},{x}^{\left(2\right)},\cdots ,{x}^{\left(m\right)}\right\}$ (8)

are linearly independent.

The assumption implies that there exists a set of integers ${k}_{1}<{k}_{2}<\cdots <{k}_{m}$ satisfying

$|\begin{array}{cccc}{x}_{{k}_{1}}^{\left(1\right)}& {x}_{{k}_{1}}^{\left(2\right)}& \cdots & {x}_{{k}_{1}}^{\left(m\right)}\\ {x}_{{k}_{2}}^{\left(1\right)}& {x}_{{k}_{2}}^{\left(2\right)}& \cdots & {x}_{{k}_{2}}^{\left(m\right)}\\ ⋮& ⋮& \ddots & ⋮\\ {x}_{{k}_{m}}^{\left(1\right)}& {x}_{{k}_{m}}^{\left(2\right)}& \cdots & {x}_{{k}_{m}}^{\left(m\right)}\end{array}|\ne 0.$ (9)

Let us recall

$E=\frac{1}{2}\underset{j=1}{\overset{m}{\sum }}{\left({d}^{\left(j\right)}-{z}^{\left(j\right)}\right)}^{2}=\frac{1}{2}\underset{j=1}{\overset{m}{\sum }}{\left({d}^{\left(j\right)}-\sigma \left({y}^{\left(j\right)}\right)\right)}^{2},\text{ }\sigma \left({y}^{\left(j\right)}\right)=\sigma \left(\underset{k=1}{\overset{n}{\sum }}\text{ }{w}_{k}{x}_{k}^{\left(j\right)}\right)$

and try to find whether E has a critical point. Suppose that E has a critical point at the weights $w={\left({w}_{1},\cdots ,{w}_{n}\right)}^{\text{T}}$ :

$0=\frac{\partial E}{\partial {w}_{k}}=-\underset{j=1}{\overset{m}{\sum }}\left({d}^{\left(j\right)}-{z}^{\left(j\right)}\right){\sigma }^{\prime }\left({y}^{\left(j\right)}\right){x}_{k}^{\left(j\right)},\text{ }k=1,\cdots ,n.$

Let us set

${A}^{\left(j\right)}=\left({d}^{\left(j\right)}-{z}^{\left(j\right)}\right){\sigma }^{\prime }\left({y}^{\left(j\right)}\right)$

for simplicity. The equation can be written as

$\underset{j=1}{\overset{m}{\sum }}\text{ }{A}^{\left(j\right)}{x}_{k}^{\left(j\right)}=0\text{ }\text{ }\text{for}\text{ }\text{\hspace{0.17em}}k=1,\cdots ,n$

and it implies

$\underset{j=1}{\overset{m}{\sum }}\text{ }{x}_{k}^{\left(j\right)}{A}^{\left(j\right)}=0\text{ }\text{ }\text{for}\text{ }\text{\hspace{0.17em}}k={k}_{1},{k}_{2},\cdots ,{k}_{m}$

or in matrix form

$\left(\begin{array}{cccc}{x}_{{k}_{1}}^{\left(1\right)}& {x}_{{k}_{1}}^{\left(2\right)}& \cdots & {x}_{{k}_{1}}^{\left(m\right)}\\ {x}_{{k}_{2}}^{\left(1\right)}& {x}_{{k}_{2}}^{\left(2\right)}& \cdots & {x}_{{k}_{2}}^{\left(m\right)}\\ ⋮& ⋮& \ddots & ⋮\\ {x}_{{k}_{m}}^{\left(1\right)}& {x}_{{k}_{m}}^{\left(2\right)}& \cdots & {x}_{{k}_{m}}^{\left(m\right)}\end{array}\right)\left(\begin{array}{c}{A}^{\left(1\right)}\\ {A}^{\left(2\right)}\\ ⋮\\ {A}^{\left(m\right)}\end{array}\right)=\left(\begin{array}{c}0\\ 0\\ ⋮\\ 0\end{array}\right).$

Since the determinant of coefficient matrix is non-zero from (9) we have

$\left(\begin{array}{c}{A}^{\left(1\right)}\\ {A}^{\left(2\right)}\\ ⋮\\ {A}^{\left(m\right)}\end{array}\right)=\left(\begin{array}{c}\left({d}^{\left(1\right)}-{z}^{\left(1\right)}\right){\sigma }^{\prime }\left({y}^{\left(1\right)}\right)\\ \left({d}^{\left(2\right)}-{z}^{\left(2\right)}\right){\sigma }^{\prime }\left({y}^{\left(2\right)}\right)\\ ⋮\\ \left({d}^{\left(m\right)}-{z}^{\left(m\right)}\right){\sigma }^{\prime }\left({y}^{\left(m\right)}\right)\end{array}\right)=\left(\begin{array}{c}0\\ 0\\ ⋮\\ 0\end{array}\right)$

and

${d}^{\left(j\right)}-{z}^{\left(j\right)}={d}^{\left(j\right)}-\sigma \left({y}^{\left(j\right)}\right)=0\text{ }\text{ }\text{for}\text{ }\text{\hspace{0.17em}}j=1,2,\cdots ,m$

because ${\sigma }^{\prime }\left(y\right)\ne 0$ . This means

$E=\frac{1}{2}\underset{j=1}{\overset{m}{\sum }}{\left({d}^{\left(j\right)}-{z}^{\left(j\right)}\right)}^{2}=0.$

The result shows that there is no critical point in this process except for $E=0$ . Therefore, when modifying ${w}_{k}\left(t\right)$ by the gradient descent (6) successively, there is no danger of being trapped by points taking local minima. Let us summarize the result:

Theorem I Under the assumption (8) there is only one global minimum $E=0$ .

Next, let us explain how to choose linearly independent data. We assume that n is huge and m is sufficiently smaller than n.

For m input data

$\left\{{x}^{\left(1\right)},{x}^{\left(2\right)},\cdots ,{x}^{\left(m\right)}\right\}$

we consider the $m×m$ Gram’s determinant

$G\left({x}^{\left(1\right)},{x}^{\left(2\right)},\cdots ,{x}^{\left(m\right)}\right)=|\begin{array}{cccc}〈{x}^{\left(1\right)}|{x}^{\left(1\right)}〉& 〈{x}^{\left(1\right)}|{x}^{\left(2\right)}〉& \cdots & 〈{x}^{\left(1\right)}|{x}^{\left(m\right)}〉\\ 〈{x}^{\left(2\right)}|{x}^{\left(1\right)}〉& 〈{x}^{\left(2\right)}|{x}^{\left(2\right)}〉& \cdots & 〈{x}^{\left(2\right)}|{x}^{\left(m\right)}〉\\ ⋮& ⋮& \ddots & ⋮\\ 〈{x}^{\left(m\right)}|{x}^{\left(1\right)}〉& 〈{x}^{\left(m\right)}|{x}^{\left(2\right)}〉& \cdots & 〈{x}^{\left(m\right)}|{x}^{\left(m\right)}〉\end{array}|.$ (10)

If m is not large, evaluation of the Gram’s determinant is not so difficult. It is well-known that the input data are linearly independent if the Gram’s determinant is non-zero (see for example  ).

Let us give a short explanation for non-experts. For simplicity we consider the case of $m=3$ and set

${x}^{\left(1\right)}=a,\text{\hspace{0.17em}}{x}^{\left(2\right)}=b,\text{\hspace{0.17em}}{x}^{\left(3\right)}=c.$

Then

$G\left(a,b,c\right)=|\begin{array}{ccc}〈a|a〉& 〈a|b〉& 〈a|c〉\\ 〈b|a〉& 〈b|b〉& 〈b|c〉\\ 〈c|a〉& 〈c|b〉& 〈c|c〉\end{array}|$

and we assume that $\left\{a,b,c\right\}$ are linearly dependent. For example, we can set

$c=\alpha a+\beta b\text{ }\left(\alpha ,\beta \text{\hspace{0.17em}}\text{ }\text{is}\text{\hspace{0.17em}}\text{real}\right).$

Short calculation after substituting this into the Gram's determinant above gives

$G\left(a,b,c\right)=|\begin{array}{ccc}〈a|a〉& 〈a|b〉& 〈a|c〉\\ 〈b|a〉& 〈b|b〉& 〈b|c〉\\ \alpha 〈a|a〉+\beta 〈b|a〉& \alpha 〈a|b〉+\beta 〈b|b〉& \alpha 〈a|c〉+\beta 〈b|c〉\end{array}|$

and some fundamental operations of determinant give

$G\left(a,b,c\right)=|\begin{array}{ccc}〈a|a〉& 〈a|b〉& 〈a|c〉\\ 〈b|a〉& 〈b|b〉& 〈b|c〉\\ 0& 0& 0\end{array}|=0.$

Namely, we have shown

$\text{ }\left\{a,b,c\right\}\text{\hspace{0.17em}}\text{is}\text{\hspace{0.17em}}\text{linearly}\text{\hspace{0.17em}}\text{dependent}\text{ }\text{\hspace{0.17em}}⇒\text{\hspace{0.17em}}G\left(a,b,c\right)=0.$

Taking the contraposition gives

$G\left(a,b,c\right)\ne 0⇒\left\{a,b,c\right\}\text{\hspace{0.17em}}\text{is}\text{\hspace{0.17em}}\text{linearly}\text{\hspace{0.17em}}\text{independent}\text{ }.$

Comment For $n>m$ we denote by $M\left(n,m;R\right)$ the set of all $n×m$ matrices over R and denote

$\begin{array}{l}{M}_{n,m}\left(R\right)=\left\{\left({x}^{\left(1\right)},{x}^{\left(2\right)},\cdots ,{x}^{\left(m\right)}\right)\in M\left(n,m;R\right)|\\ \text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\left\{{x}^{\left(1\right)},{x}^{\left(2\right)},\cdots ,{x}^{\left(m\right)}\right\}\text{\hspace{0.17em}}\text{ }\text{is}\text{\hspace{0.17em}}\text{linearly}\text{\hspace{0.17em}}\text{independent}\right\}.\end{array}$

It is well-known that the subset ${M}_{n,m}\left(R\right)$ is open and dense in $M\left(n,m;R\right)$ . Therefore, if we choose m vectors from ${R}^{n}$ randomly, they are almost surely linearly independent.
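The Gram-determinant test (10) is straightforward to run numerically. A minimal sketch, with vectors of our own choosing ($m=3$, $n=5$):

```python
import numpy as np

def gram_determinant(vectors):
    """Gram's determinant (10): det of the matrix of pairwise inner products."""
    V = np.asarray(vectors, dtype=float)  # rows are the m vectors in R^n
    return float(np.linalg.det(V @ V.T))

# m = 3 vectors in R^5 (illustrative)
a = np.array([1.0, 0.0, 2.0, 0.0, 1.0])
b = np.array([0.0, 1.0, 1.0, 3.0, 0.0])
c_dep = 2.0 * a - b                       # linearly dependent on a and b
c_ind = np.array([0.0, 0.0, 0.0, 0.0, 1.0])

print(gram_determinant([a, b, c_dep]))    # ~0: the vectors are dependent
print(gram_determinant([a, b, c_ind]))    # nonzero: the vectors are independent
```

The first determinant vanishes (up to rounding) exactly as in the $c=\alpha a+\beta b$ computation above.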

3. Deep Neural Network

Deep Learning is the heart of Artificial Intelligence (AI) and we provide a concise review for non-experts. Deep Learning uses a deep neural network consisting of the input layer (K components), the (multiplex) intermediate layers ( ${M}_{1},{M}_{2},\cdots$ components) and the output layer (N components). See Figure 4 (in which ${M}_{1}=L,{M}_{2}=M,{M}_{3}=N$ ).

3.1. Backpropagation

First, let us show a figure of a deep neural network2 (the figure shows only a few intermediate layers).

Let us explain some notations.

$x={\left({x}_{1},{x}_{2},\cdots ,{x}_{K}\right)}^{\text{T}}$

is the input data. In this case, the weights of synaptic connections are represented by matrices $U,V,W$ given by

$\begin{array}{l}U=\left({u}_{ij}\right)\in M\left(L,K;R\right),\\ V=\left({v}_{ij}\right)\in M\left(M,L;R\right),\\ W=\left({w}_{ij}\right)\in M\left(N,M;R\right)\end{array}$ (11)

and our aim is to train (educate) these matrices.

The synaptic connections are expressed as

${r}_{i}=\underset{j=1}{\overset{M}{\sum }}\text{ }{w}_{ij}{\beta }_{j}-\theta ,\text{ }i=1,\cdots ,N$

${z}_{i}=\sigma \left({r}_{i}\right)$

and

${q}_{i}=\underset{j=1}{\overset{L}{\sum }}\text{ }{v}_{ij}{\alpha }_{j}-\theta ,\text{ }i=1,\cdots ,M$

${\beta }_{i}=\sigma \left({q}_{i}\right)$

and

${p}_{i}=\underset{j=1}{\overset{K}{\sum }}\text{ }{u}_{ij}{x}_{j}-\theta ,\text{ }i=1,\cdots ,L$

${\alpha }_{i}=\sigma \left({p}_{i}\right)$

respectively. See Figure 4 once more.

A comment is in order. The flow of data is as follows:

$x=\left(\begin{array}{c}{x}_{1}\\ {x}_{2}\\ ⋮\\ {x}_{K}\end{array}\right)\text{\hspace{0.17em}}⇒\alpha =\left(\begin{array}{c}{\alpha }_{1}\\ {\alpha }_{2}\\ ⋮\\ {\alpha }_{L}\end{array}\right)\text{\hspace{0.17em}}⇒\beta =\left(\begin{array}{c}{\beta }_{1}\\ {\beta }_{2}\\ ⋮\\ {\beta }_{M}\end{array}\right)\text{\hspace{0.17em}}⇒z=\left(\begin{array}{c}{z}_{1}\\ {z}_{2}\\ ⋮\\ {z}_{N}\end{array}\right).$ (12)

Now, if

$d={\left({d}_{1},{d}_{2},\cdots ,{d}_{N}\right)}^{\text{T}}$

is a set of teacher signals, then the square error becomes

Figure 4. Deep neural network model.

$E=E\left(U,V,W\right)=\frac{1}{2}\underset{k=1}{\overset{N}{\sum }}{\left({d}_{k}-{z}_{k}\right)}^{2}.$

Let us consider (discrete) time evolution of the model. For $t=0,1,2,\cdots$ , we must determine

$t\text{\hspace{0.17em}}\to \text{\hspace{0.17em}}t+1$

$U\left(t\right)\to U\left(t+1\right),$

$V\left(t\right)\to V\left(t+1\right),$

$W\left(t\right)\to W\left(t+1\right),$

$E\left(t\right)\to E\left(t+1\right)$

and we expect

$E\left(t\right)\approx 0$

if t is large enough (the end of learning!).

Assume now that the system is in the t-th step

$E\left(t\right)=E\left(U\left(t\right),V\left(t\right),W\left(t\right)\right)=\frac{1}{2}\underset{k=1}{\overset{N}{\sum }}{\left({d}_{k}-{z}_{k}\left(t\right)\right)}^{2}.$ (13)

Then the time evolution of the weights of synaptic connections is determined by the gradient descent:

$\begin{array}{l}{w}_{ij}\left(t+1\right)={w}_{ij}\left(t\right)-ϵ\frac{\partial E\left(t\right)}{\partial {w}_{ij}\left(t\right)},\\ {v}_{ij}\left(t+1\right)={v}_{ij}\left(t\right)-ϵ\frac{\partial E\left(t\right)}{\partial {v}_{ij}\left(t\right)},\\ {u}_{ij}\left(t+1\right)={u}_{ij}\left(t\right)-ϵ\frac{\partial E\left(t\right)}{\partial {u}_{ij}\left(t\right)},\end{array}$ (14)

which is a natural generalization of that of Section 2.1.

Before the calculation let us make a comment on the derivatives of a composite function.

Comment Let us explain with a simple example. For a composite function

$x\to f\left(x\right)\to g\left(f\left(x\right)\right)\to E\left(g\left(f\left(x\right)\right)\right)$

the derivative is given by

$\frac{\text{d}E}{\text{d}x}=\frac{\text{d}E\left(g\left(f\left(x\right)\right)\right)}{\text{d}g\left(f\left(x\right)\right)}\frac{\text{d}g\left(f\left(x\right)\right)}{\text{d}f\left(x\right)}\frac{\text{d}f\left(x\right)}{\text{d}x}={E}^{\prime }\left(g\left(f\left(x\right)\right)\right){g}^{\prime }\left(f\left(x\right)\right){f}^{\prime }\left(x\right).$

A composite function is in general complicated, while its derivative is not so complicated.
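The comment can be verified numerically with concrete, and entirely illustrative, choices of f, g and E:

```python
import math

f = lambda x: x * x      # innermost function (our illustrative choice)
g = math.sin
E = math.exp             # outermost function

def dE_dx(x):
    # Chain rule: E'(g(f(x))) * g'(f(x)) * f'(x)
    return math.exp(math.sin(x * x)) * math.cos(x * x) * (2.0 * x)

# Compare against a central finite difference.
x, h = 0.7, 1e-6
numeric = (E(g(f(x + h))) - E(g(f(x - h)))) / (2.0 * h)
assert abs(numeric - dE_dx(x)) < 1e-6
```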

3.2. Calculation

Let us perform the calculation of (14).

1) Case of $W\left(t\right)=\left({w}_{ij}\left(t\right)\right)$ : From

$\begin{array}{c}\frac{\partial E\left(t\right)}{\partial {w}_{ij}\left(t\right)}=\frac{\partial E\left(t\right)}{\partial {z}_{i}\left(t\right)}\frac{\partial {z}_{i}\left(t\right)}{\partial {r}_{i}\left(t\right)}\frac{\partial {r}_{i}\left(t\right)}{\partial {w}_{ij}\left(t\right)}\\ =-\left({d}_{i}-{z}_{i}\left(t\right)\right){\sigma }^{\prime }\left({r}_{i}\left(t\right)\right){\beta }_{j}\left(t\right)\\ =-\left({d}_{i}-{z}_{i}\left(t\right)\right)\sigma \left({r}_{i}\left(t\right)\right)\left(1-\sigma \left({r}_{i}\left(t\right)\right)\right){\beta }_{j}\left(t\right)\end{array}$

we obtain

${w}_{ij}\left(t+1\right)={w}_{ij}\left(t\right)+ϵ\left({d}_{i}-{z}_{i}\left(t\right)\right)\sigma \left({r}_{i}\left(t\right)\right)\left(1-\sigma \left({r}_{i}\left(t\right)\right)\right){\beta }_{j}\left(t\right).$

2) Case of $V\left(t\right)=\left({v}_{ij}\left(t\right)\right)$ : From

$\begin{array}{c}\frac{\partial E\left(t\right)}{\partial {v}_{ij}\left(t\right)}=\underset{k=1}{\overset{N}{\sum }}\frac{\partial E\left(t\right)}{\partial {z}_{k}\left(t\right)}\frac{\partial {z}_{k}\left(t\right)}{\partial {r}_{k}\left(t\right)}\frac{\partial {r}_{k}\left(t\right)}{\partial {\beta }_{i}\left(t\right)}\frac{\partial {\beta }_{i}\left(t\right)}{\partial {q}_{i}\left(t\right)}\frac{\partial {q}_{i}\left(t\right)}{\partial {v}_{ij}\left(t\right)}\\ =-\underset{k=1}{\overset{N}{\sum }}\left({d}_{k}-{z}_{k}\left(t\right)\right){\sigma }^{\prime }\left({r}_{k}\left(t\right)\right){w}_{ki}\left(t\right){\sigma }^{\prime }\left({q}_{i}\left(t\right)\right){\alpha }_{j}\left(t\right)\\ =-\underset{k=1}{\overset{N}{\sum }}\left({d}_{k}-{z}_{k}\left(t\right)\right)\sigma \left({r}_{k}\left(t\right)\right)\left(1-\sigma \left({r}_{k}\left(t\right)\right)\right)\\ \text{\hspace{0.17em}}\text{\hspace{0.17em}}×{w}_{ki}\left(t\right)\sigma \left({q}_{i}\left(t\right)\right)\left(1-\sigma \left({q}_{i}\left(t\right)\right)\right){\alpha }_{j}\left(t\right)\end{array}$

we have

$\begin{array}{c}{v}_{ij}\left(t+1\right)={v}_{ij}\left(t\right)+ϵ\underset{k=1}{\overset{N}{\sum }}\left({d}_{k}-{z}_{k}\left(t\right)\right)\sigma \left({r}_{k}\left(t\right)\right)\left(1-\sigma \left({r}_{k}\left(t\right)\right)\right)\\ \text{\hspace{0.17em}}\text{\hspace{0.17em}}×{w}_{ki}\left(t\right)\sigma \left({q}_{i}\left(t\right)\right)\left(1-\sigma \left({q}_{i}\left(t\right)\right)\right){\alpha }_{j}\left(t\right).\end{array}$

3) Case of $U\left(t\right)=\left({u}_{ij}\left(t\right)\right)$ : From

$\begin{array}{c}\frac{\partial E\left(t\right)}{\partial {u}_{ij}\left(t\right)}=\underset{l=1}{\overset{N}{\sum }}\underset{k=1}{\overset{M}{\sum }}\frac{\partial E\left(t\right)}{\partial {z}_{l}\left(t\right)}\frac{\partial {z}_{l}\left(t\right)}{\partial {r}_{l}\left(t\right)}\frac{\partial {r}_{l}\left(t\right)}{\partial {\beta }_{k}\left(t\right)}\frac{\partial {\beta }_{k}\left(t\right)}{\partial {q}_{k}\left(t\right)}\frac{\partial {q}_{k}\left(t\right)}{\partial {\alpha }_{i}\left(t\right)}\frac{\partial {\alpha }_{i}\left(t\right)}{\partial {p}_{i}\left(t\right)}\frac{\partial {p}_{i}\left(t\right)}{\partial {u}_{ij}\left(t\right)}\\ =-\underset{l=1}{\overset{N}{\sum }}\underset{k=1}{\overset{M}{\sum }}\left({d}_{l}-{z}_{l}\left(t\right)\right){\sigma }^{\prime }\left({r}_{l}\left(t\right)\right){w}_{lk}\left(t\right){\sigma }^{\prime }\left({q}_{k}\left(t\right)\right){v}_{ki}\left(t\right){\sigma }^{\prime }\left({p}_{i}\left(t\right)\right){x}_{j}\end{array}$

we have

$\begin{array}{l}{u}_{ij}\left(t+1\right)\\ ={u}_{ij}\left(t\right)+ϵ\underset{l=1}{\overset{N}{\sum }}\underset{k=1}{\overset{M}{\sum }}\left({d}_{l}-{z}_{l}\left(t\right)\right)\sigma \left({r}_{l}\left(t\right)\right)\left(1-\sigma \left({r}_{l}\left(t\right)\right)\right){w}_{lk}\left(t\right)\\ ×\sigma \left({q}_{k}\left(t\right)\right)\left(1-\sigma \left({q}_{k}\left(t\right)\right)\right){v}_{ki}\left(t\right)\sigma \left({p}_{i}\left(t\right)\right)\left(1-\sigma \left({p}_{i}\left(t\right)\right)\right){x}_{j}.\end{array}$

The calculation so far is based on a single input datum, for simplicity. However, we must treat a lot of data simultaneously. Let us denote by a the number of data to be considered.

For $l=1,2,\cdots ,a$ we set

$\text{Input}\text{\hspace{0.17em}}\text{Data}\text{ }\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}{x}^{\left(l\right)}={\left({x}_{1}^{\left(l\right)},{x}_{2}^{\left(l\right)},\cdots ,{x}_{K}^{\left(l\right)}\right)}^{\text{T}},$

$\text{Teacher}\text{\hspace{0.17em}}\text{Signal}\text{ }\text{\hspace{0.17em}}\text{\hspace{0.17em}}{d}^{\left(l\right)}={\left({d}_{1}^{\left(l\right)},{d}_{2}^{\left(l\right)},\cdots ,{d}_{N}^{\left(l\right)}\right)}^{\text{T}},$

$\text{Output}\text{\hspace{0.17em}}\text{Data}\text{ }\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}{z}^{\left(l\right)}={\left({z}_{1}^{\left(l\right)},{z}_{2}^{\left(l\right)},\cdots ,{z}_{N}^{\left(l\right)}\right)}^{\text{T}}.$

By taking all these into consideration the square error function in (13) changes to

$\stackrel{^}{E}\left(t\right)=\underset{l=1}{\overset{a}{\sum }}\text{ }{E}^{\left(l\right)}\left(t\right)=\underset{l=1}{\overset{a}{\sum }}\left\{\frac{1}{2}\underset{k=1}{\overset{N}{\sum }}{\left({d}_{k}^{\left(l\right)}-{z}_{k}^{\left(l\right)}\left(t\right)\right)}^{2}\right\}$ (15)

at the t-th step.

In this case the calculation is very simple. For example, since

${w}_{ij}\left(t+1\right)={w}_{ij}\left(t\right)-ϵ\frac{\partial \stackrel{^}{E}\left(t\right)}{\partial {w}_{ij}\left(t\right)}={w}_{ij}\left(t\right)-ϵ\underset{l=1}{\overset{a}{\sum }}\frac{\partial {E}^{\left(l\right)}\left(t\right)}{\partial {w}_{ij}\left(t\right)}$

it is only changed to

${w}_{ij}\left(t+1\right)={w}_{ij}\left(t\right)+ϵ\underset{l=1}{\overset{a}{\sum }}\left({d}_{i}^{\left(l\right)}-{z}_{i}^{\left(l\right)}\left(t\right)\right)\sigma \left({r}_{i}^{\left(l\right)}\left(t\right)\right)\left(1-\sigma \left({r}_{i}^{\left(l\right)}\left(t\right)\right)\right){\beta }_{j}^{\left(l\right)}\left(t\right).$

The calculation for both ${v}_{ij}\left(t\right)$ and ${u}_{ij}\left(t\right)$ is quite similar (omitted).
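Putting the three update rules of (14) together, the whole learning loop of the deep network can be sketched in NumPy. The layer sizes K, L, M, N, the number of data a, the learning rate, the iteration count and the random data are all our illustrative assumptions, with $\theta =0$ as in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

K, L, M, N, a = 6, 5, 4, 3, 2                 # layer sizes and data count (ours)
U = rng.normal(scale=0.5, size=(L, K))        # weight matrices as in (11)
V = rng.normal(scale=0.5, size=(M, L))
W = rng.normal(scale=0.5, size=(N, M))
X = rng.normal(size=(a, K))                   # rows are the input data x^{(l)}
D = rng.uniform(0.2, 0.8, size=(a, N))        # teacher signals d^{(l)}
eps = 0.5

for t in range(20000):
    # Forward flow (12): x -> alpha -> beta -> z, with theta = 0
    alpha = sigmoid(X @ U.T)
    beta = sigmoid(alpha @ V.T)
    z = sigmoid(beta @ W.T)
    # Backward pass: each "delta" collects the (d - z) sigma'(.) factors of (14)
    dz = (D - z) * z * (1.0 - z)                  # at the output layer
    dbeta = (dz @ W) * beta * (1.0 - beta)        # back through W
    dalpha = (dbeta @ V) * alpha * (1.0 - alpha)  # back through V
    # Gradient-descent updates (14), summed over the a data
    W += eps * dz.T @ beta
    V += eps * dbeta.T @ alpha
    U += eps * dalpha.T @ X

E = 0.5 * np.sum((D - sigmoid(sigmoid(sigmoid(X @ U.T) @ V.T) @ W.T)) ** 2)
print(E)  # close to 0: the end of learning
```

Here $a<M$ and the rows of beta are almost surely linearly independent, so the assumption (16) below is generically satisfied in this sketch.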

In the following we make a mysterious assumption.

Assumption We assume that the data in the final step of the intermediate layers are linearly independent:

$1)\ a<M,\quad 2)\ \left\{{\beta }^{\left(1\right)}\left(t\right),{\beta }^{\left(2\right)}\left(t\right),\cdots ,{\beta }^{\left(a\right)}\left(t\right)\right\}\ \text{is linearly independent}.$ (16)

See Figure 4 and the flow of data (12).

We can show that there is no critical point in our system except for $\stackrel{^}{E}=0$ . Therefore, when modifying ${u}_{ij}\left(t\right),{v}_{ij}\left(t\right),{w}_{ij}\left(t\right)$ by the gradient descent (14) (with E replaced by $\stackrel{^}{E}$ ) successively, there is no danger of being trapped by points taking local minima.

Proof: The proof is essentially the same as that of Section 2.2. However, the proof is the heart of the paper, so we repeat it.

A critical point is defined by the equations

$\left\{\begin{array}{l}\frac{\partial \stackrel{^}{E}\left(t\right)}{\partial {w}_{ij}\left(t\right)}=0,\hfill \\ \frac{\partial \stackrel{^}{E}\left(t\right)}{\partial {v}_{ij}\left(t\right)}=0,\hfill \\ \frac{\partial \stackrel{^}{E}\left(t\right)}{\partial {u}_{ij}\left(t\right)}=0.\hfill \end{array}$

From (15) we have, defining ${A}_{i}^{\left(l\right)}$ in the last line,

$\begin{array}{c}0=\frac{\partial \stackrel{^}{E}\left(t\right)}{\partial {w}_{ij}\left(t\right)}=-\underset{l=1}{\overset{a}{\sum }}\left({d}_{i}^{\left(l\right)}-{z}_{i}^{\left(l\right)}\left(t\right)\right){\sigma }^{\prime }\left({r}_{i}^{\left(l\right)}\left(t\right)\right){\beta }_{j}^{\left(l\right)}\left(t\right)\\ =\underset{l=1}{\overset{a}{\sum }}\text{ }{\beta }_{j}^{\left(l\right)}\left(t\right)\left({d}_{i}^{\left(l\right)}-{z}_{i}^{\left(l\right)}\left(t\right)\right){\sigma }^{\prime }\left({r}_{i}^{\left(l\right)}\left(t\right)\right)\text{\hspace{0.17em}}\left(-\text{\hspace{0.17em}}\text{ }\text{is}\text{\hspace{0.17em}}\text{omitted}\right)\text{ }\\ \equiv \underset{l=1}{\overset{a}{\sum }}\text{ }{\beta }_{j}^{\left(l\right)}\left(t\right){A}_{i}^{\left(l\right)}\text{ }\left(j=1,2,\cdots ,M\right)\end{array}$

and it implies

$0=\underset{l=1}{\overset{a}{\sum }}\text{ }{\beta }_{j}^{\left(l\right)}\left(t\right){A}_{i}^{\left(l\right)}\text{ }\left(j={j}_{1}<{j}_{2}<\cdots <{j}_{a}\right).$

Therefore, its matrix form becomes

$\left(\begin{array}{cccc}{\beta }_{{j}_{1}}^{\left(1\right)}& {\beta }_{{j}_{1}}^{\left(2\right)}& \cdots & {\beta }_{{j}_{1}}^{\left(a\right)}\\ {\beta }_{{j}_{2}}^{\left(1\right)}& {\beta }_{{j}_{2}}^{\left(2\right)}& \cdots & {\beta }_{{j}_{2}}^{\left(a\right)}\\ ⋮& ⋮& \ddots & ⋮\\ {\beta }_{{j}_{a}}^{\left(1\right)}& {\beta }_{{j}_{a}}^{\left(2\right)}& \cdots & {\beta }_{{j}_{a}}^{\left(a\right)}\end{array}\right)\left(\begin{array}{c}{A}_{i}^{\left(1\right)}\\ {A}_{i}^{\left(2\right)}\\ ⋮\\ {A}_{i}^{\left(a\right)}\end{array}\right)=\left(\begin{array}{c}0\\ 0\\ ⋮\\ 0\end{array}\right).$

Since the determinant of coefficient matrix is non-zero from (16) we have

$\left(\begin{array}{c}{A}_{i}^{\left(1\right)}\\ {A}_{i}^{\left(2\right)}\\ ⋮\\ {A}_{i}^{\left(a\right)}\end{array}\right)=\left(\begin{array}{c}0\\ 0\\ ⋮\\ 0\end{array}\right)$

and

${A}_{i}^{\left(l\right)}\equiv \left({d}_{i}^{\left(l\right)}-{z}_{i}^{\left(l\right)}\left(t\right)\right){\sigma }^{\prime }\left({r}_{i}^{\left(l\right)}\left(t\right)\right)=0\text{ }\left(l=1,2,\cdots ,a\right)$

and ${\sigma }^{\prime }\left({r}_{i}^{\left(l\right)}\left(t\right)\right)\ne 0$ gives

${d}_{i}^{\left(l\right)}-{z}_{i}^{\left(l\right)}\left(t\right)=0\text{ }\left(l=1,2,\cdots ,a\right).$

Moreover, these equations hold for every $i=1,2,\cdots ,N$ , so that we finally obtain

$\stackrel{^}{E}\left(t\right)=\underset{l=1}{\overset{a}{\sum }}\text{ }{E}^{\left(l\right)}\left(t\right)=\underset{l=1}{\overset{a}{\sum }}\left\{\frac{1}{2}\underset{k=1}{\overset{N}{\sum }}{\left({d}_{k}^{\left(l\right)}-{z}_{k}^{\left(l\right)}\left(t\right)\right)}^{2}\right\}=0.$

Let us summarize the result:

Theorem II Under the assumption (16) there is only one global minimum $\stackrel{^}{E}=0$ .

Let us emphasize that the assumption (16) is “local” because it refers to the linear independence of the data of the layer one step before the last of the flow:

$x=\left(\begin{array}{c}{x}_{1}\\ {x}_{2}\\ ⋮\\ {x}_{K}\end{array}\right)⇒\alpha =\left(\begin{array}{c}{\alpha }_{1}\\ {\alpha }_{2}\\ ⋮\\ {\alpha }_{L}\end{array}\right)⇒\beta =\left(\begin{array}{c}{\beta }_{1}\\ {\beta }_{2}\\ ⋮\\ {\beta }_{M}\end{array}\right)⇒z=\left(\begin{array}{c}{z}_{1}\\ {z}_{2}\\ ⋮\\ {z}_{N}\end{array}\right).$

A question arises naturally.

Problem I What is the meaning of the assumption in the total flow of data ?

Last in this section let us comment on the minibatch of Deep Learning. When the number A of input data is huge, the calculation of the time evolution of the weights of synaptic connections will give a heavy load to the computer. In order to alleviate it the minibatch method is practical and very useful; see Figure 5.

Figure 5. Minibatch.

Then a question also arises naturally.

Problem II Is there a relation between A and T?

One may conjecture

$T=\left[A/a\right]$

where [k] is the Gauss symbol, which is the greatest integer less than or equal to k. For example, [3.14] = 3. However, we do not believe this is correct.
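For what it is worth, the conjecture is trivial to state in code; the dataset size A and minibatch size a below are our illustrative numbers only:

```python
A = 60000   # total number of input data (illustrative)
a = 32      # minibatch size (illustrative)

# Conjectured number of learning steps: T = [A/a]; for positive integers
# the Gauss symbol coincides with floor division.
T = A // a
print(T)  # 1875
```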

4. Concluding Remarks

Nowadays, studying Deep Learning is standard for university students around the world. However, it is not so easy for them because Deep Learning requires a lot of background knowledge.

In this paper, we treated the minibatch of Deep Learning and gave the mathematical reinforcement to it from the viewpoint of Linear Algebra and presented some related problems. We expect that young researchers will solve the problems in the near future.

As a related topic of Artificial Intelligence, there is Information Retrieval. Since there is no more space, we only refer to some relevant works    .

Acknowledgements

We wish to thank Ryu Sasaki for useful suggestions and comments.

NOTES

1However, to choose $ϵ$ in (6) correctly is a very hard problem.

2Figure 4 is of course not perfect because drawing a figure of a big size is not easy.

Cite this paper
Fujii, K. (2018) Mathematical Reinforcement to the Minibatch of Deep Learning. Advances in Pure Mathematics, 8, 307-320. doi: 10.4236/apm.2018.83016.
References
   Wikipedia. Deep Learning.
https://en.wikipedia.org/wiki/Deep_learning

   Patterson, J. and Gibson, A. (2017) Deep Learning: A Practitioner’s Approach. O’Reilly Media, Inc., Sebastopol, CA.

   Okaya, T. (2015) Deep Learning. Kodansha Ltd., Tokyo. (In Japanese)

   Mizoguchi, R. and Ishida, R. (2000) Artificial Intelligence. Ohmsha Ltd., Tokyo. (In Japanese)

   Le, Q.V., et al. (2012) Building High-Level Features Using Large Scale Unsupervised Learning. Proceedings of the 29th International Conference on Machine Learning, Edinburgh, June 27th-July 3rd, 2012, 11.

   Aihara, K. and Kanzaki, R. (2008) Theoretical and Engineering Approaches to Brainscience. University of Tokyo Press, Tokyo. (In Japanese)

   Amari, S. (2016) Brain·Heart·Artificial Intelligence. Blue Backs B-1968. Kodansha Ltd., Tokyo. (In Japanese)

   Fujii, K. (2014) Verhulst Model of Population Growth. Lecture Note. Yokohama City University, Yokohama. (In Japanese)

   Wikipedia. Gradient Descent.