Completeness Problem of the Deep Neural Networks

1. Introduction

Neural networks and deep learning currently provide the best solutions to many supervised learning problems. In 2006, a publication by Hinton, Osindero, and Teh [1] introduced the idea of a “deep” neural network: first train a simple supervised model, then add a new layer on top and train the parameters of that new layer alone. Layers are added and trained in this fashion until the network is deep. Later work removed the restriction of training one layer at a time [2] [3] [4] [5] .

After Hinton’s initial approach of training one layer at a time, Deep Neural Networks now train all layers together. Software frameworks include TensorFlow [6] , Torch [7] , and Theano [8] . Google’s TensorFlow is an open-source software library for dataflow programming across a range of tasks; it is a symbolic math library that is also used for machine learning applications such as neural networks [3] , both in research and in production at Google. Torch is an open-source machine learning library and a scientific computing framework. Theano is a numerical computation library for Python. Training multiple layers jointly gives neural networks an advantage over many other learning algorithms.

One question is the existence of a solution for a given problem. This is often followed by the development of an effective solution, i.e. an algorithm; then by a study of the stability of the algorithm; and then by a study of the efficiency of solutions. Although these theoretical steps are not necessary for the empirical development of practical algorithms, theoretical studies advance the understanding of the problems and prompt the development of new and better practical algorithms. Along the direction of solution existence, Hornik, Stinchcombe, & White [9] have shown that multilayer feedforward networks with sufficiently many hidden units are universal approximators. Roux & Bengio [10] have shown a similar result: Restricted Boltzmann machines are universal approximators of discrete distributions.

Hornik, Stinchcombe, & White [9] establish that the standard multilayer feedforward networks with hidden layers using arbitrary squashing functions are capable of approximating any measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available. In this sense, multilayer feedforward networks are a class of universal approximators.

Deep Belief Networks (DBN) are generative neural network models with many layers of hidden explanatory factors, introduced by Hinton, Osindero, and Teh [1] along with a greedy layer-wise unsupervised learning algorithm. The building block of a DBN is a probabilistic model called a Restricted Boltzmann machine (RBM), used to represent one layer of the model. Restricted Boltzmann machines are interesting because inference is easy in them and because they have been successfully used as building blocks for training deeper models. Roux & Bengio [10] proved that adding hidden units yields strictly improved modeling power, and that RBMs are universal approximators of discrete distributions.

In this paper, we provide yet another proof. The advantage of this proof is that it will lead to several new learning algorithms. We once again prove that Deep Neural Networks are universal approximators. In our approach, Deep Neural Networks implement an expansion and this expansion is complete.

In this paper, a Deep Neural Network (DNN) is an Artificial Neural Network (ANN) with multiple hidden layers between the input and output layers. The organization of this paper is as follows.

In Section 2, we briefly review how to study the completeness problem of Deep Neural Networks (DNN). In this approach, given an input set A, an output set B, and a mapping from A to B, one can convert this problem to a probability distribution [3] [4] over (A, B): p(a, b), $a\in A$ , $b\in B$ . If an input $a\in A$ is mapped to an output $b\in B$ , then the probability p(a, b) will be close to 1. One can find a Markov chain [11] such that the equilibrium distribution of this Markov chain, p(a, b), realizes, as faithfully as possible, the given supervised training set.

In Section 3, the Boltzmann machines [3] [4] are briefly reviewed. All possible distributions together form a distribution space. All of the distributions, implemented by Boltzmann machines, define a Boltzmann Distribution Space, which is a subset of the distribution space [12] [13] [14] . Given an unknown function, one can find a Boltzmann machine such that the equilibrium distribution of this Boltzmann machine realizes, as faithfully as possible, the unknown function.

In Section 4, we review the ABM (Attrasoft Boltzmann Machine) [15] , which has an invariant distribution. An ABM is defined by two features: 1) an ABM with n neurons has neural connections up to the n^{th} order; and 2) all of the connections up to the n^{th} order are determined by the ABM algorithm [15] . By adding more terms to the invariant distribution than the second order Boltzmann Machine has, an ABM is significantly more powerful in simulating an unknown function. Unlike Boltzmann Machines, ABMs emphasize higher order connections rather than lower order connections. Later, we will discuss the relationships between the higher order connections and DNN.

In Section 5, we review θ-transformation [12] [13] [14] .

In Section 6, we review the completeness of the θ-transformation [12] [13] [14] . The θ-transformation is complete; i.e. given a function, one can find a θ-transformation to convert it from the x-coordinate system to the θ-coordinate system.

In Section 7, we discuss how the invariant distribution of an ABM implements a θ-transformation [12] [13] [14] , i.e. given an unknown function, one can find an ABM such that the equilibrium distribution of this ABM realizes precisely the unknown function. Therefore, an ABM is complete.

The next two sections contain the new contributions of this paper. In Section 8, we show that we can reduce an ABM to a DNN, i.e. we show that a higher order ANN can be replaced by a lower order ANN by increasing layers. We do not seek an efficient conversion from a higher order ANN to a lower order ANN with more layers; we merely prove that this is possible.

In Section 9, we prove that the DNN is complete, i.e. given an unknown function, one can find a Deep Neural Network that can simulate the unknown function.

2. Basic Approach for Completeness Problem

The goal of this paper is to prove that given any unknown function from A to B, one can find a DNN that can simulate this unknown function. It turns out to be very helpful to convert this discrete problem into a continuous one. In this section, we introduce the basic idea of how to study the completeness problem.

The basic supervised learning [2] problem is: given a training set {A, B}, where $A=\left\{{a}_{1},{a}_{2},\cdots \right\}$ and $B=\left\{{b}_{1},{b}_{2},\cdots \right\}$ , find a mapping from A to B. The first step is to convert this problem to a probability [3] [4] :

$p=p\left(a,b\right),a\in A,b\in B.$

If a_{1} matches b_{1}, the probability is 1 or close to 1. If a_{1} does not match b_{1}, the probability is 0 or close to 0. This reduces the problem of inferring a mapping from A to B to that of inferring a distribution function.
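As a concrete illustration, the conversion of a training set into a joint distribution can be sketched in a few lines of Python; the inputs, outputs, and mapping below are invented for the example:

```python
# A small, purely illustrative supervised training set; the names A, B, and
# `mapping` are invented for this sketch.
import numpy as np

A = ["a1", "a2", "a3"]                          # hypothetical inputs
B = ["b1", "b2", "b3"]                          # hypothetical outputs
mapping = {"a1": "b1", "a2": "b2", "a3": "b3"}  # the mapping to be learned

# Matching pairs get probability ~1, non-matching pairs a small epsilon > 0
# (a tiny positive value avoids log singularities later on).
eps = 1e-6
p = np.full((len(A), len(B)), eps)
for i, a in enumerate(A):
    p[i, B.index(mapping[a])] = 1.0
p /= p.sum()                                    # normalize to a distribution

print(p.argmax(axis=1))                         # → [0 1 2]: mass sits on matching pairs
```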

An irreducible finite Markov chain possesses a stationary distribution [11] . This invariant distribution can be used to simulate an unknown function. It is the invariant distribution of the Markov Chain which eventually allows us to prove that the DNN is complete.
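A minimal numerical sketch of this fact, using an invented 3-state transition matrix: repeatedly applying the transition matrix to any initial distribution converges to the stationary distribution.

```python
# Power iteration on an invented 3-state transition matrix (rows sum to 1):
# whatever distribution the chain starts from, it converges to the
# stationary distribution pi with pi P = pi.
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

pi = np.array([1.0, 0.0, 0.0])   # arbitrary initial distribution
for _ in range(200):
    pi = pi @ P                  # one step of the chain

assert np.allclose(pi @ P, pi)   # pi is (numerically) invariant
print(pi)
```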

3. Boltzmann Machine

A Boltzmann machine [3] [4] is a stochastic neural network in which each neuron has a certain probability to be 1. The probability of a neuron to be 1 is determined by the so called Boltzmann distribution. The collection of the neuron states:

$x=\left({x}_{1},{x}_{2},\cdots ,{x}_{n}\right)$

of a Boltzmann machine is called a configuration. The configuration transitions are mathematically described by a Markov chain over the 2^{n} configurations $x\in X$ , where X is the set of all points $\left({x}_{1},{x}_{2},\cdots ,{x}_{n}\right)$ . When all of the configurations are connected, they form an irreducible Markov chain, which has an invariant distribution [11] . Whatever initial configuration a Boltzmann machine starts from, the probability distribution converges over time to the invariant distribution, p(x). A configuration $x\in X$ appears with a relative frequency p(x) over a long period of time.

The Boltzmann machine [3] [4] defines a Markov chain: each configuration of the Boltzmann machine is a state of the Markov chain, and the chain has a stationary distribution. An unknown function can be considered as the stationary distribution of some Boltzmann machine. Given an unknown distribution, a Boltzmann machine can be inferred such that its invariant distribution realizes, as faithfully as possible, the given function. Therefore, an unknown function is transformed into a specification of a Boltzmann machine.

More formally, let F be the set of all functions. Let T be the parameter space of a family of Boltzmann machines. Given an unknown $f\in F$ , one can find a Boltzmann machine such that the equilibrium distribution of this Boltzmann machine realizes, as faithfully as possible, the unknown function [3] [4] . Therefore, the unknown, f, is encoded into a specification of a Boltzmann machine, $t\in T$ . We call the mapping from F to T, a Boltzmann Machine Transformation: $F\to T$ [12] [13] [14] .

Let T be the parameter space of a family of Boltzmann machines, and let F_{T} be the set of all functions that can be inferred by the Boltzmann Machines over T; obviously, F_{T} is a subset of F. It turns out that F_{T} is significantly smaller than F and is not a good approximation of F. The main contribution of the Boltzmann Machine is to establish a framework for inferring a mapping from A to B.

4. Attrasoft Boltzmann Machines (ABM)

The invariant distribution of a Boltzmann machine [3] [4] is:

$p\left(x\right)=b{\text{e}}^{{\displaystyle {\sum}_{i<j}{M}_{ij}{x}_{i}{x}_{j}}}$ (1)

If the threshold vector does not vanish, the distributions are:

$p\left(x\right)=b{\text{e}}^{{\displaystyle {\sum}_{i<j}{M}_{ij}{x}_{i}{x}_{j}}-{\displaystyle \sum {T}_{i}{x}_{i}}}.$ (2)

By rearranging the above distribution, we have:

$p\left(x\right)={\text{e}}^{c-{\displaystyle \sum {T}_{i}{x}_{i}}+{\displaystyle {\sum}_{i<j}{M}_{ij}{x}_{i}{x}_{j}}}$ (3)

It turns out that the third order Boltzmann machines have the following type of distributions:

$p\left(x\right)={\text{e}}^{c-{\displaystyle \sum {T}_{i}{x}_{i}}+{\displaystyle {\sum}_{i<j}{M}_{ij}{x}_{i}{x}_{j}}+{\displaystyle {\sum}_{i<j<k}{M}_{ijk}{x}_{i}{x}_{j}{x}_{k}}}$ (4)

An ABM [12] [13] [14] is an extension of the higher order Boltzmann Machine to the maximum order. An ABM with n neurons has neural connections up to the n^{th} order. All of the connections up to the n^{th} order are determined by the ABM algorithm [15] . By adding additional higher order terms to the invariant distribution, ABM is significantly more powerful in simulating an unknown function.

By adding additional terms, the invariant distribution for an ABM is,

$p\left(x\right)={\text{e}}^{H}$ ,

$H={\theta}_{0}+\sum {\theta}_{1}^{{i}_{1}}{x}_{{i}_{1}}+\sum {\theta}_{2}^{{i}_{1}{i}_{2}}{x}_{{i}_{1}}{x}_{{i}_{2}}+\sum {\theta}_{3}^{{i}_{1}{i}_{2}{i}_{3}}{x}_{{i}_{1}}{x}_{{i}_{2}}{x}_{{i}_{3}}+\cdots +{\theta}_{n}^{12\cdots n}{x}_{1}{x}_{2}\cdots {x}_{n}$ (5)

ABM is significantly more powerful in simulating an unknown function. As more and more terms are added, from the second order terms to the n^{th} order terms, the invariant distribution space becomes larger and larger. Like the Boltzmann Machines of the last section, an ABM implements a transformation, $F\to T$ . Our ultimate goal is to show that this ABM transformation is complete, so that given any function $f\in F$ , we can find an ABM, $t\in T$ , such that the equilibrium distribution of this ABM realizes precisely the unknown function. We show that this is exactly the case.

5. θ-Transformation

5.1. Basic Notations

We first introduce some notations used in this paper [12] [13] [14] . There are two different types of coordinate systems: the x-coordinate system and the θ-coordinate system [12] [13] [14] . Each of these two coordinate systems has two representations, x-representation and θ-representation. An N-dimensional vector, p, is:

$p=\left({p}_{0},{p}_{1},\cdots ,{p}_{N-1}\right)$ ,

which is the x-representation of p in the x-coordinate system.

In the x-coordinate system, there are two representations of a vector:

・ $\left\{{p}_{i}\right\}$ in the x-representation and

・ $\left\{{p}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}\right\}$ in the θ-representation.

In the θ-coordinate system, there are two representations of a vector:

・ $\left\{{\theta}_{i}\right\}$ in the x-representation and

・ $\left\{{\theta}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}\right\}$ in the θ-representation.

The reason for two different representations is that the x-representation is natural for the x-coordinate system, and the θ-representation is natural for the θ-coordinate system.

The transformations between $\left\{{p}_{i}\right\}$ and $\left\{{p}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}\right\}$ , and those between $\left\{{\theta}_{i}\right\}$ and $\left\{{\theta}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}\right\}$ , are similar. Because of this similarity, in the following only the transformation between $\left\{{p}_{i}\right\}$ and $\left\{{p}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}\right\}$ will be introduced. Let n be the number of neurons and N = 2^{n} the number of configurations. An N-dimensional vector, p, is:

$p=\left({p}_{0},{p}_{1},\cdots ,{p}_{N-1}\right).$

Consider p_{x}. Because
$x\in \left\{0,1,\cdots ,N-1={2}^{n}-1\right\}$ is a position inside the distribution, x can be rewritten in binary form:

$x={x}_{n}{2}^{n-1}+\cdots +{x}_{2}{2}^{1}+{x}_{1}{2}^{0}.$

Some of the coefficients x_{i} might be zero. Dropping those coefficients which are zero, we write:

$x={x}_{{i}_{1}}{x}_{{i}_{2}}\cdots {x}_{{i}_{m}}={2}^{{i}_{m}-1}+\cdots +{2}^{{i}_{2}-1}+{2}^{{i}_{1}-1}.$

This generates the following transformation:

${p}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}={p}_{x}={p}_{{2}^{{i}_{m}-1}+\cdots +{2}^{{i}_{2}-1}+{2}^{{i}_{1}-1}}$

where

$1\le {i}_{1}<{i}_{2}<\cdots <{i}_{m}\le n.$

In this θ-representation, a vector p looks like:

$\left\{{p}_{0},{p}_{1}^{1},{p}_{1}^{2},{p}_{1}^{3},\cdots ,{p}_{2}^{12},{p}_{2}^{13},{p}_{2}^{23},\cdots ,{p}_{3}^{123},\cdots \right\}$ (6)

The 0-th order term is ${p}_{0}$ ; the first order terms are ${p}_{1}^{1},{p}_{1}^{2},{p}_{1}^{3},\cdots $ ; and so on. The first few terms in the transformation between $\left\{{p}_{i}\right\}$ and $\left\{{p}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}\right\}$ are:

$\begin{array}{l}{p}_{0}={p}_{0},{p}_{1}^{1}={p}_{1},{p}_{1}^{2}={p}_{2},\\ {p}_{2}^{12}={p}_{3},{p}_{1}^{3}={p}_{4},{p}_{2}^{13}={p}_{5},\\ {p}_{2}^{23}={p}_{6},{p}_{3}^{123}={p}_{7},{p}_{1}^{4}={p}_{8},\cdots \end{array}$

The x-representation is the normal representation, and the θ-representation is a form of binary representation.

Example: Let n = 3, N = 2^{n} = 8, and consider an invariant distribution:

$\left\{{p}_{0},{p}_{1},{p}_{2},{p}_{3},{p}_{4},{p}_{5},{p}_{6},{p}_{7}\right\}$ ,

where p_{0} is the probability of state x = 0,
$\cdots $ . There are 8 probabilities for 8 different states,
$x=\left\{0,1,2,\cdots ,7\right\}$ . In the new representation, it looks like:

$\left\{{p}_{0},{p}_{1}^{1},{p}_{1}^{2},{p}_{1}^{3},{p}_{2}^{12},{p}_{2}^{13},{p}_{2}^{23},{p}_{3}^{123}\right\}$ .

Note that the relative positions of each probability are changed. The first vector, $\left\{{p}_{0},{p}_{1},{p}_{2},{p}_{3},{p}_{4},{p}_{5},{p}_{6},{p}_{7}\right\}$ , is in the x-representation and the second vector $\left\{{p}_{0},{p}_{1}^{1},{p}_{1}^{2},{p}_{1}^{3},{p}_{2}^{12},{p}_{2}^{13},{p}_{2}^{23},{p}_{3}^{123}\right\}$ is in the θ-representation. These two representations are two different expressions of the same vector.
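The re-indexing in this example can be sketched in Python; the helper `theta_label` is invented for this sketch and simply reads off the binary expansion $x={2}^{{i}_{m}-1}+\cdots +{2}^{{i}_{1}-1}$ :

```python
# Hypothetical helper: read off the set bits of the flat index x, giving the
# superscript labels i_1 < ... < i_m with x = 2^{i_m - 1} + ... + 2^{i_1 - 1}.
def theta_label(x):
    return [i + 1 for i in range(x.bit_length()) if (x >> i) & 1]

# n = 3: the eight flat indices and their theta-representation labels
for x in range(8):
    print(x, theta_label(x))
# e.g. x = 5 = 2^0 + 2^2 gives [1, 3], i.e. p_5 = p_2^{13},
# and  x = 7 gives [1, 2, 3], i.e. p_7 = p_3^{123}.
```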

5.2. θ-Transformation

The θ-transformation uses a function F, called a generating function. The function F is required to have the inverse:

$FG=GF=I,\text{\hspace{0.17em}}G={F}^{-1}.$ (7)

Let p be a vector in the x-coordinate system. As already discussed above, it can be written either as:

$p\left(x\right)=\left({p}_{0},{p}_{1},\cdots ,{p}_{N-1}\right)$ (8)

Or

$p\left(x\right)=\left({p}_{0};{p}_{1}^{1},\cdots ,{p}_{1}^{n};{p}_{2}^{12},\cdots ,{p}_{2}^{n-1,n};{p}_{3}^{123},\cdots ,{p}_{n}^{12\cdots n}\right).$ (9)

The θ-transformation transforms a vector from the x-coordinate to the θ-coordinate via a generating function. The components of the vector p in the x-coordinate, p(x), can be converted into components of a vector p(θ) in the θ-coordinate:

$p\left(\theta \right)=\left({\theta}_{0};{\theta}_{1}^{1},\cdots ,{\theta}_{1}^{n};{\theta}_{2}^{12},\cdots ,{\theta}_{2}^{n-1,n};{\theta}_{3}^{123},\cdots ,{\theta}_{n}^{12\cdots n}\right),$ (10)

Or

$p\left(\theta \right)=\left({\theta}_{0},{\theta}_{1},\cdots ,{\theta}_{N-1}\right).$ (11)

Let F be a generating function, which transforms the x-representation of p in the x-coordinate to a θ-representation of p in the θ-coordinate system. The θ-components are determined by the vector F[p(x)] as follows:

$F\left[p\left(x\right)\right]={\theta}_{0}+\sum {\theta}_{1}^{{i}_{1}}{x}_{{i}_{1}}+\sum {\theta}_{2}^{{i}_{1}{i}_{2}}{x}_{{i}_{1}}{x}_{{i}_{2}}+\sum {\theta}_{3}^{{i}_{1}{i}_{2}{i}_{3}}{x}_{{i}_{1}}{x}_{{i}_{2}}{x}_{{i}_{3}}+\cdots +{\theta}_{n}^{12\cdots n}{x}_{1}{x}_{2}\cdots {x}_{n}$ (12)

where

$1\le {i}_{1}<{i}_{2}<\cdots <{i}_{m}\le n.$

Prior to the transformation, p(x) is the x-representation of p in the x-coordinate; after transformation, F[p(x)] is a θ-representation of p in the θ-coordinate system.

There are N components in the x-coordinate and N components in the θ-coordinate. By introducing a new notation X:

$\begin{array}{l}{X}_{0}=1,\text{\hspace{0.17em}}{X}_{1}^{1}={X}_{1}={x}_{1},\text{\hspace{0.17em}}{X}_{1}^{2}={X}_{2}={x}_{2},\text{\hspace{0.17em}}{X}_{2}^{12}={X}_{3}={x}_{1}{x}_{2},\\ {X}_{1}^{3}={X}_{4}={x}_{3},\text{\hspace{0.17em}}{X}_{2}^{13}={X}_{5}={x}_{1}{x}_{3},\text{\hspace{0.17em}}{X}_{2}^{23}={X}_{6}={x}_{2}{x}_{3},\\ {X}_{3}^{123}={X}_{7}={x}_{1}{x}_{2}{x}_{3},\text{\hspace{0.17em}}{X}_{1}^{4}={X}_{8}={x}_{4},\cdots \end{array}$

then the vector can be written as:

$F\left[p\left(x\right)\right]=\sum {\theta}_{J}{X}_{J}$ (13)

By using the assumption GF = I, we have:

$p\left(x\right)=G\left\{\sum {\theta}_{J}{X}_{J}\right\}$ (14)

where J denotes the index in either of the two representations in the θ-coordinate system.

The transformation of a vector p from the x-representation, p(x), in the x-coordinate system to a θ-representation, p(θ), in the θ-coordinate system is called θ-transformation [12] [13] [14] .

The θ-transformation is determined by [12] [13] [14] :

$\begin{array}{c}{\theta}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}=F\left[{p}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}\right]+F\left[{p}_{m-2}^{{i}_{1}\cdots {i}_{m-2}}\right]+\cdots +F\left[{p}_{m-2}^{{i}_{3}\cdots {i}_{m}}\right]+F\left[{p}_{m-4}^{\cdots}\right]+\cdots \\ \text{\hspace{0.17em}}\text{\hspace{0.17em}}-F\left[{p}_{m-1}^{{i}_{1}\cdots {i}_{m-1}}\right]-\cdots -F\left[{p}_{m-1}^{{i}_{2}\cdots {i}_{m}}\right]-F\left[{p}_{m-3}^{{i}_{1}\cdots {i}_{m-3}}\right]-\cdots \end{array}$ (15)

The inverse of the θ-transformation [12] [13] [14] is:

${p}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}=G\left({\theta}_{0}+{\theta}_{1}^{{i}_{1}}+{\theta}_{1}^{{i}_{2}}+\cdots +{\theta}_{1}^{{i}_{m}}+{\theta}_{2}^{{i}_{1}{i}_{2}}+{\theta}_{2}^{{i}_{1}{i}_{3}}+\cdots +{\theta}_{2}^{{i}_{m-1}{i}_{m}}+\cdots +{\theta}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}\right).$ (16)

5.3. An Example

Let an ANN have 3 neurons:

$\left({x}_{1},{x}_{2},{x}_{3}\right)$

and let a distribution be:

$\left({p}_{0},{p}_{1},{p}_{2},{p}_{3},{p}_{4},{p}_{5},{p}_{6},{p}_{7}\right).$

Assume that the generating functions are:

$F\left(y\right)=\mathrm{log}\left(y\right),\text{\hspace{0.17em}}\text{\hspace{0.17em}}G\left(y\right)=\mathrm{exp}\left(y\right).$

By θ-transformation, the components are [12] [13] [14] :

${\theta}_{0}=\mathrm{log}{p}_{0},\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\theta}_{1}=\mathrm{log}\frac{{p}_{1}}{{p}_{0}},\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\theta}_{2}=\mathrm{log}\frac{{p}_{2}}{{p}_{0}},$

${\theta}_{3}=\mathrm{log}\frac{{p}_{3}{p}_{0}}{{p}_{1}{p}_{2}},\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\theta}_{4}=\mathrm{log}\frac{{p}_{4}}{{p}_{0}},\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\theta}_{5}=\mathrm{log}\frac{{p}_{5}{p}_{0}}{{p}_{1}{p}_{4}},\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\theta}_{6}=\mathrm{log}\frac{{p}_{6}{p}_{0}}{{p}_{2}{p}_{4}},$

${\theta}_{7}={\theta}_{3}^{123}=\mathrm{log}\frac{{p}_{3}^{123}{p}_{1}^{1}{p}_{1}^{2}{p}_{1}^{3}}{{p}_{2}^{12}{p}_{2}^{13}{p}_{2}^{23}{p}_{0}}=\mathrm{log}\frac{{p}_{7}{p}_{1}{p}_{2}{p}_{4}}{{p}_{3}{p}_{5}{p}_{6}{p}_{0}}$

The inverse transformations are:

${p}_{0}=\mathrm{exp}\left({\theta}_{0}\right),{p}_{1}=\mathrm{exp}\left({\theta}_{0}+{\theta}_{1}\right),{p}_{2}=\mathrm{exp}\left({\theta}_{0}+{\theta}_{2}\right),{p}_{3}=\mathrm{exp}\left({\theta}_{0}+{\theta}_{1}+{\theta}_{2}+{\theta}_{3}\right),$

${p}_{4}=\mathrm{exp}\left({\theta}_{0}+{\theta}_{4}\right),{p}_{5}=\mathrm{exp}\left({\theta}_{0}+{\theta}_{1}+{\theta}_{4}+{\theta}_{5}\right),{p}_{6}=\mathrm{exp}\left({\theta}_{0}+{\theta}_{2}+{\theta}_{4}+{\theta}_{6}\right),$

${p}_{7}=\mathrm{exp}\left({\theta}_{0}+{\theta}_{1}+{\theta}_{2}+{\theta}_{3}+{\theta}_{4}+{\theta}_{5}+{\theta}_{6}+{\theta}_{7}\right).$

Because of the nature of the exponential function, a zero probability corresponds to 0 = e^{−∞}; to avoid this singularity, the minimum probability, p_{i}, is set to some very small value, ε, rather than 0.

Example

Let $p=\left\{2,7,3,8,2,5,5,6\right\}$ , then $\theta =\left\{0.693,1.252,0.405,-0.271,0,-0.336,0.510,-0.462\right\}$ .

Example

Let $\theta =\left\{0,0,0,0,0,0,0,2.302\right\}$ , then $p=\left\{1,1,1,1,1,1,1,10\right\}$ .

Example

Let $\theta =\left\{2.302,-0.223,-1.609,1.832,-0.510,-0.875,1.763,-1.581\right\}$ , then $p=\left\{10,8,2,10,6,2,7,3\right\}$ .
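With F = log and G = exp, the forward transformation (15) reduces to an inclusion-exclusion sum over subsets of the set bits of the flat index, and the inverse (16) to a plain sum of θ's. The following sketch (helper names invented) reproduces the first example above up to rounding:

```python
# With F = log and G = exp, Equation (15) becomes an inclusion-exclusion sum
# over subsets of the set bits of x, and Equation (16) a plain sum of theta's.
# The helper names below are invented for this sketch.
import math

def sub_indices(x):
    """All flat indices y whose set bits form a subset of x's set bits."""
    return [y for y in range(x + 1) if y & x == y]

def theta_from_p(p):
    """Forward theta-transformation (F = log), flat indexing."""
    bits = lambda v: bin(v).count("1")
    return [sum((-1) ** (bits(x) - bits(y)) * math.log(p[y])
                for y in sub_indices(x))
            for x in range(len(p))]

def p_from_theta(theta):
    """Inverse transformation (G = exp), flat indexing."""
    return [math.exp(sum(theta[y] for y in sub_indices(x)))
            for x in range(len(theta))]

p = [2, 7, 3, 8, 2, 5, 5, 6]          # the first example above
theta = theta_from_p(p)
print([round(t, 3) for t in theta])
# → [0.693, 1.253, 0.405, -0.272, 0.0, -0.336, 0.511, -0.462]

# Round trip: G undoes F
assert all(abs(a - b) < 1e-9 for a, b in zip(p_from_theta(theta), p))
```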

6. θ-Transformation Is Complete

Because the θ-transformation is implemented by an ordinary invertible function pair, FG = GF = I, any distribution function can be expanded as long as there are no singular points in the transformation. For example, in the last section we required ${p}_{i}\ge \epsilon $ , where ε is a predefined small number.

7. ABM Is Complete

An ABM with n neurons has neural connections up to the n^{th} order. The invariant distribution is:

$p\left(x\right)={\text{e}}^{H}$ ,

$H={\theta}_{0}+\sum {\theta}_{1}^{{i}_{1}}{x}_{{i}_{1}}+\sum {\theta}_{2}^{{i}_{1}{i}_{2}}{x}_{{i}_{1}}{x}_{{i}_{2}}+\sum {\theta}_{3}^{{i}_{1}{i}_{2}{i}_{3}}{x}_{{i}_{1}}{x}_{{i}_{2}}{x}_{{i}_{3}}+\cdots +{\theta}_{n}^{12\cdots n}{x}_{1}{x}_{2}\cdots {x}_{n}.$ (17)

An ABM implements a θ-transformation [12] [13] [14] with:

$F\left(y\right)=\mathrm{log}\left(y\right),\text{\hspace{0.17em}}G\left(y\right)=\mathrm{exp}\left(y\right).$ (18)

Furthermore, the “connection matrix” element can be calculated as follows [12] [13] [14] :

${\theta}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}=\mathrm{log}\frac{{p}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}{p}_{m-2}^{{i}_{1}\cdots {i}_{m-2}}\cdots {p}_{m-2}^{{i}_{3}\cdots {i}_{m}}{p}_{m-4}^{\cdots}}{{p}_{m-1}^{{i}_{1}\cdots {i}_{m-1}}\cdots {p}_{m-1}^{{i}_{2}\cdots {i}_{m}}{p}_{m-3}^{{i}_{1}\cdots {i}_{m-3}}\cdots}$ (19)

The reverse problem is: given an ABM, the invariant distribution can be calculated as follows [12] [13] [14] :

${p}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}=\mathrm{exp}\left({\theta}_{0}+{\theta}_{1}^{{i}_{1}}+{\theta}_{1}^{{i}_{2}}+\cdots +{\theta}_{1}^{{i}_{m}}+{\theta}_{2}^{{i}_{1}{i}_{2}}+{\theta}_{2}^{{i}_{1}{i}_{3}}+\cdots +{\theta}_{2}^{{i}_{m-1}{i}_{m}}+\cdots +{\theta}_{m}^{{i}_{1}{i}_{2}\cdots {i}_{m}}\right).$ (20)

Therefore, an ABM can realize a θ-expansion, which in turn can approximate any distribution. ABM is complete [12] [13] [14] .

8. Reduction of ANN from Higher Order to Lower Order

In this section, we show that we can reduce a higher order ANN to a lower order ANN by introducing more layers. We start with the base case, three neurons; then we go to the inductive step of the mathematical induction.

8.1. Third Order ABM

Assume that we have a first ANN with three neurons in one layer, $x=\left\{{x}_{1},{x}_{2},{x}_{3}\right\}$ ; the higher order distribution is:

$p\left(x\right)={\text{e}}^{H}$ ,

$H={\theta}_{0}+{\displaystyle \underset{{i}_{1}=1}{\overset{3}{\sum}}{\theta}_{1}^{{i}_{1}}{x}_{{i}_{1}}}+{\displaystyle \underset{{i}_{1}<{i}_{2},{i}_{1}=1}{\overset{3}{\sum}}{\theta}_{2}^{{i}_{1}{i}_{2}}{x}_{{i}_{1}}{x}_{{i}_{2}}}+{\theta}_{3}^{123}{x}_{1}{x}_{2}{x}_{3}$ (21)

There is only one third order term in the above distribution for 3 neurons.

We simulate the first network with a second ANN with two layers, $x=\left\{{x}_{1},{x}_{2},{x}_{3}\right\}$ and $y=\left\{{y}_{1},{y}_{2},{y}_{3},{y}_{4}\right\}$ . The transition from the first layer to the second layer is:

${y}_{1}={x}_{1},\text{}{y}_{2}={x}_{2},\text{}{y}_{3}={x}_{3},\text{}{y}_{4}={x}_{1}\ast {x}_{2}$ .

Let $\left\{{y}_{1},{y}_{2},{y}_{3}\right\}$ be the same network as the first ANN without the higher order term. The neuron, y_{4}, has 4 additional connections, which will be defined as follows:

${\theta}_{1}^{4}=0,{\theta}_{2}^{34}={\theta}_{3}^{123},\text{\hspace{0.17em}}{\theta}_{2}^{14}={\theta}_{2}^{24}=0$ .

The second order distribution of $\left\{{y}_{1},{y}_{2},{y}_{3},{y}_{4}\right\}$ is:

$p\left(y\right)={\text{e}}^{H}$ ,

$H={\theta}_{0}+{\displaystyle \underset{{i}_{1}=1}{\overset{4}{\sum}}{\theta}_{1}^{{i}_{1}}{y}_{{i}_{1}}}+{\displaystyle \underset{{i}_{1}<{i}_{2},{i}_{1}=1}{\overset{4}{\sum}}{\theta}_{2}^{{i}_{1}{i}_{2}}{y}_{{i}_{1}}{y}_{{i}_{2}}}.$

There are only connections up to the second order. Separating ${y}_{4}$ , we have:

${\displaystyle \underset{{i}_{1}=1}{\overset{4}{\sum}}{\theta}_{1}^{{i}_{1}}{y}_{{i}_{1}}}={\displaystyle \underset{{i}_{1}=1}{\overset{3}{\sum}}{\theta}_{1}^{{i}_{1}}{x}_{{i}_{1}}}+{\theta}_{1}^{4}{y}_{4}$ ,

${\displaystyle \underset{{i}_{1}<{i}_{2},{i}_{1}=1}{\overset{4}{\sum}}{\theta}_{2}^{{i}_{1}{i}_{2}}{y}_{{i}_{1}}{y}_{{i}_{2}}}={\displaystyle \underset{{i}_{1}<{i}_{2},{i}_{1}=1}{\overset{3}{\sum}}{\theta}_{2}^{{i}_{1}{i}_{2}}{x}_{{i}_{1}}{x}_{{i}_{2}}}+{\theta}_{2}^{34}{y}_{3}{y}_{4}+{\theta}_{2}^{14}{y}_{1}{y}_{4}+{\theta}_{2}^{24}{y}_{2}{y}_{4}$ .

Using the conditions:

${\theta}_{1}^{4}=0,{\theta}_{2}^{34}={\theta}_{3}^{123},\text{\hspace{0.17em}}{\theta}_{2}^{14}={\theta}_{2}^{24}=0$ ,

and substituting y-neurons by x-neurons:

${\theta}_{1}^{4}{y}_{4}=0,\text{}{\theta}_{2}^{34}{y}_{3}{y}_{4}={\theta}_{3}^{123}{x}_{1}{x}_{2}{x}_{3},\text{}{\theta}_{2}^{14}{y}_{1}{y}_{4}=0,\text{}{\theta}_{2}^{24}{y}_{2}{y}_{4}=0$ ,

we have:

$H={\theta}_{0}+{\displaystyle \underset{{i}_{1}=1}{\overset{3}{\sum}}{\theta}_{1}^{{i}_{1}}{x}_{{i}_{1}}}+{\displaystyle \underset{{i}_{1}<{i}_{2},{i}_{1}=1}{\overset{3}{\sum}}{\theta}_{2}^{{i}_{1}{i}_{2}}{x}_{{i}_{1}}{x}_{{i}_{2}}}+{\theta}_{3}^{123}{x}_{1}{x}_{2}{x}_{3}$ .

So in this base case, we have proven that the invariant distribution of a higher order ANN can be precisely duplicated by the invariant distribution of a lower order ANN with multiple layers.
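This base case can be checked numerically. In the sketch below (names and θ values invented, θ's chosen at random), `H_third` is the third order exponent (21) over x, and `H_second` is the second order exponent over y with the connections defined above:

```python
# Numerical check of the base case: a third order exponent H over
# x = (x1, x2, x3) equals a second order exponent over y = (y1, y2, y3, y4)
# with y4 = x1*x2, theta_2^{34} = theta_3^{123}, and the other new
# connections set to 0. The theta values are random; names are invented.
import itertools
import random

random.seed(0)
t0 = random.random()
t1 = [random.random() for _ in range(3)]                  # theta_1^{i}
t2 = {ij: random.random()
      for ij in itertools.combinations(range(3), 2)}      # theta_2^{ij}
t3 = random.random()                                      # theta_3^{123}

def H_third(x):
    return (t0 + sum(t1[i] * x[i] for i in range(3))
            + sum(t2[i, j] * x[i] * x[j] for i, j in t2)
            + t3 * x[0] * x[1] * x[2])

def H_second(x):
    y = [x[0], x[1], x[2], x[0] * x[1]]                   # deterministic extra neuron
    t2y = dict(t2)
    t2y[2, 3] = t3                                        # theta_2^{34} = theta_3^{123}
    return (t0 + sum(t1[i] * y[i] for i in range(3))      # theta_1^{4} = 0
            + sum(t2y[i, j] * y[i] * y[j] for i, j in t2y))

for x in itertools.product([0, 1], repeat=3):
    assert abs(H_third(x) - H_second(x)) < 1e-12
print("third order H reproduced by the two-layer second order network")
```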

8.2. Higher Order ABM

We now show how to reduce an n^{th} order term to (n − 1)^{th} order terms.

Consider a first ANN that has neural connections up to the n^{th} order. There is only one n^{th} order term:

${\theta}_{n}^{12\cdots n}{x}_{1}{x}_{2}\cdots {x}_{n}$ .

The rest of the connections are at most (n − 1) order.

We simulate the first network with a second ANN, which has an additional layer $y=\left\{{y}_{1},{y}_{2},{y}_{3},\cdots ,{y}_{n},{y}_{n+1}\right\}$ added on top of the first ANN. This second ANN has neural connections only up to the (n − 1)^{th} order. The transitions from the first layer to the second layer are defined as:

${y}_{1}={x}_{1},{y}_{2}={x}_{2},\cdots ,{y}_{n}={x}_{n},{y}_{n+1}={x}_{1}\ast {x}_{2}.$

Based on the same approach as in the last section, we can reduce this term from order n to order n − 1. The neuron, y_{n+1}, has many additional connections. With the exception of the following term:

${\theta}_{n-1}^{34\cdots n+1}={\theta}_{n}^{12\cdots n}$ ,

all other connections will be defined as 0:

${\theta}_{1}^{n+1}=0,{\theta}_{2}^{1n+1}={\theta}_{2}^{2n+1}=\cdots =0,\cdots ,{\theta}_{3}^{12n+1}={\theta}_{3}^{13n+1}=\cdots =0,\cdots $

The (n − 1) order distribution of $\left\{{y}_{1},{y}_{2},{y}_{3},\cdots ,{y}_{n},{y}_{n+1}\right\}$ is:

$p\left(y\right)={\text{e}}^{H}$ ,

$H={\theta}_{0}+{\displaystyle \underset{{i}_{1}=1}{\overset{n+1}{\sum}}{\theta}_{1}^{{i}_{1}}{y}_{{i}_{1}}}+{\displaystyle \underset{{i}_{1}<{i}_{2},{i}_{1}=1}{\overset{n+1}{\sum}}{\theta}_{2}^{{i}_{1}{i}_{2}}{y}_{{i}_{1}}{y}_{{i}_{2}}}+\cdots +{\displaystyle \underset{{i}_{1}<\cdots <{i}_{n-1},{i}_{1}=1}{\overset{n+1}{\sum}}{\theta}_{n-1}^{{i}_{1}{i}_{2}\cdots {i}_{n-1}}{y}_{{i}_{1}}{y}_{{i}_{2}}\cdots {y}_{{i}_{n-1}}}.$

Separating ${y}_{n+1}$ from the other first order terms, substituting y-neurons by x-neurons, and using the condition ${\theta}_{1}^{n+1}=0$ , we have:

${\displaystyle \underset{{i}_{1}=1}{\overset{n+1}{\sum}}{\theta}_{1}^{{i}_{1}}{y}_{{i}_{1}}}={\displaystyle \underset{{i}_{1}=1}{\overset{n}{\sum}}{\theta}_{1}^{{i}_{1}}{x}_{{i}_{1}}}+{\theta}_{1}^{n+1}{y}_{n+1}={\displaystyle \underset{{i}_{1}=1}{\overset{n}{\sum}}{\theta}_{1}^{{i}_{1}}{x}_{{i}_{1}}}$

Separating ${y}_{n+1}$ from the other second order terms, substituting y-neurons by x-neurons, and using the conditions ${\theta}_{2}^{1n+1}={\theta}_{2}^{2n+1}=\cdots =0$ , we have:

$\begin{array}{c}{\displaystyle \underset{{i}_{1}<{i}_{2},{i}_{1}=1}{\overset{n+1}{\sum}}{\theta}_{2}^{{i}_{1}{i}_{2}}{y}_{{i}_{1}}{y}_{{i}_{2}}}={\displaystyle \underset{{i}_{1}<{i}_{2},{i}_{1}=1}{\overset{n}{\sum}}{\theta}_{2}^{{i}_{1}{i}_{2}}{y}_{{i}_{1}}{y}_{{i}_{2}}}+{\theta}_{2}^{1n+1}{y}_{1}{y}_{n+1}+{\theta}_{2}^{2n+1}{y}_{2}{y}_{n+1}+\cdots \\ ={\displaystyle \underset{{i}_{1}<{i}_{2},{i}_{1}=1}{\overset{n}{\sum}}{\theta}_{2}^{{i}_{1}{i}_{2}}{x}_{{i}_{1}}{x}_{{i}_{2}}}\end{array}$

…

Separating ${y}_{n+1}$ from the (n − 1)^{th} order terms, substituting y-neurons by x-neurons, and using the conditions:

${\theta}_{n-1}^{34\cdots n+1}={\theta}_{n}^{12\cdots n},\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\theta}_{n-1}^{12\cdots n+1}={\theta}_{n-1}^{13\cdots n+1}=\cdots =0,$

we have:

${\displaystyle \underset{{i}_{1}<\cdots <{i}_{n-1},{i}_{1}=1}{\overset{n+1}{\sum}}{\theta}_{n-1}^{{i}_{1}{i}_{2}\cdots}{y}_{{i}_{1}}{y}_{{i}_{2}}\cdots}={\displaystyle \underset{{i}_{1}<\cdots <{i}_{n-1},{i}_{1}=1}{\overset{n}{\sum}}{\theta}_{n-1}^{{i}_{1}{i}_{2}\cdots}{x}_{{i}_{1}}{x}_{{i}_{2}}\cdots}+{\theta}_{n}^{12\cdots n}{x}_{1}{x}_{2}\cdots {x}_{n}$

Combining all of the above terms, we have:

$p\left(x\right)={\text{e}}^{H}$

$H={\theta}_{0}+\sum {\theta}_{1}^{{i}_{1}}{x}_{{i}_{1}}+\sum {\theta}_{2}^{{i}_{1}{i}_{2}}{x}_{{i}_{1}}{x}_{{i}_{2}}+\sum {\theta}_{3}^{{i}_{1}{i}_{2}{i}_{3}}{x}_{{i}_{1}}{x}_{{i}_{2}}{x}_{{i}_{3}}+\cdots +{\theta}_{n}^{12\cdots n}{x}_{1}{x}_{2}\cdots {x}_{n}$ (22)

Now, we have proven that the invariant distribution of an n^{th} order ANN can be precisely duplicated by the invariant distribution of an (n − 1)^{th} order ANN with multiple layers.

We have shown both reduction from n = 3 to n = 2 and reduction from arbitrary n to n − 1.

9. Completeness of Deep Neural Networks

Given an unknown function, one can find a θ-expansion to convert it from the x-coordinate system to the θ-coordinate system using the following generating function:

$F\left(y\right)=\mathrm{log}\left(y\right),\text{\hspace{0.17em}}\text{\hspace{0.17em}}G\left(y\right)=\mathrm{exp}\left(y\right).$

This expansion can be represented by an ABM.
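To illustrate this θ-expansion concretely, the sketch below (the sample distribution and all names are hypothetical) recovers the θ coefficients of a strictly positive distribution over binary neurons by inclusion-exclusion on H = log p, then confirms that exp(H) reproduces p exactly:

```python
import itertools
import math

n = 3
states = list(itertools.product([0, 1], repeat=n))

# An arbitrary strictly positive distribution p(x), assumed for illustration.
weights = {s: 1.0 + 0.5 * sum(s) + 0.25 * s[0] * s[-1] for s in states}
Z = sum(weights.values())
p = {s: w / Z for s, w in weights.items()}

def indicator(T):
    """Binary state with ones exactly at the indices in T."""
    return tuple(1 if i in T else 0 for i in range(n))

# F(y) = log(y): inclusion-exclusion recovers each theta coefficient,
# theta_S = sum over subsets T of S of (-1)^{|S|-|T|} * log p(1_T).
theta = {}
for r in range(n + 1):
    for S in itertools.combinations(range(n), r):
        theta[S] = sum((-1) ** (len(S) - len(T)) * math.log(p[indicator(set(T))])
                       for k in range(len(S) + 1)
                       for T in itertools.combinations(S, k))

# G(y) = exp(y): reconstruct p(x) = e^H from the theta coefficients.
def p_reconstructed(x):
    H = sum(t for S, t in theta.items() if all(x[i] == 1 for i in S))
    return math.exp(H)

assert all(abs(p_reconstructed(x) - p[x]) < 1e-12 for x in states)
```

The expansion is exact here, not approximate, because for binary neurons the monomials $x_{i_1} x_{i_2} \cdots$ form a basis for all functions on the state space.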

An ABM is a higher-order ANN which can be expanded into multiple layers with only second-order terms. Therefore, a Deep Neural Network can realize a θ-expansion, which in turn can approximate any distribution. Hence, the Deep Neural Network is complete.

We stress that we have merely provided a proof of the existence of a solution for simulating any mapping from A to B. We did not seek an effective implementation of a Deep Neural Network from the supervised training data (A, B); we have only proved that such a network exists.

10. Conclusions

Hornik, Stinchcombe, & White have shown that multilayer feedforward networks with enough hidden layers are universal approximators. Le Roux & Bengio have shown a similar result for Restricted Boltzmann Machines. In this paper, we provide yet another proof. The advantage of this proof is that it leads to several new learning algorithms, which will be presented in our future publications.

In conclusion, we have studied the completeness problem of Deep Neural Networks. We have reviewed the Attrasoft Boltzmann Machine (ABM); an ABM with n neurons has neural connections up to the n^{th} order. We have reviewed the θ-transformation and shown that it is complete, i.e. any function can be expanded by the θ-transformation. We have further shown that the invariant distribution of an ABM is a θ-transformation; therefore, an ABM can simulate any distribution. We have then shown how to convert an ABM to a Deep Neural Network. Finally, by establishing the equivalence between an ABM and the Deep Neural Network, we have proven that the Deep Neural Network is complete. In other words, the Deep Neural Network can simulate any mapping from A to B.

Acknowledgements

I would like to thank Gina Porter for proofreading this paper.

References

[1] Hinton, G.E., Osindero, S. and Teh, Y. (2006) A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18, 1527-1554.

https://doi.org/10.1162/neco.2006.18.7.1527

[2] Bengio, Y. (2009) Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2, 1-127.

https://doi.org/10.1561/2200000006

[3] Amari, S., Kurata, K. and Nagaoka, H. (1992) Information Geometry of Boltzmann Machine. IEEE Transactions on Neural Networks, 3, 260-271.

https://doi.org/10.1109/72.125867

[4] Byrne, W. (1992) Alternating Minimization and Boltzmann Machine Learning. IEEE Transactions on Neural Networks, 3, 612-620.

https://doi.org/10.1109/72.143375

[5] Coursera (2017).

https://www.coursera.org/

[6] TensorFlow (2017).

https://www.tensorflow.org/

[7] Torch (2017).

http://torch.ch/

[8] Theano (2017).

http://deeplearning.net/software/theano/introduction.html

[9] Hornik, K., Stinchcombe, M. and White, H. (1989) Multilayer Feedforward Networks Are Universal Approximators. Neural Networks, 2, 359-366.

https://doi.org/10.1016/0893-6080(89)90020-8

[10] Le Roux, N. and Bengio, Y. (2008) Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Neural Computation, 20, 1631-1649.

https://doi.org/10.1162/neco.2008.04-07-510

[11] Feller, W. (1968) An Introduction to Probability Theory and Its Application. John Wiley and Sons, New York.

[12] Liu, Y. (1993) Image Compression Using Boltzmann Machines. Proceedings of SPIE, 2032, 103-117.

https://doi.org/10.1117/12.162027

[13] Liu, Y. (1995) Boltzmann Machine for Image Block Coding. Proceedings of SPIE, 2424, 434-447.

https://doi.org/10.1117/12.205245

[14] Liu, Y. (1997) Character and Image Recognition and Image Retrieval Using the Boltzmann Machine. Proceedings of SPIE, 3077, 706-715.

https://doi.org/10.1117/12.271533

[15] Liu, Y. (2002) US Patent No. 7,773,800.

http://www.google.com/patents/US7773800