The design of mathematical models for technical, economic, social, and other systems with uncertainties is an important problem from both theoretical and practical points of view. This problem attracts the close attention of many researchers, and significant progress has recently been achieved in this area. Within this area, new methods and modern intelligent algorithms dealing with uncertain systems have recently been proposed in     . They include some new optimization approaches advanced, in particular, in the papers   .
Over the past decades, interest has been growing in the use of multilayer neural networks, applied among other things as models for the adaptive identification of nonlinearly parameterized dynamic systems     . This has been motivated by the theoretical works of several researchers   who proved that, even with one hidden layer, a neural network can uniformly approximate any continuous mapping over a compact domain, provided that the network has a sufficient number of neurons with corresponding weights. The theoretical background on neural network modeling may be found in the book  .
Different learning methods for updating the weights of neural networks have been reported in the literature. Most of these methods rely on the gradient concept  . One of these methods is based on utilizing the Lyapunov stability theory   .
The convergence of the online gradient training procedure for input signals of a deterministic (non-stochastic) nature was studied by many authors  -  . Several of these authors assumed that the training set must be finite, whereas in online identification schemes this set is theoretically infinite. Nevertheless, we recently observed a non-stochastic learning process in which this procedure did not converge for a certain infinite sequence of training examples  .
A probabilistic asymptotic analysis of the convergence of online gradient training algorithms has been conducted in  -  . Several of these results make it possible to employ a constant learning rate   . To the best of the authors' knowledge, there are no general results in the literature concerning the global convergence properties of training procedures with a fixed learning rate applicable to the case of an infinite learning set.
A popular approach to analyzing the asymptotic behavior of online gradient algorithms in the stochastic case is based on martingale convergence theory  . This approach has been exploited by the authors in  to derive some local convergence results in a stochastic framework for standard online gradient algorithms with a constant learning rate.
The main difficulty associated with the convergence properties of online gradient learning algorithms is how to guarantee the boundedness of the network weights and biases when the learning process is theoretically infinite. To overcome this difficulty, a penalty term was added to the error function in  . Recently, however, we established in  that the global convergence of these algorithms with probability 1 can be achieved without any additional term, at least in the case when the activation function of the network output layer is linear.
This work has been motivated by the fact that the standard gradient algorithm is widely used for online updating of neural network weights in accordance with the gradient-descent principle, whereas the following important question about its ultimate properties has remained partly open: when does the sequential procedure based on this algorithm converge if the learning rate is constant? As pointed out in  , answering this question should shed some light on the asymptotic features of multilayer neural networks trained by gradient-like techniques, and it is the first step toward a full understanding of other, more generic training algorithms based on regularization, conjugate gradient, Newton optimization methods, etc.
The novelty of this paper, which extends the basic ideas of  to the case where the activation function of the output layer is nonlinear, consists in establishing sufficient conditions under which the gradient algorithm for learning neural networks converges globally almost surely in the case when the learning rate may be constant. The proposed approach to deriving these convergence results is based on the Lyapunov methodology  . The results make it possible to reveal some new features of multilayer neural networks with a nonlinear activation function in the output layer that use online gradient-type training algorithms with a constant learning rate.
2. Description of Learning Neural Network System: Problem Formulation
Consider the typical three-layer feedforward neural network containing a hidden layer and p inputs, q hidden neurons, and one output neuron. Denote by
the weight matrix connecting the input and hidden layers, and define the so-called bias vector as
which is the threshold in the hidden-layer output. Further, let
be the weight vector between the hidden and output layers, and be the bias in the output layer. As in  , the activation functions used in the hidden neurons are all the same denoted by , and the activation function for the output layer is .
Now, denoting by
the vector-valued function which depends on the vector , introduce the extended matrix by adding the column to W and the extended vector , and also the function of z. Then, for an input vector
the output vector of hidden layer can be written as , where the notation of the extended vector is used, and the final output of the neural network can be expressed as follows:
with be an unknown and bounded nonlinearity given over the bounded either finite or infinite sets which are depicted in Figure 1 for the case . This function needs to be approximated by the neural network (1) via suitable choice of and . By virtue of (2) the approximation error
depends on x for any fixed .
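The network structure described in this section can be sketched in code as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names `forward` and `approximation_error`, the choice of tanh for both activation functions, and the sign convention of the error (target minus network output) are assumptions introduced here.

```python
import math


def forward(W, b, v, c, x, g=math.tanh, f=math.tanh):
    """Output of a network with p inputs, q hidden neurons and one output.

    W is a q-by-p list of rows, b a list of q hidden biases,
    v a list of q output weights, c the output bias, x a list of p inputs.
    The default activations (tanh) are an illustrative assumption.
    """
    # Hidden layer: each neuron applies g to its weighted input sum plus bias.
    hidden = [
        g(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]
    # Output layer: f applied to the weighted sum of hidden outputs plus bias.
    return f(sum(v_i * h_i for v_i, h_i in zip(v, hidden)) + c)


def approximation_error(target, W, b, v, c, x):
    """Difference between a target function and the network output at x
    (the sign convention is assumed here)."""
    return target(x) - forward(W, b, v, c, x)
```

For any fixed set of weights and biases, `approximation_error` is a function of the input x only, which mirrors the remark above that the approximation error depends on x for a fixed parameter vector.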
Now, suppose that some complex system to be identified is described at each nth time instant by the equation
in which and are its input and output signals, respectively, which are available for measurement.
Based on the infinite sequence of the training examples that is
generated by (4), the online learning algorithm for updating the weights and biases in (1) is defined as the standard gradient-descent iteration procedure
In these equations, and denote the current gradients of the error function with respect to and , respectively, obtained after substituting , , , and into (3), and represents the step size (the learning rate). Note that the expressions for and may be written out in detail similarly to   . (They are omitted here due to space limitations.)
Figure 1. Training sets: (a) X is an infinite set of xs; (b) X is a finite set of xs.
Introducing the notation
of the extended weight and bias vector , and considering the Equations (5) and (6) in conjunction, rewrite the online gradient learning algorithm for updating in a general form (as in  )
where represents the gradient of with respect to calculated at the nth time instant.
Thus, the Equation (7) together with the expression
in which is given by (4), and
describe the learning neural network system necessary to identify the nonlinearity (2). To better illustrate the operation of this system, its structure is depicted in Figure 2, where the notation is used.
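One iteration of an online gradient update with a constant step size can be sketched as follows. This is a hedged illustration rather than the paper's exact expressions (which are omitted in the text): it assumes a squared-error criterion of the form 0.5 * err**2, tanh activations in both layers, and the hypothetical function name `online_gradient_step`.

```python
import math


def online_gradient_step(W, b, v, c, x, y, eta):
    """One update of all weights and biases from a single example (x, y).

    Assumptions made for illustration: tanh in the hidden and output layers,
    error criterion 0.5 * err**2, constant learning rate eta.
    Returns the updated parameters and the error before the update.
    """
    hidden = [
        math.tanh(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]
    out = math.tanh(sum(v_i * h_i for v_i, h_i in zip(v, hidden)) + c)
    err = y - out
    # Derivative of 0.5 * err**2 with respect to the output pre-activation.
    delta_out = -err * (1.0 - out * out)
    # Back-propagate through each hidden neuron, using v before it is updated.
    delta_hid = [delta_out * v_i * (1.0 - h_i * h_i)
                 for v_i, h_i in zip(v, hidden)]
    new_v = [v_i - eta * delta_out * h_i for v_i, h_i in zip(v, hidden)]
    new_c = c - eta * delta_out
    new_W = [[w_ij - eta * d_h * x_j for w_ij, x_j in zip(row, x)]
             for row, d_h in zip(W, delta_hid)]
    new_b = [b_i - eta * d_h for b_i, d_h in zip(b, delta_hid)]
    return new_W, new_b, new_v, new_c, err
```

In an online setting this function would be called once per time instant with the newly measured input-output pair, keeping `eta` fixed throughout.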
The problem formulated in this paper consists in analyzing the asymptotic properties of the learning neural network system presented above. More precisely, it is required to derive conditions under which the learning procedure is convergent, meaning the existence of a limit
in some sense  .
Suppose that there is a multilayer neural network described by
where is some fixed parameter vector. According to   , the requirement
evaluating an accuracy of the approximation of by can be satisfied for any via suitable choice of and the number of the neurons in its layers. On the other hand, the performance index of the neural network model with a fixed number of these neurons defining its approximation capability might naturally be expressed as follows:
Figure 2. Configuration of learning neural network system.
In fact, the desired (optimal) vector will then be specified from (9) as the variable minimizing :
Nevertheless, all researchers who employ online learning procedures in a stochastic environment “silently” replace by
where denotes the expected value of .
Indeed, the learning algorithm (7) does not minimize (9): namely, it minimizes (instead of )  . This observation means that (7) will at best yield
but not given by (10) as .
Now, consider a special case when the unknown function (2) can exactly be approximated by the neural network implying
In this case, called the ideal case in (  , p. 304), we have for any x from X and, consequently, .
If the condition given in identity (11) is satisfied, then the learning rate in (7) may be constant:
see (  , sect. 3.13).
Note that the property (11) may take place, in particular, when
contains a certain number of training examples
provided that their number does not exceed the dimension of . To understand this fact, according to (11) write the set of K equations
with respect to the unknown . They are compatible if . Due to (2) together with the definition of it can be concluded that their solution is just yielding because in this special case, for all .
4. Main Results
4.1. A Feature of Multilayer Neural Networks
It turns out that if the activation functions g of the hidden layer are nonlinear, then for an arbitrary fixed vector there is, at least, one vector such that the network outputs for these different vectors are the same even though the output activation function f is linear, i.e. if :
Feature (12) implies that in the presence of a nonlinear g there exist at least two different . For example, let and
in which , and with . Fix a . Then will also satisfy (12); see  . Therefore, the set of is not a single point if g is nonlinear.
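The non-uniqueness of the parameter vector can be illustrated numerically. The sketch below does not reproduce the specific example referred to above (its formulas are not available here); instead it uses two well-known symmetries, assuming a tanh hidden activation (an odd function) and a linear output: flipping the signs of all weights attached to one hidden neuron, and swapping two hidden neurons. The function name `net_output` and all numerical values are illustrative assumptions.

```python
import math


def net_output(w_hidden, b_hidden, v_out, c_out, x):
    """Scalar-input network with tanh hidden neurons and a linear output."""
    hidden = [math.tanh(w * x + b) for w, b in zip(w_hidden, b_hidden)]
    return sum(v * h for v, h in zip(v_out, hidden)) + c_out


# An arbitrary parameter vector (illustrative values).
params_a = ([0.7, -1.2], [0.3, 0.5], [1.5, -0.4], 0.2)
# Sign flip of the first hidden neuron: tanh(-z) = -tanh(z), so negating
# its input weight, bias and output weight leaves the output unchanged.
params_b = ([-0.7, -1.2], [-0.3, 0.5], [-1.5, -0.4], 0.2)
# Permutation of the two hidden neurons: the output sum is unchanged.
params_c = ([-1.2, 0.7], [0.5, 0.3], [-0.4, 1.5], 0.2)

for x in [-1.0, -0.3, 0.0, 0.4, 2.0]:
    ya = net_output(*params_a, x)
    yb = net_output(*params_b, x)
    yc = net_output(*params_c, x)
    print(x, ya, yb, yc)
```

The three printed outputs coincide (up to rounding) for every x although the three parameter vectors differ, so the set of parameter vectors producing a given input-output map contains more than one point.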
4.2. An Observation
To study some asymptotic properties of sequence caused by the learning algorithm (7) in the non-stochastic case, simulation experiments with the scalar nonlinear system (2) having the nonlinearity
were conducted. This nonlinearity can explicitly be approximated by the two-layer neural network model described by as in Subsection 4.1 with and .
Figure 3 illustrates the results of one simulation experiment with , where was chosen as a non-stochastic sequence. It can be observed that in this example, the variable shown in Figure 3(b) has no limit, implying that the learning algorithm (7) may fail to converge: in this case, the limit (8) does not exist; see Figure 3(c).
Figure 3. Behaviour of learning algorithm (7) in non-stochastic case: (a) inputs ; (b) the variable ; (c) current model error .
4.3. Sufficient Conditions for the Probabilistic Convergence of Learning Procedure
The following basic assumption concerning , which is a bounded stochastic sequence (since X is bounded), is made:
(A1) arise randomly in accordance with a probability distribution if X is finite, and with probability density if X is infinite.
Within assumption (A1), the expected value (mean) of is given by
To derive the main theoretical result we need Assumption (A1) and the following additional assumptions:
(A2) the identity (11) holds;
(A3) the activation functions used in the hidden neurons and output neuron are the same , twice continuously differentiable on and also uniformly bounded on .
Further, we introduce a scalar function playing the role of a Lyapunov function  with the following features:
(a) is nonnegative, i.e.,
(b) is the Lipschitz function in the sense that
for any from , where denotes its gradient, and represents the Lipschitz constant.
Now, the global stochastic convergence analysis of the gradient learning algorithm (7) is based on the fundamental convergence conditions established in the following Key Technical Lemma, which is a slightly reformulated version of Theorem 3 of  .
Key Technical Lemma. Let be a function satisfying (13) and (14). Define the scalar variable
with some , and denote
Introduce the additional variable
Then the algorithm (7) yields
where provided that and
A related result following from Theorem 3’ of  is:
Corollary. Under the conditions of the Key Technical Lemma, if
and , and , then
with probability 1 provided that
is satisfied. □
Next, we are able to present the convergence result summarized in the theorem below.
Theorem. Suppose Assumption (A2) holds. Then the gradient algorithm (7) with a constant learning rate, , will converge with probability 1 (in the sense that a.s.) and
for any initial chosen randomly so that if satisfies
the conditions (20) with and determined by
Proof. Set . Then conditions (13) and (14) can be shown to be valid. This indicates that this function may be taken as the Lyapunov function. By virtue of (16) such a choice of gives . Putting and with and determined by (22) and (23), respectively, we can conclude that the conditions 1), 2) of the Key Technical Lemma are satisfied. Applying its Corollary proves that with probability 1.
Due to the fact that together with Assumption (A2), result (21) follows. □
4.4. Simulations and a Discussion
To demonstrate the theoretical result given in Subsection 4.3, several simulations were conducted. First, we dealt with the same neural network and the same training samples as in (  , p. 1052). Namely, they were chosen as follows:
The two numerical examples with different initial were considered. In Example 1 we set , , , , , , , , . In Example 2 we set , , , , , , , , .
In contrast to  , the learning rate was chosen as in order to implement the algorithms (5), (6) with no penalty term.
Further simulation experiments were also conducted. In contrast with the previous experiments, they dealt with an infinite training set X. Namely, two simulations with the same nonlinear function as in Subsection 4.2 were first conducted, where X is the infinite bounded set given by . However, was now chosen as a stochastic sequence: it was generated as a pseudorandom i.i.d. sequence.
Two numerical examples were considered. In Example 3, the initial values of the neural network weights and biases were taken as follows: , , , . In Example 4 we set , , , . Figure 6 and Figure 7 demonstrate the results of the two simulation experiments conducted with the initial estimates given above. In both experiments, was also chosen as .
Figure 4. Behavior of gradient learning algorithm (7) in Example 1.
Figure 5. Behavior of gradient learning algorithm (7) in Example 2.
Next, another nonlinearity
with and to be exactly approximated by a suitable neural network was chosen as in [11, p. 12-4]. The following initial estimates were taken: , , , , , , (Example 5), and , , , , , , (Example 6).
From Figures 4-9 we can see that the learning processes converge and the performance index tends to zero while the penalty term is absent. It can be observed that if the initial vectors are different then the sequences may converge to different final .
The simulation experiments show that the penalty term is not necessary, in principle, to achieve the convergence of the online gradient learning procedure in three-layer neural networks if the conditions given by Assumptions (A1)-(A3) are satisfied. This fact supports our theoretical results.
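An experiment of this kind can be sketched as follows. The sketch does not reproduce Examples 1 to 6: the "true" network parameters, the initial estimates, the uniform input distribution on [-1, 1], the learning rate 0.1 and the function name `simulate_online_training` are all assumptions chosen for illustration. It implements the ideal case, where the data are generated by a network of the same structure as the model, with i.i.d. pseudorandom inputs, tanh in both layers, a constant learning rate and no penalty term.

```python
import math
import random


def forward(W, b, v, c, x):
    """Scalar-input network, tanh in hidden and output layers."""
    hidden = [math.tanh(w_i * x + b_i) for w_i, b_i in zip(W, b)]
    out = math.tanh(sum(v_i * h_i for v_i, h_i in zip(v, hidden)) + c)
    return hidden, out


def simulate_online_training(steps=20000, eta=0.1, seed=0):
    """Online gradient training with a constant rate and i.i.d. inputs.

    Returns the list of squared errors observed before each update.
    All numerical values below are illustrative assumptions.
    """
    rng = random.Random(seed)
    # "True" network generating the data (ideal case: exact representation).
    W_true, b_true, v_true, c_true = [1.5, -1.0], [0.5, -0.3], [1.0, -0.8], 0.1
    # Initial estimates of the model parameters.
    W, b, v, c = [0.1, 0.2], [0.0, 0.0], [0.1, -0.1], 0.0
    squared_errors = []
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)  # i.i.d. input from a bounded infinite set
        _, y = forward(W_true, b_true, v_true, c_true, x)
        hidden, out = forward(W, b, v, c, x)
        err = y - out
        squared_errors.append(err * err)
        # Gradient of 0.5 * err**2, back-propagated through both layers.
        d_out = -err * (1.0 - out * out)
        d_hid = [d_out * v_i * (1.0 - h_i * h_i) for v_i, h_i in zip(v, hidden)]
        v = [v_i - eta * d_out * h_i for v_i, h_i in zip(v, hidden)]
        c = c - eta * d_out
        W = [w_i - eta * d_h * x for w_i, d_h in zip(W, d_hid)]
        b = [b_i - eta * d_h for b_i, d_h in zip(b, d_hid)]
    return squared_errors


if __name__ == "__main__":
    errs = simulate_online_training()
    print("mean squared error, first 100 steps:", sum(errs[:100]) / 100)
    print("mean squared error, last 1000 steps:", sum(errs[-1000:]) / 1000)
```

Running the script with different seeds or initial estimates gives a rough feel for the behavior discussed above; it is a numerical illustration only and proves nothing about convergence.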
Figure 6. Behavior of gradient learning algorithm (7) in Example 3.
Figure 7. Behavior of gradient learning algorithm (7) in Example 4.
Figure 8. Behavior of gradient learning algorithm (7) in Example 5.
Figure 9. Behavior of gradient learning algorithm (7) in Example 6.
In this paper, some important features of multilayer neural networks utilized as nonlinearly parameterized models of unknown nonlinear systems to be identified have been derived. A special case where the nonlinearity can be exactly approximated by a three-layer neural network has been studied. In contrast to the authors' previous papers, we dealt with a neural network having a nonlinear activation function in its output layer. It was shown that if the activation function of the hidden layer is nonlinear, then, for any input variables, there are at least two different network parameter vectors under which the network outputs are the same, even if the output activation function is linear. This feature implies that the standard online gradient training algorithm with a constant learning rate may fail to converge if the training sequence is non-stochastic. Nevertheless, if this sequence is stochastic, it has been established theoretically that, under certain conditions, the algorithm converges with probability one. However, the ultimate values of the network parameters may differ depending on the initial estimates. These facts were confirmed by simulation experiments.
The authors are grateful to the anonymous reviewer for valuable comments.
 Chen, L., Peng, J., Zhang, B. and Rosyida, I. (2017) Diversified Models for Portfolio Selection Based on Uncertain Semivariance. International Journal of Systems Science, 3, 637-648.
 Draa, A., Bouzoubia, S. and Boukhalfa, I. (2015) A Sinusoidal Differential Evolution Algorithm for Numerical Optimisation. Applied Soft Computing, 27, 99-126.
 Zhang, B., Peng, J., Li, S. and Chen, L. (2016) Fixed Charge Solid Transportation Problem in Uncertain Environment and Its Algorithm. Computers & Industrial Engineering, 102, 186-197.
 Suykens, J. and Moor, B.D. (1993) Nonlinear System Identification Using Multilayer Neural Networks: Some Ideas for Initial Weights, Number of Hidden Neurons and Error Criteria. Proceedings of the 12th IFAC World Congress, Sydney, Australia, 3, 49-52.
 Kosmatopoulos, E.S., Polycarpou, M.M., Christodoulou, M.A. and Ioannou, P.A. (1995) High-Order Neural Network Structures for Identification of Dynamical Systems. IEEE Transactions on Neural Networks, 6, 422-431.
 Tsypkin, Ya.Z., Mason, J.D., Avedyan, E.D., Warwick, K. and Levin, I.K. (1999) Neural Networks for Identification of Nonlinear Systems Under Random Piecewise Polynomial Disturbances. IEEE Transactions on Neural Networks, 10, 303-311.
 Behera, L., Kumar, S. and Patnaik, A. (2006) On Adaptive Learning Rate That Guarantees Convergence in Feedforward Networks. IEEE Transactions on Neural Networks, 17, 1116-1125.
 Mangasarian, O.L. and Solodov, M.V. (1994) Serial and Parallel Backpropagation Convergence via Nonmonotone Perturbed Minimization. Optimization Methods and Software, 4, 103-116.
 Luo, Z. and Tseng, P. (1994) Analysis of an Approximate Gradient Projection Method with Application to the Backpropagation Algorithm. Optimization Methods and Software, 4, 85-101.
 Ellacott, S.W. (1993) The Numerical Analysis Approach. In: Taylor, J.G., Ed., Mathematical Approaches to Neural Networks, Elsevier Science Publisher B.V., Amsterdam, 103-137.
 Wu, W. and Shao, Z. (2003) Convergence of an Online Gradient Method for Continuous Perceptrons with Linearly Separable Training Patterns. Applied Mathematics Letters, 16, 999-1002.
 Wu, W. and Xu, Y.S. (2002) Deterministic Convergence of an Online Gradient Method for Neural Networks. Journal of Computational and Applied Mathematics, 144, 335-347.
 Wu, W., Feng, G.R., Li, X. and Xu, Y.S. (2005) Deterministic Convergence of an Online Gradient Method for BP Neural Networks. IEEE Transactions on Neural Networks, 16, 1-9.
 Wu, W., Feng, G. and Li, X. (2002) Training Multilayer Perceptrons via Minimization of Ridge Functions. Advances in Computational Mathematics, 17, 331-347.
 Wu, W., Shao, H. and Qu, D. (2005) Strong Convergence for Gradient Methods for BP Networks Training. Proceedings of 2005 International Conference on Neural Networks and Brain, Beijing, 13-15 October 2005, 332-334.
 Zhang, N., Wu, W. and Zheng, G. (2006) Convergence of Gradient Method with Momentum for Two-Layer Feedforward Neural Networks. IEEE Transactions on Neural Networks, 17, 522-525.
 Zhiteckii, L.S., Azarskov, V.N. and Nikolaienko, S.A. (2012) Convergence of Learning Algorithms in Neural Networks for Adaptive Identification of Nonlinearly Parameterized Systems. Proceedings 16th IFAC Symposium on System Identification Brussels, 11-13 July 2012, 1593-1598.
 Li, Z., Wu, W. and Tian, Y. (2004) Convergence of an Online Gradient Method for FNN with Stochastic Inputs. Journal of Computational and Applied Mathematics, 163, 165-176.
 White, H. (1989) Some Asymptotic Results for Learning in Single Hidden-Layer Feedforward Neural Network Models. Journal of the American Statistical Association, 84, 1003-1013.
 Finnoff, W. (1994) Diffusion Approximations for the Constant Learning Rate Backpropagation Algorithm and Resistance to Local Minima. Neural Computing and Applications, 6, 285-295.
 Gaivoronski, A.A. (1994) Convergence Properties of Backpropagation for Neural Nets via Theory of Stochastic Gradient Methods. Optimization Methods and Software, 4, 117-134.
 Tadic, V. and Stankovic, S. (2000) Learning in Neural Networks by Normalized Stochastic Gradient Algorithm: Local Convergence. Proceedings of the 5th Seminar on Neural Network Applications in Electrical Engineering, Yugoslavia, 26-27 September 2000, 11-17.
 Zhang, H., Wu, W., Liu, F. and Yao, M. (2009) Boundedness and Convergence of Online Gradient Method with Penalty for Feedforward Neural Networks. IEEE Transactions on Neural Networks, 20, 1050-1054.
 Azarskov, V.N., Kucherov, D.P., Nikolaienko, S.A. and Zhiteckii, L.S. (2015) Asymptotic Behaviour of Gradient Learning Algorithms in Neural Network Models for the Identification of Nonlinear Systems. American Journal of Neural Networks and Applications, 1, 1-10.