The pi-sigma network (PSN) is a class of higher-order feedforward neural networks that combines the fast convergence rate of single-layer networks with the nonlinear mapping capability unique to higher-order networks [1]. To further broaden the applicability of the network, Li introduced a more complex network structure based on the PSN, called the sigma-pi-sigma neural network (SPSNN). The SPSNN can learn to implement static mappings in a manner similar to that of multilayer neural networks and radial basis function networks.
The gradient method is often used for training neural networks; its main disadvantages are slow convergence and the local minimum problem. To speed up and stabilize the training iteration of the gradient method, a momentum term is often added to the weight increment formula, so that the present weight updating increment is a combination of the present gradient of the error function and the previous weight updating increment. Many researchers have developed the theory of momentum and extended its applications. For the back-propagation algorithm, Phansalkar and Sastry give a stability analysis of the update with an added momentum term. Torii and Bhaya discuss the convergence of the gradient method with momentum under the restriction that the error function is quadratic [2]. Shao et al. study adaptive momentum for both the batch gradient method and the online gradient method, and compare the efficiency of momentum with that of a penalty term [3]. The key to the convergence analysis of momentum algorithms is the monotonicity of the error function during the learning procedure, which is generally proved under the assumption that the activation function and its derivatives are uniformly bounded. In [4] and related work, convergence results for the gradient method with momentum are given for both two-layer and multilayer feedforward neural networks. In this paper, we consider the gradient method with momentum for sigma-pi-sigma neural networks and discuss its convergence.
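The momentum update described above can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the function name `heavy_ball`, the constant momentum coefficient, and the quadratic test objective are our own choices.

```python
import numpy as np

def heavy_ball(grad, W0, eta=0.1, mu=0.5, steps=100):
    """Gradient descent with a constant momentum term: the new weight
    increment combines the current gradient with the previous increment."""
    W, dW = W0.copy(), np.zeros_like(W0)
    for _ in range(steps):
        dW = -eta * grad(W) + mu * dW   # momentum-augmented increment
        W = W + dW
    return W

# Usage: minimize E(W) = 0.5*||W||^2, whose gradient is simply W.
W = heavy_ball(lambda w: w, np.array([4.0, -2.0]))
```

On this quadratic objective the iterates contract geometrically toward the minimizer at the origin, illustrating the acceleration effect of the momentum term.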
The rest of the paper is organized as follows. In Section 2 we introduce the neural network model of SPSNN and the gradient method with momentum. In Section 3 we give the convergence analysis of the gradient method with momentum for training SPSNN. Numerical experiments are given in Section 4. Finally, in Section 5, we end the paper with some conclusions.
2. The Neural Network Model of SPSNN and the Gradient Method with Momentum
In this section we introduce the sigma-pi-sigma neural network, which is composed of multilayer networks. The output of the SPSNN has the form
$$y=\sum_{k=1}^{K}f_{k}(x),$$
where $x=(x_{1},\ldots,x_{n})^{T}$ is an input, $n$ is the number of inputs, each $f_{k}$ is a function to be generated through the network training, and $K$ is the number of pi-sigma networks (PSNs), the basic building block of the SPSNN. Each $f_{k}$ is expressed as a weighted sum of basis functions, where each basis function takes the value either 0 or 1, the weight values are stored in memory, and the addresses of the active weights are information numbers also stored in memory. For a $K$-th order SPSNN, the total number of weight values is determined by $K$, the number of inputs $n$, and the number of basis functions.
For a set of training examples $\{x^{j},O^{j}\}_{j=1}^{J}$, where $O^{j}$ is the ideal output for the input $x^{j}\in\mathbb{R}^{n}$, we have the following actual output:
$$y^{j}=\sum_{k=1}^{K}f_{k}(x^{j}),\qquad j=1,\ldots,J,$$
where $x_{i}^{j}$ denotes the $i$th element of the given input vector $x^{j}$.
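The sum-of-products structure behind the model above can be sketched as follows. This is a simplified illustration under stated assumptions: we take each PSN to be a product of plain linear sums, omitting the memory-stored 0/1 basis functions of Li's SPSNN, and the names `psn_forward` and `spsnn_forward` are hypothetical.

```python
import numpy as np

def psn_forward(x, W):
    """Forward pass of one pi-sigma network (PSN), here taken as a product
    of linear sums.

    x : input vector of length n
    W : array of shape (L, n); row l holds the weights of the l-th summing unit.
    """
    sums = W @ x          # sigma layer: L linear combinations of the input
    return np.prod(sums)  # pi layer: product of the sums

def spsnn_forward(x, psn_weights):
    """SPSNN output: the sum of K PSN outputs (the outer sigma layer)."""
    return sum(psn_forward(x, W) for W in psn_weights)

# Usage with K = 4 PSNs of 2 summing units each on a 3-dimensional input.
x = np.array([0.5, -1.0, 2.0])
weights = [np.random.randn(2, 3) for _ in range(4)]
y = spsnn_forward(x, weights)
```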
In order to train the SPSNN, we choose a quadratic error function $E(W)$:
$$E(W)=\frac{1}{2}\sum_{j=1}^{J}\left(O^{j}-y^{j}\right)^{2},$$
where $W\in\mathbb{R}^{d}$ collects all the weight values of the network and $d$ is the total number of weights. For convenience we denote $\Delta W^{n}=W^{n}-W^{n-1}$.
The gradient method with momentum is used to train the weights. The gradient of $E$ at $W$ is denoted by
$$E_{W}(W)=\left(\frac{\partial E}{\partial W_{1}}(W),\ldots,\frac{\partial E}{\partial W_{d}}(W)\right)^{T},$$
and the Hessian matrix of $E$ at $W$ is denoted by
$$E_{WW}(W)=\left(\frac{\partial^{2}E}{\partial W_{p}\,\partial W_{q}}(W)\right)_{d\times d}.$$
Given arbitrary initial weight vectors $W^{0}$ and $W^{1}$, the gradient method with momentum updates the weight vector $W$ by
$$W^{n+1}=W^{n}-\eta E_{W}(W^{n})+\mu_{n}\Delta W^{n},\qquad n=1,2,\ldots,\qquad (1)$$
where $\eta>0$ is the learning rate, $\mu_{n}\Delta W^{n}$ is called the momentum term, and $\mu_{n}\geq 0$ is the momentum coefficient.
Similar to [3] and [4], in this paper we choose the momentum coefficient $\mu_{n}$ as follows:
$$\mu_{n}=\frac{\mu\left\|E_{W}(W^{n})\right\|}{m+\left\|\Delta W^{n}\right\|},\qquad (2)$$
where $m$ is a positive number and $\mu>0$, and $\|\cdot\|$ is the 2-norm in this paper. With this choice the momentum term is bounded: $\left\|\mu_{n}\Delta W^{n}\right\|\leq\mu\left\|E_{W}(W^{n})\right\|$.
Notice that the component form of (1) is
$$W_{i}^{n+1}=W_{i}^{n}-\eta\frac{\partial E}{\partial W_{i}}(W^{n})+\mu_{n}\Delta W_{i}^{n},\qquad i=1,\ldots,d,$$
where $W_{i}^{n}$ denotes the $i$th component of $W^{n}$. Recalling the definition of $E(W)$, the partial derivatives $\partial E/\partial W_{i}$ are computed by the chain rule through the sum-of-products structure of the network.
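The update rule just described can be sketched as a training loop. This is an illustrative sketch, not the paper's exact implementation: the adaptive coefficient below is one common choice that bounds the momentum step by a multiple of the current gradient norm, and the names `train_momentum` and `grad`, as well as the quadratic test objective, are our own assumptions.

```python
import numpy as np

def train_momentum(grad, W0, W1, eta=0.1, mu=0.05, m=1e-8, steps=200):
    """Gradient method with an adaptive momentum coefficient:

        W^{n+1} = W^n - eta * grad(W^n) + mu_n * (W^n - W^{n-1}),
        mu_n    = mu * ||grad(W^n)|| / (m + ||W^n - W^{n-1}||),

    so the momentum term never exceeds mu * ||grad(W^n)|| in norm.
    """
    W_prev, W = W0, W1
    for _ in range(steps):
        g = grad(W)
        dW = W - W_prev
        mu_n = mu * np.linalg.norm(g) / (m + np.linalg.norm(dW))
        W_prev, W = W, W - eta * g + mu_n * dW
    return W

# Usage: minimize E(W) = 0.5*||W||^2, whose gradient is W.
W = train_momentum(lambda w: w, np.zeros(3), np.ones(3))
```

The small positive constant `m` keeps the coefficient well defined when the previous increment vanishes, mirroring the role of the positive number $m$ in the text.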
3. Convergence Results
Similar to [4], we need the following assumptions.
(A1): The elements of the Hessian matrix $E_{WW}(W)$ are uniformly bounded for any $W\in\mathbb{R}^{d}$.
(A2): The number of elements of the set $\Omega_{0}=\{W:E_{W}(W)=0\}$ is finite.
From (A1), it is easy to see that there exists a constant $C>0$ such that
$$\left\|E_{WW}(W)\right\|\leq C,\qquad W\in\mathbb{R}^{d}.$$
Lemma 3.1 ([4]). Let $E:\mathbb{R}^{d}\to\mathbb{R}$ be continuously differentiable, let the number of elements of the set $\Omega_{0}=\{W:E_{W}(W)=0\}$ be finite, and let the sequence $\{W^{n}\}$ satisfy
$$\lim_{n\to\infty}\left\|W^{n+1}-W^{n}\right\|=0,\qquad\lim_{n\to\infty}\left\|E_{W}(W^{n})\right\|=0.$$
Then there exists a $W^{*}\in\Omega_{0}$ such that $\lim_{n\to\infty}W^{n}=W^{*}$.
Theorem 3.2. Suppose Assumption (A1) is satisfied. Then there exist $\eta>0$ and $\mu>0$ such that, for the iteration (1) with the momentum coefficient (2), the following weak convergence result holds:
$$\lim_{n\to\infty}\left\|E_{W}(W^{n})\right\|=0.$$
Furthermore, if Assumption (A2) is also valid, then the strong convergence result holds; that is, there exists $W^{*}\in\Omega_{0}$ such that
$$\lim_{n\to\infty}W^{n}=W^{*}.$$
Proof. Using Taylor's formula, we expand $E$ at $W^{n}$:
$$E(W^{n+1})=E(W^{n})+E_{W}(W^{n})\cdot\Delta W^{n+1}+\frac{1}{2}\left(\Delta W^{n+1}\right)^{T}E_{WW}(\xi^{n})\,\Delta W^{n+1},\qquad (3)$$
where $\xi^{n}$ lies between $W^{n}$ and $W^{n+1}$.
From (2) we have
$$\left\|\mu_{n}\Delta W^{n}\right\|=\frac{\mu\left\|E_{W}(W^{n})\right\|\left\|\Delta W^{n}\right\|}{m+\left\|\Delta W^{n}\right\|}\leq\mu\left\|E_{W}(W^{n})\right\|.$$
Together with (1), the above estimate gives
$$E_{W}(W^{n})\cdot\Delta W^{n+1}=-\eta\left\|E_{W}(W^{n})\right\|^{2}+\mu_{n}\,E_{W}(W^{n})\cdot\Delta W^{n}\leq-\eta\left\|E_{W}(W^{n})\right\|^{2}+\mu\left\|E_{W}(W^{n})\right\|^{2}=-(\eta-\mu)\left\|E_{W}(W^{n})\right\|^{2}.$$
It is easy to see that
$$\left\|\Delta W^{n+1}\right\|\leq\eta\left\|E_{W}(W^{n})\right\|+\left\|\mu_{n}\Delta W^{n}\right\|\leq(\eta+\mu)\left\|E_{W}(W^{n})\right\|.$$
Together with (3), we have
$$E(W^{n+1})\leq E(W^{n})-(\eta-\mu)\left\|E_{W}(W^{n})\right\|^{2}+\frac{C}{2}(\eta+\mu)^{2}\left\|E_{W}(W^{n})\right\|^{2}.$$
Set $\beta=(\eta-\mu)-\frac{C}{2}(\eta+\mu)^{2}$. Then
$$E(W^{n+1})\leq E(W^{n})-\beta\left\|E_{W}(W^{n})\right\|^{2}.\qquad (4)$$
It is easy to see that $\beta>0$ when
$$0<\mu<\eta\quad\text{and}\quad C(\eta+\mu)^{2}<2(\eta-\mu).\qquad (5)$$
If $\eta$ and $\mu$ satisfy (5), then the sequence $\{E(W^{n})\}$ is monotonically decreasing. Since $E(W^{n})$ is nonnegative, it must converge to some $E^{*}\geq 0$; that is,
$$\lim_{n\to\infty}E(W^{n})=E^{*}.$$
By (4) it is easy to see that, for any positive integer $N$, it holds
$$\beta\sum_{n=1}^{N}\left\|E_{W}(W^{n})\right\|^{2}\leq E(W^{1})-E(W^{N+1})\leq E(W^{1}).$$
Letting $N\to\infty$, we have $\sum_{n=1}^{\infty}\left\|E_{W}(W^{n})\right\|^{2}\leq E(W^{1})/\beta<\infty$, so $\lim_{n\to\infty}\left\|E_{W}(W^{n})\right\|=0$, which finishes the proof of the weak convergence.
By (1), we have
$$\left\|W^{n+1}-W^{n}\right\|=\left\|\Delta W^{n+1}\right\|\leq(\eta+\mu)\left\|E_{W}(W^{n})\right\|\to 0\quad\text{as }n\to\infty.$$
From Lemma 3.1, there exists $W^{*}\in\Omega_{0}$ such that
$$\lim_{n\to\infty}W^{n}=W^{*},$$
which finishes the proof of the strong convergence.
4. Numerical Results
In this section, we present an example to illustrate the convergence behavior of the iteration (1) by comparing the number of iteration steps (IT), the elapsed CPU time in seconds (CPU), and the relative residual error (RES). The experiment is terminated when the current iterate satisfies the prescribed residual tolerance or the maximum number of iteration steps k = 1000 is exceeded. The computations are implemented in MATLAB on a PC with an Intel(R) Core(TM) CPU 1000M @ 1.80 GHz and 2.00 GB of memory.
Example 4.1. Four-dimensional parity problem (Table 1).
Table 1. The data samples.
Table 2. Optimal parameters, CPU times, iteration numbers, and residuals.
In this simulation experiment, the initial weight vector $W^{0}$ is a 24-dimensional zero vector and $W^{1}$ is a 24-dimensional vector whose elements are all 1. The learning rate $\eta$ and the momentum factor $\mu$ are chosen as listed in Table 2. The number of training samples is $J=16$. In Table 2, we compare the convergence behavior of the gradient method with momentum and the gradient method without momentum. It can be seen that the network training is improved significantly after the momentum term is added.
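The sixteen samples of the four-dimensional parity problem (the data behind Table 1) can be generated as follows. The variable names are ours, and the odd-parity convention is assumed for the ideal outputs.

```python
import itertools
import numpy as np

# All 2^4 = 16 binary input vectors of the four-dimensional parity problem.
X = np.array(list(itertools.product([0, 1], repeat=4)), dtype=float)
# Ideal output O^j: 1 if the input contains an odd number of ones, 0 otherwise.
O = X.sum(axis=1) % 2
```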
5. Conclusions
In this paper, we study the gradient method with momentum for training sigma-pi-sigma neural networks. We choose the momentum coefficient in an adaptive manner, and the corresponding weak and strong convergence results are proved. Assumptions (A1) and (A2) in this paper seem somewhat restrictive, so weakening one or both of these assumptions will be our future work.
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant No. 61572018 and the Zhejiang Provincial Natural Science Foundation of China under Grant No. LY15A010016.
References
[1] Shin, Y. and Ghosh, J. (1991) The Pi-Sigma Network: An Efficient Higher-Order Neural Network for Pattern Classification and Function Approximation. International Joint Conference on Neural Networks, 1, 13-18.
[2] Bhaya, A. and Kaszkurewicz, E. (2004) Steepest Descent with Momentum for Quadratic Functions Is a Version of the Conjugate Gradient Method. Neural Networks, 17, 65-71.
[3] Shao, H., Xu, D., Zheng, G. and Liu, L. (2012) Convergence of an Online Gradient Method with Inner-Product Penalty and Adaptive Momentum. Neurocomputing, 77, 243-252.
[4] Zhang, N., Wu, W. and Zheng, G. (2006) Convergence of Gradient Method with Momentum for Two-Layer Feedforward Neural Networks. IEEE Transactions on Neural Networks, 17, 522-525.