At present, Principal Component Analysis (PCA) is widely used in data processing, image recognition, and even video recognition. In general, QR decomposition and singular value decomposition (SVD) are most commonly used to compute principal components. However, when the data dimension is large, the computational cost of these traditional matrix decomposition methods is very high, and they are not suitable for real applications where data arrive incrementally or online. Adaptive PCA learning algorithms can overcome these problems.
Regarding PCA learning algorithms, the Hebbian learning rule was proposed first by Hebb, but it is unstable and prone to divergence. Oja  added a decay term to the Hebbian rule, yielding Oja's learning rule, which always converges to the principal component of the covariance matrix.
Many optimization methods can also be used to obtain the principal components. Xu  proposed the LMSER PCA algorithm, which extracts principal components based on least mean square error reconstruction. Chatterjee  proposed an unconstrained objective function and derived several new adaptive PCA algorithms using the gradient descent, steepest descent, conjugate gradient, and Newton methods; the results showed that the latter methods converge faster than plain gradient descent. For data containing outliers, Xu  used a derivation based on the Gibbs distribution, adding a coefficient to the learning rate to detect outliers and down-weight them; the resulting method is called the robust PCA algorithm. Song  then redefined the weights using a Cauchy distribution, and Yang  proposed the FRPCA algorithm, which selects the relevant parameters automatically compared with Xu .
All of these PCA learning algorithms are described by stochastic discrete time systems, and convergence is crucial for them to be useful in applications. It is not easy to study the behavior of these algorithms directly. To prove the convergence of Oja's learning algorithm indirectly, the algorithms are traditionally first transformed into deterministic continuous time (DCT) systems  . This transformation is based on a fundamental theorem in the study of stochastic approximation algorithms , which relates the discrete time stochastic model to a DCT formulation. Stochastic approximation theory requires strict assumptions, including that the learning rate converge to 0. However, these conditions are difficult to achieve in practice and degrade the convergence speed  . This makes it challenging to apply the DCT method to analyze the convergence of PCA algorithms in practice. Subsequently, Zufiria  proposed studying Oja's algorithm through a deterministic discrete time (DDT) system, obtained by applying the conditional expectation to Oja's learning algorithm. The DDT system preserves the discrete time nature of the original learning algorithm and can shed some light on the behavior of a constant learning rate. Zhang  showed that when , the convergence speed is fastest. Based on this, Lv  proposed an adaptive non-zero learning rate algorithm and proved its convergence; this line of work is summarized in .
Although Lv  proposed an adaptive non-zero learning rate algorithm and proved its convergence, an inappropriate initial value can cause the iteration to fail. In this paper, we propose a generalized adaptive learning rate (GALR) PCA algorithm that ensures the convergence of the algorithm.
2. Oja’s Algorithm, the DCT Method and the DDT Method
2.1. Oja’s Algorithm
Consider a linear single neuron network with input–output relation
where y(k) is the network output, the input sequence is a zero-mean stationary stochastic process, and is the weight vector. The basic Oja's PCA learning algorithm for updating the weights is described as
for , where is the learning rate. It is very difficult to study the convergence of (2) directly.
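For concreteness, the stochastic update (2) can be sketched numerically. The snippet below implements the textbook form of Oja's rule, y(k) = wᵀ(k)x(k) followed by w(k+1) = w(k) + η y(k)[x(k) − y(k)w(k)], with a small constant learning rate; the synthetic data, the value of η, and all names are illustrative assumptions, not the paper's experimental setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative zero-mean stationary input with a dominant principal direction:
# a diagonal covariance whose top eigenvector is the first coordinate axis.
C = np.diag([4.0, 1.0, 0.5, 0.25])
X = rng.multivariate_normal(np.zeros(4), C, size=5000)

def oja(X, eta=0.005):
    """Oja's rule (2): y = w^T x, then w <- w + eta * y * (x - y * w)."""
    w = np.full(X.shape[1], 0.5)   # illustrative initial weight vector
    for x in X:
        y = w @ x
        w = w + eta * y * (x - y * w)
    return w

w = oja(X)
w_unit = w / np.linalg.norm(w)
print(w_unit)   # close to +-[1, 0, 0, 0], the principal component of C
```

With a constant η the iterate fluctuates around the principal component rather than converging exactly, which is precisely the behavior the DDT analysis below is designed to study.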
2.2. DCT Method
To interpret the convergence of (2) indirectly, the DCT method is traditionally used. It can be described as follows: applying a fundamental stochastic approximation theorem  to (2), a DCT system can be obtained
where denotes the correlation matrix of . The convergence of this DCT system is then used to interpret the convergence of (2). To transform (2) into the previous DCT system, the following restrictive conditions are required:
Condition 1) clearly implies that as . However, in many practical applications, is often set to a small constant because of round-off limitations and speed requirements  . Condition 2) is also difficult to satisfy . Thus, these conditions are unrealistic in practical applications, and the DCT analysis above is not directly applicable  .
2.3. DDT Method
To overcome the problem of the learning rate converging to zero, a DDT method was proposed in . Unlike the DCT method, which transforms (2) into a continuous time deterministic system, the DDT method transforms (2) into a discrete time deterministic system. The DDT system is formulated as follows: applying the conditional expectation operator to (2) and identifying the conditional expected value as the next iterate of the system , a DDT system is obtained as
for , where is the correlation matrix. The DDT method has some obvious advantages. First, it preserves the discrete time nature of the original learning algorithm. Second, it does not need to satisfy the unrealistic conditions of the DCT approach, so it allows the learning rate to be constant. The main purpose of this paper is to study the convergence of (4) when the learning rate is a constant.
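To make the DDT iteration concrete, the following is a minimal numerical sketch of the standard DDT form of Oja's algorithm, w(k+1) = w(k) + η[C w(k) − (wᵀ(k)C w(k))w(k)], with a constant learning rate; the matrix, initial vector, and η below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# A small symmetric nonnegative-definite correlation matrix (illustrative).
C = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 0.5]])

def ddt_oja(C, w0, eta=0.1, iters=500):
    """DDT system: w(k+1) = w(k) + eta * (C w(k) - (w(k)^T C w(k)) w(k))."""
    w = w0.copy()
    for _ in range(iters):
        w = w + eta * (C @ w - (w @ C @ w) * w)
    return w

w = ddt_oja(C, w0=np.array([0.5, 0.5, 0.5]))
evals, evecs = np.linalg.eigh(C)
print(np.linalg.norm(w))        # approx 1: the fixed point is a unit vector
print(abs(w @ evecs[:, -1]))    # approx 1: aligned with the top eigenvector
```

Unlike the stochastic rule, this deterministic iteration converges to the unit eigenvector of the largest eigenvalue even with a constant η, provided η is small enough, which is the property the convergence analysis makes precise.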
 proved the convergence of (4) under certain conditions, and showed that when , the algorithm has the fastest convergence speed. Based on this proof,  proposed an adaptive learning rate PCA algorithm and gave a theoretical proof:
3. The Generalized Adaptive Learning Rate (GALR) PCA Algorithm
In this section, details associated with a convergence analysis of the proposed learning algorithm will be provided systematically.
3.1. GALR PCA Algorithm Formulation
Although  has given the convergence proof of (5) under certain conditions, when data arrive online we cannot obtain the covariance matrix C in advance, which may lead to  and affect the convergence process of the algorithm. We propose a generalized adaptive learning rate (GALR) PCA algorithm
for , where is the correlation matrix and , are the coefficients and I is an identity matrix.
For convenience of analysis, some preliminaries are given next. It is known that the correlation matrix C is a symmetric nonnegative definite matrix, so there exists an orthonormal basis of composed of eigenvectors of C. Let be all the eigenvalues of C, ordered as . Suppose that is the smallest nonzero eigenvalue, i.e., but . Denote by the largest eigenvalue of C, and suppose that its multiplicity is , so that . Suppose that is an orthonormal basis of such that each is a unit eigenvector of C associated with the eigenvalue . Denote by the eigensubspace of the largest eigenvalue , i.e., . Denoting by the subspace perpendicular to , clearly .
Since the vector set is an orthonormal basis of , for each , can be represented as
where are some constants, and
Substituting (7) into (6), we have
for , where .
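The change of coordinates in (7) is simply an expansion of the weight vector in the orthonormal eigenbasis of C. The sketch below, using an illustrative matrix, verifies that the coordinates are obtained by projection onto the eigenvectors and that the vector is recovered exactly from them.

```python
import numpy as np

# Illustrative symmetric nonnegative-definite matrix.
C = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 0.5]])
evals, V = np.linalg.eigh(C)   # columns of V form an orthonormal eigenbasis of C

w = np.array([0.3, -1.2, 0.7])
z = V.T @ w                    # eigenbasis coordinates, z_i = v_i^T w, as in (7)
w_rec = V @ z                  # reconstruction: w = sum_i z_i v_i
print(np.allclose(w, w_rec))   # True
```

Working in these coordinates decouples the iteration component by component, which is what makes the lemmas below tractable.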
Next, a lemma is presented that will be used for the subsequent algorithm analysis.
Lemma 1. It holds that
where and is a constant.
Its proof can be found in .
3.2. Boundedness Analysis
We will analyze the boundedness of (6). The lower and upper bounds will be given in the following theorems.
Lemma 2. Suppose that . If , then is increasing.
Proof. If , it can be checked that
for . Then, from (6), (8), (10), we have
Theorem 1. If , it holds that
for all , where .
Proof. It can be checked that
Similar to the proof of Lemma 2, we have
for . By Lemma 1, it holds that
The above theorem shows that if , the trajectory starting from is lower bounded, then it will never converge to zero.
Theorem 2. For any , there is a constant . The specific expression is
for all . Then is always true.
Proof. We prove it by mathematical induction. For all , holds. Next, we prove that holds as well. We consider two cases:
Case 1: . Then
It follows that
Case 2: . Then
It follows that
By Lemma 1 and Lemma 2, denote
So is always true.
3.3. Global Convergence
Lemma 3. There exist constants and such that
for all , where
Proof. Since , there must exist some such that . Without loss of generality, assume that .
From (10), it follows that
By Theorem 2, has an upper bound for all . Given any , it holds that
for . Then from (25) and (26), for each , it follows that
for all .
Since is bounded, there exists a constant such that for all . Then
for , where
Next, for convenience, denote
for all . Clearly, and for all .
Lemma 4. It holds that
Proof. From (9) and (10), it follows that
Lemma 5. If , it holds that
Proof. From (30), it follows that
for . By Lemma 1, it holds that
Lemma 6. There exists a positive constant such that
Proof. From (31)
for . By Theorem 1 and Lemma 3,
for , where
Lemma 7. If , then there exist constants and such that
for all , where
Proof. From (30), (31) and Lemma 4, it follows that
for . It follows that,
for . By Lemma 5, it holds that
for . Denote
Clearly, if . Then
By Theorem 1 and Lemma 6,
for all .
Lemma 8. Suppose there exist constants and such that
for . Then,
where are constants.
Next, we can prove that each sequence is a Cauchy sequence; see  for the proof. By the Cauchy convergence principle, there exist constants such that
Theorem 3. Suppose that . If , then the trajectory of (6) starting from will converge to a linear combination of with coefficients . This linear combination is the eigenvector corresponding to the largest eigenvalue of the covariance matrix.
Proof. By Lemma 3, there exist constants and such that
for all . By Lemma 7, there exist constants and such that
for all .
Obviously, there exist constants and such that
Using Lemmas 3 and 8, it follows that
When the algorithm converges, we have
Consider the most common case, in which the largest eigenvalue is simple. After the algorithm converges, . We have . Finally, we obtain the largest eigenvector
4. Numerical Experiment
Some examples will be provided in this section to illustrate the proposed theory.
4.1. Offline Data Experiment
The numerical experiment is mainly intended to observe the convergence of the algorithm. We randomly generate a covariance matrix
Using existing methods, we can directly compute that C's maximum eigenvalue is 1.778753 and the maximum eigenvector is [0.341311, 0.579619, 0.411713, 0.336439, 0.203388, 0.472741]. Next, we select three different initial vectors, take , and use as the stopping criterion (that is, during the iteration, when the absolute value of the difference between each element of the eigenvector at step i and the corresponding element at step i−1 is less than , the algorithm is considered to have converged and the iteration is stopped) to observe the convergence of the algorithm.
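The paper's 6×6 covariance matrix and the GALR learning-rate formula are not reproduced in this excerpt. As an illustrative stand-in, the sketch below runs the constant-step DDT update of Oja's algorithm on a 6×6 nonnegative-definite matrix with a known spectrum, using the same element-wise stopping criterion described above, and estimates the largest eigenvalue from the converged vector by a Rayleigh quotient; the matrix, step size, and tolerance are assumptions for illustration.

```python
import numpy as np

# Illustrative stand-in for the paper's 6x6 covariance matrix: a symmetric
# nonnegative-definite matrix with a known spectrum (largest eigenvalue 1.7).
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))       # random orthonormal basis
C = Q @ np.diag([1.7, 1.0, 0.6, 0.4, 0.2, 0.1]) @ Q.T

def iterate_to_convergence(C, w0, eta=0.1, tol=1e-4, max_iters=100000):
    """Constant-step DDT update, stopped when every component moves less than tol."""
    w = w0.copy()
    for k in range(1, max_iters + 1):
        w_new = w + eta * (C @ w - (w @ C @ w) * w)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new, k
        w = w_new
    return w, max_iters

w, iters = iterate_to_convergence(C, rng.random(6))
lam = (w @ C @ w) / (w @ w)   # Rayleigh-quotient estimate of the largest eigenvalue
print(iters, lam)             # lam should be close to 1.7
```

The Rayleigh quotient recovers the eigenvalue estimate reported alongside the eigenvector in the experiments below.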
The first initial vector we randomly generated is [0.5488, 0.7152, 0.6028, 0.5449, 0.4237, 0.6459]. The trajectories of the eigenvector iteration and the learning rate iteration are shown in Figure 1. The algorithm converges after 23 iterations to the maximum eigenvalue 1.778753; the maximum eigenvector from the final iteration is [0.341436, 0.579398, 0.411745, 0.336571, 0.203655, 0.472685], and the learning rate converges to 0.281096. The error from the true maximum eigenvector is less than 0.0001.
The second initial vector we randomly generated is [0.0055, 0.0072, 0.006, 0.0054, 0.0042, 0.0065]. The trajectories of the eigenvector iteration and the learning rate iteration are shown in Figure 2. The algorithm converges after 27 iterations to the maximum eigenvalue 1.778753; the maximum eigenvector from the final iteration is [0.341423, 0.579428, 0.411735, 0.336547, 0.203620, 0.472697], and the learning rate converges to 0.281096. The error from the true maximum eigenvector is again less than 0.0001.
Figure 1. Convergence process of (6) with median initial vector.
Figure 2. Convergence process of (6) with smaller initial vector.
The third initial vector we randomly generated is [1142.75, 1458.86, 1245.25, 1135.28, 904.94, 1327.2]. The trajectories of the eigenvector iteration and the learning rate iteration are shown in Figure 3. The algorithm converges after 35 iterations to the maximum eigenvalue 1.778753; the maximum eigenvector from the final iteration is [0.341414, 0.579436, 0.411738, 0.336548, 0.203616, 0.472693], and the learning rate converges to 0.281096. The error from the true maximum eigenvector is still less than 0.0001.
From the three experiments with different initial values, we can see that the algorithm converges quickly to the maximum eigenvector. Naturally, the closer the initial vector is to the maximum eigenvector, the faster the convergence. When the initial value is very large, the initial learning rate is close to 0, and the vector gradually comes down to convergence. When the initial value is very small, the initial learning rate is very large, so the vector makes a larger jump and then gradually converges. Once in the algorithm is fixed, no matter how the initial vector is chosen, the learning rate after convergence is a fixed value, which is consistent with our theoretical proof.
Although the theoretical proof shows that converges when , many numerical experiments show that the results converge for an arbitrarily chosen initial vector. This is because the probability that is almost 0. We then selected a low-dimensional covariance matrix with integer eigenvectors for an experiment and found that when , does not diverge but converges to another eigenvector. Suppose the multiplicity of the largest eigenvalue (denoted by ) of C is , i.e., ; suppose the multiplicity of the second largest eigenvalue (denoted by ) of C is , i.e., . Denote . When , , will converge to the largest eigenvector of C; when and , will converge to the second largest eigenvector of C.
Figure 3. Convergence process of (6) with larger initial vector.
4.2. Comparison Algorithm (5) and (6)
To compare (5) and (6), we carried out some experiments. We randomly generated an initial vector [571.37, 729.43, 622.63, 567.64, 452.47, 663.6]. The two algorithms use the same . For (6), we use a = 0.001 and b = 0.001, and we still use as the stopping criterion. The results are shown in Figure 4 and Figure 5.
From the results, we can see that algorithm (5) converges after 83 iterations, while algorithm (6) converges after 69 iterations. Algorithm (6) reaches the neighborhood of the true value within the first ten iterations, while algorithm (5) requires more. We found that for a larger initial value, choosing smaller coefficients a, b helps the algorithm converge more quickly.
Figure 4. Convergence of algorithm (5) with larger initial vector.
Figure 5. Convergence of algorithm (6) with larger initial vector and smaller a, b.
4.3. Selection of Parameters a, b
The most significant difference between algorithm (6) and algorithm (5) is the two newly introduced parameters a and b. Next, we use numerical experiments to examine the influence of a and b on the convergence process.
We take [0.5488, 0.7152, 0.6028, 0.5449, 0.4237, 0.6459] as the initial vector, choose , and vary a, b to observe the convergence of algorithm (6). The number of iterations required by algorithm (6) under different parameters is shown in Table 1.
We can see that when one parameter is much larger than the other, the smaller parameter loses its effect; the two parameters must be kept in the same order of magnitude. This is because the covariance matrix C and the identity matrix I are of the same order of magnitude in this experiment: if one coefficient is much larger, the effect of the other is diminished.
From Table 1, we can see that the convergence speed is fastest when . Taking other values in this interval, the convergence rate hardly changes. This shows that the algorithm is not sensitive to changes of a and b within the same order of magnitude.
Using [0.0055, 0.0072, 0.006, 0.0054, 0.0042, 0.0065] as the initial vector, we found that the convergence speed is fastest when . Then, using [59.3932, 74.367, 64.2487, 59.0395, 48.1289, 68.1305] as the initial vector, we found that the convergence speed is fastest when . This shows that for a larger initial vector we should choose smaller a, b, and vice versa. All three cases satisfy . That is, when a and b satisfy , algorithm (6) reaches its fastest convergence speed.
4.4. Online Data Experiment
Next, we consider the case of online data. Suppose the input sequence is a zero-mean stationary stochastic process. Because the data arrive one by one online, we cannot obtain the covariance matrix C in advance. According to , C can be expressed as
where is a coefficient: when comes from a stationary process, , and otherwise . By the law of large numbers, converges with probability 1.
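The recursive estimate above can be sketched as a running average of outer products. The snippet below uses the common form C(k) = (1 − α_k)C(k−1) + α_k x(k)x(k)ᵀ with α_k = 1/k, which for stationary data reduces to the plain sample average and converges to the true correlation matrix; the data and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
true_C = np.array([[2.0, 0.5],
                   [0.5, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], true_C, size=20000)

# Online estimate: C_k = (1 - alpha_k) C_{k-1} + alpha_k x_k x_k^T.
# With alpha_k = 1/k (stationary case), C_k is the plain sample average
# of x x^T, which converges to true_C with probability 1.
C_hat = np.zeros((2, 2))
for k, x in enumerate(X, start=1):
    alpha = 1.0 / k
    C_hat = (1 - alpha) * C_hat + alpha * np.outer(x, x)

print(C_hat)   # close to true_C
```

For nonstationary data, a constant α instead of 1/k gives an exponentially weighted estimate that tracks a drifting correlation matrix, at the cost of never converging exactly.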
We randomly generate data and add a new data point at every iteration. We still use as the stopping criterion and find that different initial vectors and different affect the convergence speed of the algorithm. Experiments with different initial values behave just as in the offline case.
We studied the influence of the parameter on the convergence of the algorithm, using the same initial value. The result is shown in Figure 6: the smaller is, the faster the algorithm converges.
Figure 6. Convergence of algorithms with different .
Table 1. Number of iterations with different a, b.
5. Conclusion
In this paper, we proposed a generalized adaptive learning rate (GALR) PCA algorithm, which guarantees that the convergence process is not affected by the choice of the initial value. Using the DDT method, we gave upper and lower bounds for the algorithm and proved its global convergence; the learning rate is no longer required to tend to zero. Numerical experiments verified the theory, and we discussed the relationship between the initial vector and the parameters a, b. The algorithm is effective for both online and offline data.
 Xu, L. and Yuille, A.L. (1995) Robust Principal Component Analysis by Self-Organizing Rules Based on Statistical Physics Approach. IEEE Transactions on Neural Networks, 6, 131-143.
 Wang, S., Liang, Y.L. and Ma, F. (1998) An Adaptive Robust PCA Neural Network. The 1998 IEEE International Joint Conference on Neural Networks Proceedings, Vol. 3, 2288-2293.
 Chen, T.P., Hua, Y.B. and Yan, W.-Y. (1998) Global Convergence of Oja’s Subspace Algorithm for Principal Component Extraction. IEEE Transactions on Neural Networks, 9, 58-67.
 Zhang, Q. and Bao, Z. (1995) Dynamical Systems for Computing the Eigenvectors Associated with the Largest Eigenvalue of a Positive Definite Matrix. IEEE Transactions on Neural Networks, 6, 790-791.
 Zhang, Q. and Leung, Y.-W. (2000) A Class of Learning Algorithms for Principal Component Analysis and Minor Component Analysis. IEEE Transactions on Neural Networks, 11, 529-533.
 Yi, Z., Ye, M., Lv, J.C. and Tan, K.K. (2005) Convergence Analysis of a Deterministic Discrete Time System of Oja’s PCA Learning Algorithm. IEEE Transactions on Neural Networks, 16, 1318-1328.
 Jian, C.L., Zhang, Y. and Tan, K.K. (2006) Global Convergence of Oja’s PCA Learning Algorithm with a Non-Zero-Approaching Adaptive Learning Rate. Theoretical Computer Science, 367, 286-307.