Back
 APM  Vol.11 No.5 , May 2021
Convergence Analysis of a Kind of Deterministic Discrete-Time PCA Algorithm
Abstract: We proposed a generalized adaptive learning rate (GALR) PCA algorithm, which could be guaranteed that the algorithm’s convergence process would not be affected by the selection of the initial value. Using the deterministic discrete time (DDT) method, we gave the upper and lower bounds of the algorithm and proved the global convergence. Numerical experiments had also verified our theory, and the algorithm is effective for both online and offline data. We found that choosing different initial vectors will affect the convergence speed, and the initial vector could converge to the second or third eigenvectors by satisfying some exceptional conditions.

1. Introduction

At present, Principal Component Analysis is widely used in data processing, image recognition and even video recognition [1]. In general, QR decomposition and SVD decomposition are most commonly used when calculating principal components. However, when the data dimension is large, the calculation cost of the traditional decomposition matrix method is very high. And the approach is not suitable for real applications where data come incrementally or in the on-line way. The PCA algorithm can overcome these problems.

Regarding the PCA algorithm, the Hebbian learning rules were first proposed by Hebbian, but the Hebbian rules are unstable and easy to diverge. Oja [2] added a delayed item under the Hebbian learning rule premise, called Oja’s learning rule. The Oja’s rule could always converge to the principal component of the covariance matrix.

Many optimization methods could also obtain the principal components. Xu [3] proposed the LMSER PCA algorithm to extract the principal components based on the least mean square error reconstruction. Chatterjee [4] proposed an unconstrained objective function and obtained various new PCA adaptive algorithms using gradient descent method, steepest descent method, conjugate gradient method, and Newton method. The results showed that other methods converge faster than the gradient descent method. When there are outliers in the data, Xu [5] used a Gibbs distribution derivation, added a coefficient to the learning rate to detect outliers, and adjusted their weights called the robust PCA algorithm. Then Song [6] used a Cauchy distribution to redefine the weights. Yang [7] proposed the FRPCA algorithm, which automatically selects relevant parameters compared to Xu [5].

All of these PCA learning algorithms are described by stochastic discrete time systems, and convergence is crucial for these algorithms to be useful in applications. It is not easy to study the behavior of these PCA algorithms directly. In order to indirectly prove the convergence of the Oja’s learning algorithm, traditionally, the algorithms are first transformed into some deterministic continuous time (DCT) systems [8] [9] [10]. This transformation is based on a fundamental theorem for the study of the stochastic approximation algorithm [10], relating the discrete time stochastic model to a DCT formulation. The stochastic approximation theory requires strict assumptions, including the algorithm’s learning rate to converge to 0. However, these conditions are difficult to achieve in practice and affect convergence speed [11] [12]. This makes it challenging to apply the DCT method to analyze the PCA algorithm’s convergence in practice. Subsequently, Zufiria [12] proposed to study Oja’s algorithm through a deterministic discrete time (DDT) system. This DDT system is obtained by applying conditional expectation to Oja’s learning algorithm. It preserves the original learning algorithm’s discrete time nature and can shed some light on the characteristics of the constant learning rate. Zhang [13] proposed that when η = 0.618 λ 1 , the convergence speed is the fastest. Based on this, Lv [14] proposed an adaptive non-zero learning rate algorithm and proved its convergence. [15] summarized this.

Although Lv [14] proposed an adaptive non-zero learning rate algorithm and proved its convergence. When we choose an inappropriate initial value will cause the algorithm to fail to iterate. This paper we proposed a generalized adaptive learning rate (GALR) PCA algorithm which can ensure the convergence of the algorithm.

2. Oja’s Algorithm, the DCT Method and the DDT Method

2.1. Oja’s Algorithm

Consider a linear single neuron network with input–output relation

y ( k ) = w T ( k ) x ( k ) , ( k = 0 , 1 , 2 , ) (1)

where y(k) is the network output, the input sequence { x ( k ) | x ( k ) R n ( k = 0 , 1 , 2 , ) } is a zero mean stationary stochastic process, each w ( k ) R n ( k = 0 , 1 , 2 , ) is a weight vector. The basic Oja’s PCA learning algorithm for weight updating is described as

w ( k + 1 ) = w ( k ) + η y ( k ) [ x ( k ) y ( k ) w ( k ) ] (2)

for k 0 , where η ( k ) > 0 is the learning rate. It is very difficult to study the convergence of (2) directly.

2.2. DCT Method

To indirectly interpret the convergence of (2), traditionally, the DCT method is used. This method can be described as follows: applying a fundamental stochastic approximation theorem [8] to (2), a DCT system can be obtained

d w ( t ) d t = C w ( t ) w T C w ( t ) w ( t ) , t 0 (3)

where C = E [ x ( k ) x T ( k ) ] denotes the correlation matrix of { x ( k ) | x ( k ) R n ( k = 0 , 1 , 2 , ) } . The convergence of this DCT system is then used to interpret the convergence of (2). To transform (2) into the previous DCT system, the following restrictive conditions are required:

1) k = 1 + η 2 ( k ) < + ;

2) sup k E { | y ( k ) [ x ( k ) y ( k ) w ( k ) ] | 2 } < + .

By the condition 1), clearly, it implies that η ( k ) 0 as k + . However, in many practical applications, η ( k ) is often set to a small constant due to the round off limitation and speed requirement [11] [12]. The condition 2) is also difficult to be satisfied [12]. Thus, these conditions are unrealistic in practical applications and the previous DCT analysis is not directly applicable [11] [12].

2.3. DDT Method

To overcome the problem of learning rate converging toward zero, a DDT method is proposed in [12]. Unlike the DCT method to transform (2) into a continuous time deterministic system, the DDT method transform (2) into a discrete time deterministic system. The DDT system can be formulated as follows. Applying the conditional expectation operator E { w ( k + 1 ) / w ( 0 ) , x ( i ) , i < n } to (2) and identifying the conditional expected value as the next iterate in the system [11], a DDT system can be obtained and given as

w ( k + 1 ) = w ( k ) + η [ C w ( k ) w T ( k ) C w ( k ) w ( k ) ] (4)

for k 0 , where C = E [ x ( k ) x T ( k ) ] is the correlation matrix. The DDT method has some obvious advantages. First, it preserves the discrete time nature as that of the original learning algorithm. Second, it is not necessary to satisfy some unrealistic conditions of the DCT approach. Thus, it allows the learning rate to be constant. The main propose of this paper is to study the convergence of (4) subject to the learning rate η being some constant.

[13] proved the convergence of (4) when η , λ 1 , w ( 0 ) satisfied certain conditions. And it is proved that when η = 0.618 λ 1 , the algorithm has the fastest convergence speed. According to the proof of [13] [14] proposed an adaptive learning rate PCA algorithm and gave a theoretical proof:

w ( k + 1 ) = w ( k ) + ξ w T ( k ) C w ( k ) ( C w ( k ) w T ( k ) C w ( k ) w ( k ) ) (5)

3. The Generalized Adaptive Learning Rate (GALR) PCA Algorithm

In this section, details associated with a convergence analysis of the proposed learning algorithm will be provided systematically.

3.1. GALR PCA Algorithm Formulation

Although [14] has given the convergence proof of (5) under certain conditions, when the data is online, we cannot obtain the covariance matrix C in advance, which may lead to w T ( 0 ) C w ( 0 ) = 0 and affect the convergence process of the algorithm. We proposed a generalized adaptive learning rate (GALA) PCA algorithm

w ( k + 1 ) = w ( k ) + ξ w T ( k ) A w ( k ) ( C w ( k ) w T ( k ) A w ( k ) w ( k ) ) (6)

for k 0 , where C = E [ x ( k ) x T ( k ) ] , 0 < ξ < 0.8 is the correlation matrix and A = a C + b I , a , b are the coefficients and I is an identity matrix.

For convenience of analysis, next, some preliminaries are given. It is known that the correlation matrix C is a symmetric nonnegative definite matrix. There exists an orthonormal basis of R n composed by eigenvectors of C. Let λ 1 , , λ n be all the eigenvalues of C order by λ 1 λ n 0 , Suppose that λ p is the smallest nonzero eigenvalue, i.e., λ p > 0 but λ j = 0 ( j = p + 1 , , n ) . Denote by σ , the largest eigenvalue of C. Suppose that the multiplicity of σ is m ( 1 m p n ) , then σ = λ 1 = = λ m . Suppose that { v i | i = 1 , , n } is an orthonormal basis in R n such that each v i is a unit eigenvector of C associated with the eigenvalue λ i . Denote by V σ the eigensubspace of the largest eigenvalue σ , i.e. V σ = s p a n { v 1 , , v m } Denoting by V σ T the subspace which is perpendicular to V σ , clearly V σ T = s p a n { v m + 1 , , v n } .

Since the vector set { v 1 , , v n } is an orthonormal basis of R n , for each k 0 , w ( k ) R n can be represented as

w ( k ) = i = 1 n z i ( k ) v i (7)

where z i ( k ) ( i = 1 , , n ) are some constants, and

C w ( k ) = i = 1 n λ i z i ( k ) v i (8)

w T ( k ) A w ( k ) = a i = 1 n λ i z i 2 ( k ) + b i = 1 n z i 2 ( k ) = i = 1 n ( a λ i + b ) z i 2 ( k ) (9)

Substitute (7) to (6), we have

z i ( k + 1 ) = [ 1 + ξ w T ( k ) A w ( k ) ( λ i w T ( k ) A w ( k ) ) ] z i ( k ) , ( i = 1 , , n ) (10)

for k 0 , where 0 < ξ < 0.8 .

Next, a lemma is presented that will be used for the subsequent algorithm analysis.

Lemma 1. It holds that

{ [ 1 + ξ ( σ s 1 ) ] 2 s 4 ( 1 ξ ) ξ σ , s > 0 [ 1 + ξ ( σ s 1 ) ] 2 s max { σ , [ 1 + ξ ( σ s 1 ) ] 2 s } , s [ s , σ ]

where 0 < ξ < 0.8 and s * > 0 is a constant.

Its proof can be found in [14].

3.2. Boundedness

We will analyze the boundedness of (6). The lower and upper bounds will be given in the following theorems.

Lemma 2. Suppose that 0 < ξ < 0.8 , if 0 < w T ( k ) A w ( k ) < λ p then w T ( k ) A w ( k ) is increasing.

Proof. If w T ( k ) A w ( k ) < λ p it can be checked that

[ 1 + ξ ( λ i w T ( k ) A w ( k ) 1 ) ] 2 [ 1 + ξ ( λ p w T ( k ) A w ( k ) 1 ) ] 2 1 (11)

for k 0 . Then, from (6), (8), (10), we have

w T ( k + 1 ) A w ( k + 1 ) = i = 1 n ( a λ i + b ) z i 2 ( k + 1 )               = i = 1 n [ 1 + ξ ( λ i w T ( k ) A w ( k ) 1 ) ] 2 ( a λ i + b ) z i 2 ( k )               = i = 1 p [ 1 + ξ ( λ i w T ( k ) A w ( k ) 1 ) ] 2 ( a λ i + b ) z i 2 ( k )               i = 1 p [ 1 + ξ ( λ p w T ( k ) A w ( k ) 1 ) ] 2 ( a λ i + b ) z i 2 ( k )               [ 1 + ξ ( λ p w T ( k ) A w ( k ) 1 ) ] 2 w T ( k ) A w ( k )               w T ( k ) A w ( k ) (12)

Theorem 1. If w T ( 0 ) A w ( 0 ) > 0 , it holds that

w T ( k + 1 ) A w ( k + 1 ) 4 ( 1 ξ ) ξ λ p > 0 (13)

for all k 0 , where 0 < ξ < 0.8 .

Proof. It can be checked that

[ 1 + ξ ( λ i w T ( k ) A w ( k ) 1 ) ] 2 [ 1 + ξ ( λ p w T ( k ) A w ( k ) 1 ) ] 2 0 (14)

Similar to the proof of Lemma 2, we have

w T ( k + 1 ) A w ( k + 1 ) [ 1 + ξ ( λ p w T ( k ) A w ( k ) 1 ) ] 2 w T ( k ) A w ( k ) min s > 0 { [ 1 + ξ ( λ p s 1 ) ] 2 s } (15)

for k 0 . By Lemma 1, it holds that

w T ( k + 1 ) A w ( k + 1 ) 4 ( 1 ξ ) ξ λ p > 0

The above theorem shows that if w T ( 0 ) A w ( 0 ) 0 , the trajectory starting from w ( 0 ) is lower bounded, then it will never converge to zero.

Theorem 2. For any w ( 0 ) , there is a constant H ( w ( 0 ) ) . The specific expression is

H ( w ( 0 ) ) = { σ , [ 1 + ξ ( σ w T ( 0 ) A w ( 0 ) 1 ) ] 2 w T ( 0 ) A w ( 0 ) } (16)

for all k 0 . Then w T ( k ) A w ( k ) H ( w ( 0 ) ) is always true.

Proof. Let’s proved it by mathematical induction. For all k 0 , 0 < w T ( k ) A w ( k ) H ( w ( 0 ) ) holds. Next, we proved 0 < w T ( k + 1 ) A w ( k + 1 ) H ( w ( 0 ) ) holds too. Discuss in two situations:

Case 1: σ w T ( k ) A w ( k ) H ( w ( 0 ) ) . Then

[ 1 + ξ ( λ i w T ( k ) A w ( k ) 1 ) ] 2 1 , ( i = 1 , , p ) (17)

It follows that

w T ( k + 1 ) A w ( k + 1 ) = i = 1 p [ 1 + ξ ( λ i w T ( k ) A w ( k ) 1 ) ] 2 ( a λ i + b ) z i 2 ( k ) i = 1 p ( a λ i + b ) z i 2 ( k ) = w T ( k ) A w ( k ) H ( w ( 0 ) ) (18)

Case 2: 0 < w T ( k ) A w ( k ) σ . Then

[ 1 + ξ ( λ i w T ( k ) A w ( k ) 1 ) ] 2 [ 1 + ξ ( σ w T ( k ) A w ( k ) 1 ) ] 2 (19)

It follows that

w T ( k + 1 ) A w ( k + 1 ) = i = 1 p [ 1 + ξ ( λ i w T ( k ) A w ( k ) 1 ) ] 2 ( a λ i + b ) z i 2 ( k )               i = 1 p [ 1 + ξ ( σ w T ( k ) A w ( k ) 1 ) ] 2 ( a λ i + b ) z i 2 ( k )              = [ 1 + ξ ( σ w T ( k ) A w ( k ) 1 ) ] 2 w T ( k ) A w ( k ) max 0 < s σ { [ 1 + ξ ( σ s 1 ) ] 2 s } (20)

By lemma 1 and lemma 2, denote s = ξ σ 1 ξ

When s > σ

w T ( k + 1 ) A w ( k + 1 ) [ 1 + ξ ( σ w T ( 0 ) A w ( 0 ) 1 ) ] 2 w T ( 0 ) A w ( 0 ) (21)

When s < σ

w T ( k + 1 ) A w ( k + 1 ) { σ , s < s < σ [ 1 + ξ ( σ w T ( 0 ) A w ( 0 ) 1 ) ] 2 w T ( 0 ) A w ( 0 ) , 0 < s < s (22)

So w T ( k + 1 ) A w ( k + 1 ) H ( w ( 0 ) ) is always true.

3.3. Global Convergence

Lemma 3. If w ( 0 ) V σ there exists constants θ 1 > 0 and Π 1 0 such that

j = m + 1 n z j 2 ( k ) Π 1 e θ 1 k (23)

for all k 0 , where

θ 1 = ln ( ξ σ + ( 1 ξ ) H ( w ( 0 ) ) ξ λ m + 1 + ( 1 ξ ) H ( w ( 0 ) ) ) 2 > 0 (24)

Proof. Since w ( 0 ) V σ there must exist some i ( 1 i m ) such that z i ( 0 ) 0 . Without loss of generality, assume that z 1 ( 0 ) 0 .

From (10), it follows that

z i ( k + 1 ) = [ 1 + ξ ( σ w T ( k ) A w ( k ) 1 ) ] z i ( k ) , ( i = 1 , , m ) (25)

z j ( k + 1 ) = [ 1 + ξ ( λ j w T ( k ) A w ( k ) 1 ) ] z j ( k ) , ( j = m + 1 , , n ) (26)

By Theorem 2, w T ( k ) A w ( k ) has an upper bound for all k 0 . Given any i , ( i = 1 , , n ) , it holds that

[ 1 + ξ ( λ i w T ( k ) A w ( k ) 1 ) ] = 1 ξ + ξ ( λ i w T ( k ) A w ( k ) ) ( 1 ξ ) > 0 (27)

for k 1 . Then from (25) and (26), for each j ( j = m + 1 , , n ) , it follows that

[ z j ( k + 1 ) z 1 ( k + 1 ) ] 2 = [ 1 + ξ ( λ j w T ( k ) A w ( k ) 1 ) 1 + ξ ( σ w T ( k ) A w ( k ) 1 ) ] 2 [ z j ( k ) z 1 ( k ) ] 2 = [ w T ( k ) A w ( k ) + ξ ( λ j w T ( k ) A w ( k ) ) w T ( k ) A w ( k ) + ξ ( σ w T ( k ) A w ( k ) ) ] 2 [ z j ( k ) z 1 ( k ) ] 2 [ ξ λ j + ( 1 ξ ) H ( w ( 0 ) ) ξ σ + ( 1 ξ ) H ( w ( 0 ) ) ] 2 [ z j ( k ) z 1 ( k ) ] 2 [ ξ λ m + 1 + ( 1 ξ ) H ( w ( 0 ) ) ξ σ + ( 1 ξ ) H ( w ( 0 ) ) ] 2 [ z j ( k ) z 1 ( k ) ] 2 = [ z j ( 0 ) z 1 ( 0 ) ] 2 e θ 1 ( k + 1 ) (28)

for all k 1 .

Since w T ( k ) A w ( k ) is bounded, there exists a constant d > 0 such that z 1 2 ( k ) d for all k 0 . Then

j = m + 1 n z j 2 ( k ) = j = m + 1 n [ z j ( k ) z 1 ( k ) ] 2 z 1 2 ( k ) Π 1 e θ 1 k (29)

for k 1 , where Π 1 = d j = m + 1 n [ z j ( 0 ) z 1 ( 0 ) ] 2 0

Next, for convenience, denote

P ( k ) = [ 1 + ξ ( σ w T ( k ) A w ( k ) 1 ) ] 2 w T ( k ) A w ( k ) (30)

Q ( k ) = i = m + 1 n ( a λ i + b ) [ 2 ( 1 ξ ) + ξ ( λ i + σ ) w T ( k ) A w ( k ) ] [ ξ ( σ λ i ) w T ( k ) A w ( k ) ] z i 2 ( k ) (31)

for all k 0 . Clearly, P ( k ) 0 and Q ( k ) 0 for all k 0 .

Lemma 4. It holds that

w T ( k + 1 ) A w ( k + 1 ) = P ( k ) Q ( k ) (32)

for k 0 .

Proof. From (9) and (10), it follows that

w T ( k + 1 ) A w ( k + 1 ) = i = 1 n [ 1 + ξ ( λ i w T ( k ) A w ( k ) 1 ) ] 2 ( a λ i + b ) z i 2 ( k ) = i = 1 n [ 1 + ξ ( σ w T ( k ) A w ( k ) 1 ) ] 2 ( a λ i + b ) z i 2 ( k ) i = m + 1 n ( a λ i + b ) [ 2 ( 1 ξ ) + ξ ( λ i + σ ) w T ( k ) A w ( k ) ] [ ξ ( σ λ i ) w T ( k ) A w ( k ) ] z i 2 ( k ) = [ 1 + ξ ( σ w T ( k ) A w ( k ) 1 ) ] 2 w T ( k ) A w ( k ) i = m + 1 n ( a λ i + b ) [ 2 ( 1 ξ ) + ξ ( λ i + σ ) w T ( k ) A w ( k ) ] [ ξ ( σ λ i ) w T ( k ) A w ( k ) ] z i 2 ( k ) = P ( k ) Q ( k ) (33)

Lemma 5. If w ( 0 ) V σ , it holds that

P ( k 1 ) 4 ( 1 ξ ) ξ σ (34)

Proof. From (30), it follows that

P ( k 1 ) = [ 1 + ξ ( σ w T ( k 1 ) A w ( k 1 ) 1 ) ] 2 w T ( k 1 ) A w ( k 1 ) min s > 0 { [ 1 + ξ ( σ s 1 ) ] 2 s } (35)

for k 1 . By Lemma 1, it holds that

P ( k 1 ) 4 ( 1 ξ ) ξ σ , k 1

Lemma 6. There exists a positive constant Π 2 > 0 such that

Q ( k ) Π 2 e θ 1 k (36)

for k 0 .

Proof. From (31)

Q ( k ) = i = m + 1 n ( a λ i + b ) [ 2 ( 1 ξ ) + ξ ( λ i + σ ) w T ( k ) A w ( k ) ] [ ξ ( σ λ i ) w T ( k ) A w ( k ) ] z i 2 ( k ) i = m + 1 n ( a λ i + b ) [ 2 + 2 ξ σ w T ( k ) A w ( k ) ] [ ξ σ w T ( k ) A w ( k ) ] z i 2 ( k ) 2 [ 1 + ξ σ w T ( k ) A w ( k ) ] [ ξ σ w T ( k ) A w ( k ) ] i = m + 1 n ( a λ i + b ) z i 2 ( k ) (37)

for k 0 . By Theorem1 and Lemma 3,

Q ( k ) 2 [ 1 + ξ σ 4 ( 1 ξ ) ξ λ p ] [ ξ σ 4 ( 1 ξ ) ξ λ p ] i = m + 1 n ( a λ i + b ) z i 2 ( k ) 2 Π 1 [ 1 + σ 4 ( 1 ξ ) λ p ] [ σ 4 ( 1 ξ ) λ p ] e θ 1 k Π 2 e θ 1 k (38)

for k 0 , where

Π 2 = 2 Π 1 [ 1 + σ 4 ( 1 ξ ) λ p ] [ σ 4 ( 1 ξ ) λ p ]

Lemma 7. If w ( 0 ) V σ , then there exist constants θ 2 > 0 and Π 5 > 0 such that

| σ w T ( k + 1 ) A w ( k + 1 ) | k Π 5 [ { e θ 2 ( k + 1 ) + max { e θ 2 k , e θ 1 k } } ] (39)

for all k 1 , where

{ θ 2 = ln δ δ = max { ( 1 ξ ) 2 , ξ 4 ( 1 ξ ) } (40)

and 0 < ξ < 0.8 , 0 < δ < 1 .

Proof. From (30), (31) and Lemma 4, it follows that

σ w T ( k + 1 ) A w ( k + 1 ) = σ P ( k ) + Q ( k ) = σ [ 1 + ξ ( σ w T ( k ) A w ( k ) 1 ) ] 2 w T ( k ) A w ( k ) + Q ( k ) = ( σ w T ( k ) A w ( k ) ) [ ( 1 ξ ) 2 ξ 2 σ w T ( k ) A w ( k ) ] + Q ( k ) = ( σ w T ( k ) A w ( k ) ) [ ( 1 ξ ) 2 ξ 2 σ P ( k 1 ) Q ( k 1 ) ] + Q ( k ) = ( σ w T ( k ) A w ( k ) ) [ ( 1 ξ ) 2 ξ 2 σ P ( k 1 ) ]

( σ w T ( k ) A w ( k ) ) ξ 2 σ Q ( k 1 ) P ( k 1 ) ( P ( k 1 ) Q ( k 1 ) ) + Q ( k ) = ( σ w T ( k ) A w ( k ) ) [ ( 1 ξ ) 2 ξ 2 σ P ( k 1 ) ] ( σ w T ( k ) A w ( k ) ) ξ 2 σ P ( k 1 ) w T ( k ) A w ( k ) Q ( k 1 ) + Q ( k ) = ( σ w T ( k ) A w ( k ) ) [ ( 1 ξ ) 2 ξ 2 σ P ( k 1 ) ] ξ 2 σ [ σ w T ( k ) A w ( k ) 1 ] Q ( k 1 ) P ( k 1 ) + Q ( k ) (41)

for k 1 .

Denote

V ( k ) = | σ w T ( k ) A w ( k ) | (42)

for k 1 . It follows that,

V ( k + 1 ) V ( k ) | ( 1 ξ ) 2 ξ 2 σ P ( k 1 ) | + ξ 2 σ [ σ w T ( k ) A w ( k ) + 1 ] Q ( k 1 ) P ( k 1 ) + Q ( k ) max { ( 1 ξ ) 2 , ξ 2 σ P ( k 1 ) } V ( k ) + ξ 2 σ [ σ w T ( k ) A w ( k ) + 1 ] Q ( k 1 ) P ( k 1 ) + Q ( k ) (43)

for k 1 . By Lemma5, it holds that

V ( k + 1 ) max { ( 1 ξ ) 2 , ξ 4 ( 1 ξ ) } V ( k ) + ξ 4 ( 1 ξ ) [ σ w T ( k ) A w ( k ) + 1 ] Q ( k 1 ) + Q ( k ) (44)

for k 1 . Denote

δ = max { ( 1 ξ ) 2 , ξ 4 ( 1 ξ ) } (45)

Clearly, if 0 < ξ < 0.8 , 0 < δ < 1 . Then

V ( k + 1 ) δ V ( k ) + ξ 4 ( 1 ξ ) [ σ w T ( k ) A w ( k ) + 1 ] Q ( k 1 ) + Q ( k ) (46)

Denote

Π 3 = ξ 4 ( 1 ξ ) [ σ 4 ( 1 ξ ) ξ λ p + 1 ] (47)

By Theorem 1 and Lemma 6,

V ( k + 1 ) δ V ( k ) + Π 3 Q ( k 1 ) + Q ( k ) δ V ( k ) + Π 4 e θ 1 k (48)

where

Π 4 = Π 2 ( Π 3 e θ 1 + 1 ) (49)

Denote.

{ θ 2 = ln δ δ = max { ( 1 ξ ) 2 , ξ 4 ( 1 ξ ) } (50)

Then,

V ( k + 1 ) δ k + 1 V ( 0 ) + Π 4 r = 0 k δ r e θ 1 ( k r ) δ k + 1 V ( 0 ) + k Π 4 max { δ k , e θ 1 k } k Π 5 [ e θ 2 ( k + 1 ) + max { e θ 2 k , e θ 1 k } ] (51)

for all k 1 .

Lemma 8. Suppose there exists constants θ > 0 and Π > 0 such that

ξ w T ( k ) A w ( k ) | ( σ w T ( k ) A w ( k ) ) z i ( k ) | k Π e θ k , ( i = 1 , , m ) (52)

for k 1 . Then,

lim k z i ( k ) = z i , ( i = 1 , , m ) (53)

where z i ( i = 1 , , m ) are constants.

Proof. When z i ( k ) 1

ξ w T ( k ) A w ( k ) | ( σ w T ( k ) A w ( k ) ) z i ( k ) | ξ w T ( k ) A w ( k ) | ( σ w T ( k ) A w ( k ) ) | k Π e θ k (54)

when z i ( k ) > 1

ξ w T ( k ) A w ( k ) | ( σ w T ( k ) A w ( k ) ) z i ( k ) | ξ z i ( k ) i = 1 m ( a σ + b ) z i ( k ) 2 | ( σ w T ( k ) A w ( k ) ) | ξ ( a σ + b ) | ( σ w T ( k ) A w ( k ) ) | k Π e θ k (55)

Next, we can prove that each sequence { z i ( k ) } is a Cauchy sequence. See [14] for the proof. By Cauchy Convergence Principle, there exist constants z i ( i = 1 , , m ) such that

lim k z i ( k ) = z i , ( i = 1 , , m )

Theorem 3. Suppose that 0 < ξ < 0.8 , if w ( 0 ) V σ then the trajectory of (4) starting from w ( 0 ) will converge to a linear combination of σ and coefficients a , b . This linear combination is the eigenvector corresponding to the largest eigenvalue of its covariance matrix.

Proof. By Lemma 3, there exists constants θ 1 > 0 and Π 1 0 such that

j = m + 1 n z j 2 ( k ) Π 1 e θ 1 k (56)

for all k 0 . By Lemma 7, there exists constants θ 2 > 0 and Π 2 > 0 such that

| σ w T ( k + 1 ) A w ( k + 1 ) | k Π 2 [ e θ 2 ( k + 1 ) + max { e θ 2 k , e θ 1 k } ] (57)

for all k 0 .

Obviously, there exists constants θ > 0 and Π > 0 such that

ξ w T ( k ) A w ( k ) | ( σ w T ( k ) A w ( k ) ) z i ( k ) | k Π e θ k (58)

for k 0 and i ( i = 1 , , m )

Using Lemmas 3 and 8, it follows that

{ lim k z i ( k ) = z i , ( i = 1 , , m ) lim k z i ( k ) = 0 , ( i = m + 1 , , n ) (59)

Then,

lim k w ( k ) = i = 1 m z i v i V σ (60)

When the algorithm converges, we have

lim k C w ( k ) = lim k w T ( k ) A w ( k ) w ( k ) (61)

i = 1 m σ z i v i = i = 1 m ( a σ + b ) ( z i ) 2 i = 1 m z i v i (62)

i = 1 m ( z i ) 2 = σ a σ + b (63)

Consider the most common case where the largest eigenvalue has only one. After the algorithm converges, ( z 1 ) 2 = σ / ( a σ + b ) . We have w = [ σ / ( a σ + b ) ] 1 / 2 v . Finally, we can get the largest eigenvector

v = w / [ σ / ( a σ + b ) ] 1 / 2 = w / [ w T A w / ( a w T A w + b ) ] 1 / 2

4. Numerical Experiment

Some examples will be provided in this section to illustrate the proposed theory.

4.1. Offline Data Experiment

The numerical experiment is mainly to observe the convergence of the algorithm, and we randomly generate a covariance matrix

C = [ 1.090719 0.154061 0.109432 0.089424 0.05406 0.125653 0.154061 1.261628 0.185839 0.151862 0.091805 0.213386 0.109432 0.185839 1.132004 0.10787 0.065211 0.151571 0.089424 0.151862 0.10787 1.088148 0.053288 0.12386 0.05406 0.091805 0.065211 0.053288 1.032214 0.074877 0.125653 0.213386 0.151571 0.12386 0.074877 1.174038 ]

According to the existing related methods, we can directly find C’s maximum eigenvalue is 1.778753, and the maximum eigenvector is [0.341311, 0.579619, 0.411713 0.336439, 0.203388, 0.472741]. Next, we select three different initial vectors, take ξ = 0.5 , a = 0.5 , b = 0.5 and use ε = 0.0001 as the stopping criterion (That is, in the iterative process of the algorithm, when the absolute value of each element of the i-th eigenvector minus each element of the i-1-th step is less than ε , the algorithm is considered to have converged and the iteration is stopped) to observe the convergence of the algorithm.

The first initial vector we randomly generated is [0.5488, 0.7152, 0.6028, 0.5449, 0.4237, 0.6459]. The trajectory of the eigenvector iteration and the learning rate iteration trajectory is shown in Figure 1. The algorithm converged after 23 iterations, w T A w converged to the maximum eigenvalue 1.778753, and the maximum eigenvector from the final iteration was [0.341436, 0.579398, 0.411745, 0.336571, 0.203655, 0.472685], and the learning rate converged to 0.281096. It can be seen that the error from the true maximum eigenvector is less than 0.0001.

The second initial vector we randomly generated is [0.0055, 0.0072, 0.006, 0.0054, 0.0042, 0.0065]. The trajectory of the eigenvector iteration and the learning rate iteration trajectory are shown in Figure 2. The algorithm converges after 27 iterations, w T A w converges to the maximum eigenvalue 1.778753, and the maximum eigenvector from the final iteration is [0.341423, 0.579428, 0.411735, 0.336547, 0.203620, 0.472697], and the learning rate converges to 0.281096. It can be seen that the error with the true maximum eigenvector is also less than 0.0001.

Figure 1. Convergence process of (6) with median initial vector.

Figure 2. Convergence process of (6) with smaller initial vector.

The third initial vector we randomly generated is [1142.75, 1458.86, 1245.25, 1135.28, 904.94, 1327.2]. The trajectory of the eigenvector iteration and the learning rate iteration trajectory are shown in Figure 3. The algorithm converges after 35 iterations, w T A w converges to the maximum eigenvalue 1.778753, and the maximum eigenvector from the final iteration is [0.341414, 0.579436, 0.411738, 0.336548, 0.203616, 0.472693], and the learning rate converges to 0.281096.It can be seen that the error with the real maximum eigenvector is still less than 0.0001.

We can see that the algorithm can converge to the maximum eigenvector faster from three different initial value experiments. Of course, the closer the initial vector is to the maximum eigenvector, the faster the convergence will be. When the initial value is vast, the initial learning rate is close to 0, and the vector will gradually come down to convergence. When the initial value is minimal, since the initial learning rate will be huge, the vector will have a more significant jump and gradually converge. When ξ in the algorithm is determined, no matter how the initial vector is taken, the learning rate after the final algorithm converges is a fixed value, which is consistent with our theoretical proof.

Although in the theoretical proof, w ( 0 ) V σ will converge, but after many numerical experiments, we found that an initial vector w ( 0 ) is chosen arbitrarily, the results will converge. This is because the probability that w ( 0 ) V σ is almost 0. Then we select a low-dimensional covariance matrix with integer eigenvectors for experiment. We found that when w ( 0 ) V σ , w ( k ) is not divergent, it will converge to other eigenvectors. Suppose the multiplicity of the largest eigenvalue (denoted by σ ) of C is m ( 1 m n ) , i.e., λ 1 = λ 2 = = λ m = σ ; Suppose the multiplicity of the second largest eigenvalue (denoted by τ ) of C is t ( m < t n ) , i.e., λ m + 1 = λ m + 2 = = λ m + t = τ . Denote V σ = s p a n { v m + 1 , , v n } , V τ = s p a n { v m + τ + 1 , , v n } when w ( 0 ) V σ , w ( k ) , will converge to the largest eigenvector of C; when w ( 0 ) V σ and w ( 0 ) V τ , w ( k ) will converge to the second largest eigenvector of C.

Figure 3. Convergence process of (6) with larger initial vector.

4.2. Comparison Algorithm (5) and (6)

In order to compare the difference between (5) and (6), we did some experiments. We randomly generated an initial vector as [571.37, 729.43, 622.63, 567.64, 452.47, 663.6]. The two algorithms use the same ξ = 0.2 . For (6) we use a = 0.001, b = 0.001. We still use ε = 0.0001 as the stopping criterion. The results are shown in Figure 4 and Figure 5.

From the results, we can see that Algorithm (5) converges after 83 iterations, and Algorithm (6) converges after 69 iterations. Algorithm (6) can quickly converge to near the true value within the first ten iterations, while Algorithm (5) requires more iterations. We found that when we have a larger initial value, choosing smaller coefficients a, b can help the algorithm converge more quickly.

Figure 4. Convergence of algorithm (5) with larger initial vector.

Figure 5. Convergence of algorithm (6) with larger initial vector and smaller a, b.

4.3. Selection of Parameters a, b

The most significant difference between Algorithm (6) and Algorithm (5) is the newly introduced two parameters a and b. Next, we used numerical experiments to discuss the influence of parameters a and b on the algorithm’s convergence process.

We take [0.5488, 0.7152, 0.6028, 0.5449, 0.4237, 0.6459] as the initial vector, choose ξ = 0.2 , and changea, b to observe the convergence of the algorithm (6). The number of iterations required by the algorithm (6) under different parameters is shown in Table 1.

We can see that when one parameter is much larger than the other, the smaller parameter will lose its effect. We must keep the two parameters in the same order of magnitude. This is because the covariance matrix C and the identity matrix I are in the same order of magnitude in the experiment. If one coefficient is larger, the effect of the other coefficient will be reduced.

From Table 1, we can see that when a , b [ 0.01 , 0.1 ] the convergence speed is the fastest. Next, we take a value in this interval, and the convergence rate hardly changes. This shows that the algorithm is not sensitive to changes of a and b in the same order of magnitude.

We use [0.0055, 0.0072, 0.006, 0.0054, 0.0042, 0.0065] as the initial vector, we found when a , b [ 100 , 1000 ] the convergence speed is the fastest. Then we use [59.3932, 74.367, 64.2487, 59.0395, 48.1289, 68.1305] as the initial vector, we found when a , b [ 10 6 , 10 5 ] the convergence speed is the fastest. This shows that when we choose a larger initial vector, we should choose smaller a, b and vice versa. The above three situations are all satisfied w ( 0 ) T A w ( 0 ) ( 0.01 , 1 ) . It means when we choose w ( 0 ) a and b satisfied w ( 0 ) T A w ( 0 ) ( 0.01 , 1 ) , the convergence speed of the algorithm (6) will reach the fastest.

4.4. Online Data Experiment

Next, we consider the case of online data. Suppose the input sequence { x ( k ) | x ( k ) R n ( k = 0 , 1 , 2 , ) } is a zero mean stationary stochastic process. Because the data come in one by one online, we cannot get the covariance matrix C in advance. According to [4] C can be expressed as

C k = β C k 1 + ( x ( k ) x ( k ) T β C k 1 ) k (64)

where β is a coefficient, when x ( k ) comes from a stationary process, then β = 1 , else β ( 0 , 1 ) . According to The Law of Large Numbers, P ( C C k F ε ) converges with probability 1.

We randomly generate data and add a new data every iteration. We still use ε = 0.0001 as the stopping criterion, and find that different initial vectors and different ξ will affect the convergence speed of the algorithm. Experiments with different initial values are just like offline data.

We studied the influence of parameter ξ on the convergence of the algorithm and selected the same initial value. The result is shown in Figure 6. It is found that the smaller the ξ is, the faster the algorithm can converge.

Figure 6. Convergence of algorithms with different ξ .

Table 1. Number of iterations with different a, b.

5. Conclusion

In this paper, we proposed a Generalized Adaptive Learning Rate (GALR) PCA Algorithm, which could be guaranteed that the algorithm’s convergence process will not be affected by the selection of the initial value. Using the DDT method, we gave the upper and lower bounds of the algorithm and proved the global convergence. We no longer need the learning rate to tend to zero. Numerical experiments have also verified our theory. We discussed the relationship between the initial vector and the parameters a, b. Our algorithm is effective for both online and offline data.

Cite this paper: Zhu, Z. , Ye, W. and Kuang, H. (2021) Convergence Analysis of a Kind of Deterministic Discrete-Time PCA Algorithm. Advances in Pure Mathematics, 11, 408-426. doi: 10.4236/apm.2021.115028.
References

[1]   Bouwmans, T., Javes, S., et al. (2018) On the Applications of Robust PCA in Image and Video Processing. Proceedings of the IEEE, 106, 1427-1457.
https://doi.org/10.1109/JPROC.2018.2853589

[2]   Oja, E. (1982) Simplified Neuron Model as a Principal Component Analyzer. Journal of Mathematical Biology, 15, 267-273.
https://doi.org/10.1007/BF00275687

[3]   Xu, L. (1993) Least Mean Square Error Reconstruction Principle for Self-Organizing Neural-Nets. Neural Networks, 6, 627-648.
https://doi.org/10.1016/S0893-6080(05)80107-8

[4]   Chatterjee, C., Kang, Z. and Roychowdhury, V.P. (2000) Algorithms for Accelerated Convergence of Adaptive PCA. IEEE Transactions on Neural Networks, 11, 338-355.
https://doi.org/10.1109/72.839005

[5]   Xu, L. and Yuille, A.L. (1995) Robust Principal Component Analysis by Self-Organizing Rules Based on Statistical Physics Approach. IEEE Transactions on Neural Networks, 6, 131-143.
https://doi.org/10.1109/72.363442

[6]   Wang, S., Liang, Y.L. and Ma, F. (1998) An Adaptive Robust PCA Neural Network. The 1998 IEEE International Joint Conference on Neural Networks Proceedings, Vol. 3, 2288-2293.
https://doi.org/10.1109/IJCNN.1998.687218

[7]   Yang, T.-N. and Wang, S.D. (1999) Robust Algorithms for Principal Component Analysis. Pattern Recognition Letters, 20, 927-933.
https://doi.org/10.1016/S0167-8655(99)00060-4

[8]   Chen, T.P., Hua, Y.B. and Yan, W.-Y. (1998) Global Convergence of Oja’s Subspace Algorithm for Principal Component Extraction. IEEE Transactions on Neural Networks, 9, 58-67.
https://doi.org/10.1109/72.655030

[9]   Zhang, Q. and Bao, Z. (1995) Dynamical Systems for Computing the Eigenvectors Associated with the Largest Eigenvalue of a Positive Definite Matrix. IEEE Transactions on Neural Networks, 6, 790-791.
https://doi.org/10.1109/72.377989

[10]   Zhang, Q. and Leung, Y.-W. (2000) A Class of Learning Algorithms for Principal Component Analysis and Minor Component Analysis. IEEE Transactions on Neural Networks, 11, 529-533.
https://doi.org/10.1109/72.839022

[11]   Zhang, Q.F. (2003) On the Discrete-Time Dynamics of a PCA Learning Algorithm. Neurocomputing, 55, 761-769.
https://doi.org/10.1016/S0925-2312(03)00439-9

[12]   Zufiria, P.J. (2002) On the Discrete-Time Dynamics of the Basic Hebbian Neural-Network Nods. IEEE Transactions on Neural Networks, 13, 1342-1352.
https://doi.org/10.1109/TNN.2002.805752

[13]   Yi, Z., Ye, M., Lv, J.C. and Tan, K.K. (2005) Convergence Analysis of a Deterministic Discrete Time System of Oja’s PCA Learning Algorithm. IEEE Transactions on Neural Networks, 16, 1318-1328.
https://doi.org/10.1109/TNN.2005.852236

[14]   Jian, C.L., Zhang, Y. and Tan, K.K. (2006) Global Convergence of Oja’s PCA Learning Algorithm with a Non-Zero-Approaching Adaptive Learning Rate. Theoretical Computer Science, 367, 286-307.
https://doi.org/10.1016/j.tcs.2006.07.012

[15]   Kong, X.Y., Hu, C.H. and Duan, Z.S. (2017) Principal Component Analysis Networks and Algorithms. Springer Nature, Berlin.
https://doi.org/10.1007/978-981-10-2915-8

 
 
Top