Speaker verification is a subtask of speaker recognition whose purpose is to verify whether a segment of speech was spoken by a designated speaker. Total variability factor analysis has been widely used in speaker verification. In total variability factor analysis, the speaker and channel variabilities are modeled jointly in a single low-dimensional space, referred to as the total variability space. Mapping into this space reduces the dimensionality of the Gaussian mixture model (GMM) mean supervector while retaining the useful information, so the latent variables can be estimated from limited data. The resulting low-dimensional representation of the speaker’s identity is called the total variability factor vector, or i-vector. A support vector machine (SVM) can be used as a classifier for i-vectors.
As an application of probabilistic principal component analysis (PPCA), total variability factor analysis analyzes the speech data only from a global perspective. To compensate for this deficiency, we introduced locality preserving projection (LPP), neighborhood preserving embedding (NPE), and discriminant neighborhood embedding (DNE) to speaker verification. By constructing a graph that contains the neighborhood information of the speech data, these methods preserve the inherent local neighborhood relationships of the data. Combined with total variability factor analysis, they improve the performance of speaker verification. However, LPP is an unsupervised learning algorithm: it ignores the speaker label information during dimensionality reduction and does not exploit the discriminative information between the speech data of different speakers. Both the speaker labels of the training data and the discriminative information of the speech data are of great importance in speaker verification.
In view of the above shortcomings of LPP, we apply the locality preserving discriminant projection (LPDP) algorithm in speaker verification. LPDP can bring in the speaker label information from the speech data and, through optimization, preserve the inherent local manifold structure of the speech data samples from the same speaker to reduce the distance between them. At the same time, the distance between the speech data samples from different speakers is enlarged to enhance the discriminative ability of the embedding space.
The remainder of this paper is organized as follows. The LPP algorithm based on i-vector is introduced in Section 2. The LPDP algorithm is proposed in Section 3. The experiment and results are presented in Section 4. The conclusion is given in Section 5.
2. LPP Algorithm Based on I-Vector
2.1. Total Variability Factor Analysis
Based on the total variability space, the GMM mean supervector $M$ containing the speaker and channel information in the speech data can be expressed as
$$M = m + Tw \tag{1}$$
where $m$ is the speaker- and channel-independent mean supervector of the universal background model (UBM); $T$ is the low-rank total variability matrix, which defines the total variability space; and $w$ is a low-dimensional latent variable that follows a standard normal distribution, known as the total variability factor vector, or identity vector (i-vector). Total variability factor analysis can be regarded as a feature-extraction module: it projects the speech data into the total variability space defined by $T$ to obtain the i-vector $w$. The training method of $T$ and the extraction process of the i-vector have been described previously.
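As an illustration of how an i-vector follows from this model, the sketch below computes the posterior mean of w given sufficient statistics, using the standard closed form for a standard normal prior on w. The function name, argument layout, and toy sizes are assumptions made for this example and are not taken from the system described in this paper.

```python
import numpy as np

def extract_ivector(T, sigma, counts, centered_stats):
    """Posterior mean of w under M = m + T w with a standard normal prior on w.

    T              : (C*F, R) total variability matrix
    sigma          : (C*F,)   diagonal of the UBM covariances, stacked
    counts         : (C,)     zeroth-order statistics per UBM component
    centered_stats : (C*F,)   first-order statistics centred on the UBM means
    """
    feat_dim = T.shape[0] // counts.shape[0]
    n_sv = np.repeat(counts, feat_dim)           # per-dimension occupation counts
    Tt_sinv = T.T / sigma                        # T^T Sigma^{-1}
    precision = np.eye(T.shape[1]) + (Tt_sinv * n_sv) @ T
    return np.linalg.solve(precision, Tt_sinv @ centered_stats)

# Toy check with hypothetical sizes: 4 components, 3-dim features, rank-2 T.
rng = np.random.default_rng(0)
T = rng.standard_normal((12, 2))
w_true = np.array([0.5, -1.0])
counts = np.full(4, 1e6)                         # many frames per component
stats = np.repeat(counts, 3) * (T @ w_true)      # noise-free centred statistics
w_hat = extract_ivector(T, np.ones(12), counts, stats)
```

With many frames per component the prior has little influence, so `w_hat` should be close to `w_true` in this noise-free toy case.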
Intersession compensation can be carried out in the low-dimensional space in which the i-vector lies. Linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) are often used for this purpose. After intersession compensation, modeling and scoring are performed using an SVM.
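The WCCN step can be sketched as follows, assuming the usual formulation in which the projection B satisfies B B^T = W^{-1}, with W the average within-speaker covariance of the training i-vectors. The function names and the toy data are hypothetical.

```python
import numpy as np

def within_class_cov(ivectors, labels):
    """Average of the per-speaker covariance matrices."""
    ivectors = np.asarray(ivectors, dtype=float)
    labels = np.asarray(labels)
    speakers = np.unique(labels)
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    for spk in speakers:
        x = ivectors[labels == spk]
        centred = x - x.mean(axis=0)
        W += centred.T @ centred / len(x)
    return W / len(speakers)

def train_wccn(ivectors, labels):
    """Return B with B B^T = W^{-1} (Cholesky factor of the inverse)."""
    W = within_class_cov(ivectors, labels)
    return np.linalg.cholesky(np.linalg.inv(W))

def apply_wccn(ivectors, B):
    """Map each row vector w to B^T w."""
    return np.asarray(ivectors, dtype=float) @ B

# Toy data (hypothetical): 4 speakers, 10 three-dimensional vectors each.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(4), 10)
mix = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 0.2]])
X = rng.standard_normal((40, 3)) @ mix
B = train_wccn(X, labels)
X_norm = apply_wccn(X, B)
```

After the projection, the within-speaker covariance of the training vectors becomes the identity, which is the property WCCN is designed to provide.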
2.2. LPP Algorithm
The speaker verification system framework, in which the LPP algorithm based on i-vector is used, is presented in Figure 1. The dashed boxes from left to right refer to Enrollment, Training and Testing, respectively.
Applying the LPP algorithm on top of the i-vector combines the total variability factor analysis technique with LPP and retains both the global and the local neighborhood structures of the speech data, thereby significantly improving system performance. However, the known speaker label information of the speech data is not used in the dimensionality-reduction process of the LPP algorithm. As a result, although the locality-preserving projection space matrix P has a strong descriptive ability, its discriminative ability is not strong, which to a certain degree affects the recognition performance of the system.
Figure 1. The framework of the speaker verification system using the LPP algorithm based on i-vector.
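For reference, a minimal sketch of LPP training on i-vectors is given here. It follows the standard LPP formulation (k-nearest-neighbor graph, heat-kernel weights, generalized eigenproblem with the smallest eigenvalues) rather than any configuration reported in this paper; the neighbor count, kernel width, and regularizer are assumptions.

```python
import numpy as np

def generalized_eigh(A, B):
    """Solve A v = lam B v for symmetric A and positive-definite B (ascending lam)."""
    L = np.linalg.cholesky(B)
    L_inv = np.linalg.inv(L)
    C = L_inv @ A @ L_inv.T
    lam, U = np.linalg.eigh((C + C.T) / 2)
    return lam, L_inv.T @ U

def train_lpp(X, n_neighbors=5, out_dim=2, reg=1e-6):
    """Return a (D, out_dim) projection P; rows of X are i-vectors."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    t = d2.mean()                                         # assumed kernel width
    nn = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]     # skip self at column 0
    rows = np.repeat(np.arange(n), n_neighbors)
    W = np.zeros((n, n))
    W[rows, nn.ravel()] = np.exp(-d2[rows, nn.ravel()] / t)
    W = np.maximum(W, W.T)                                # undirected graph
    D = np.diag(W.sum(axis=1))
    lap = D - W                                           # graph Laplacian
    lhs = X.T @ lap @ X
    rhs = X.T @ D @ X + reg * np.eye(X.shape[1])
    _, V = generalized_eigh(lhs, rhs)
    return V[:, :out_dim]                                 # smallest eigenvalues

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))      # 30 hypothetical 4-dimensional i-vectors
P = train_lpp(X)
Y = X @ P                             # embedded vectors, one per row
```

Note that no speaker labels enter `train_lpp`, which is exactly the limitation discussed above.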
3. LPDP Algorithm
LPDP is an effective manifold learning method that has been successfully applied in face recognition. The basic idea of LPDP is to divide the nearest-neighbor graph of the LPP algorithm into an in-class graph and an out-of-class graph. LPDP maintains the local neighborhood relationship among the speech data samples of the same speaker and reduces the distance between them. At the same time, LPDP emphasizes the discriminative information between speakers and enlarges the distance between their speech data. Combined with total variability factor analysis, the algorithm analyzes the feature structure of the speech data both globally and locally, reflects the between-speaker differences, and enhances the discriminative ability of the embedding space.
The idea of applying LPDP to speaker verification is similar to that of LPP, as shown in Figure 1. The i-vectors of the given N items of training speech data with speaker labels constitute a vector set $W = \{w_1, w_2, \ldots, w_N\}$, where $w_i \in R^D$, $i = 1, 2, \ldots, N$. The purpose of LPDP is to find an optimal locality preserving discriminant projection space matrix $A$ that embeds the i-vectors from the space $R^D$ into the feature space $R^K$ ($K < D$). In the $R^K$ space, the i-vector $w_i$ is transformed to $y_i = A^T w_i$. The steps to train the locality preserving discriminant projection space matrix $A$ are as follows.
Step 1: Determine the neighborhood of the i-vector wi, which consists of all the i-vectors whose similarity with wi is less than its average similarity, i.e.,
where MS (wi) is the average similarity of all the N i-vectors for the training speech data with i-vector wi, and NB (wi) represents the neighborhood i-vectors of wi.
Step 2: Construct two subgraphs of the neighborhood graph: the in-class graph Gin and the out-of-class graph Gout. In both graphs, the i-th node corresponds to the i-vector wi. For the in-class graph Gin, we put a directed edge from node i to node j if i-vector wj is in the neighborhood of i-vector wi and belongs to the same class as wi. For the out-of-class graph Gout, we put a directed edge from node i to node j if i-vector wj is in the neighborhood of i-vector wi but belongs to a different class than wi.
Step 3: Calculate the weights of the edges in Gin and Gout, and obtain their respective weight matrices, Win and Wout.
1) Denote the weight of the edge between i-vector wi and i-vector wj in Gin as and choose its value as
2) Denote the weight of the edge between i-vector wi and i-vector wj in Gout as and choose its value as
Here, spk (wi) represents the speaker label information of i-vector wi, and t is the mean distance of all the i-vectors for the training speech data.
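Steps 1 to 3 can be sketched as below. This is an illustration under stated assumptions, not the exact formulas of this paper: the similarity is taken to be the Euclidean distance (so that a value below the average marks a close neighbor), and both graphs use heat-kernel weights exp(-||wi - wj||^2 / t) with t the mean pairwise distance. The function name is hypothetical.

```python
import numpy as np

def build_lpdp_graphs(ivectors, speakers):
    """Return (W_in, W_out); row i holds the directed edges leaving node i."""
    X = np.asarray(ivectors, dtype=float)
    spk = np.asarray(speakers)
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    off_diag = ~np.eye(n, dtype=bool)

    # Step 1: NB(w_i) = vectors closer to w_i than its average distance MS(w_i).
    ms = dist.sum(axis=1) / (n - 1)
    neighbours = (dist < ms[:, None]) & off_diag

    # Step 2: split the neighbourhood by speaker label.
    same = spk[:, None] == spk[None, :]

    # Step 3: assumed heat-kernel weights, t = mean pairwise distance.
    t = dist[off_diag].mean()
    heat = np.exp(-dist ** 2 / t)
    W_in = np.where(neighbours & same, heat, 0.0)
    W_out = np.where(neighbours & ~same, heat, 0.0)
    return W_in, W_out

# Hypothetical toy set: two speakers, three 2-dimensional vectors each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
spk = np.array([0, 0, 0, 1, 1, 1])
W_in, W_out = build_lpdp_graphs(X, spk)
```

Because the neighborhood threshold differs per node, the resulting graphs are directed and the weight matrices need not be symmetric.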
Step 4: Calculate the locality preserving discriminant projection matrix A. The idea of LPDP is that, in the embedding space, the i-vectors from the same speaker have the smallest in-class divergence after projection, i.e., the distance between the same speaker’s i-vectors is as small as possible. Conversely, the i-vectors from different speakers have the largest between-class divergence after projection, i.e., they are as far from each other as possible. These two goals are formulated as the following two optimization problems:
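where $L_{in} = D_{in} - W_{in}$ is the Laplacian matrix of the in-class graph, $D_{in}$ is a diagonal matrix with entries $D_{in}(i,i) = \sum_j W_{in}(i,j)$, $L_{out} = D_{out} - W_{out}$ is the Laplacian matrix of the out-of-class graph, and $D_{out}$ is a diagonal matrix with entries $D_{out}(i,i) = \sum_j W_{out}(i,j)$.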
Using the constraint condition , (6) and (7) can be integrated into one optimization problem,
which can be further transformed to a generalized eigenvalue problem,
By solving Equation (9), the locality-preserving discriminant projection space matrix $A = [a_1, a_2, \ldots, a_K]$ can be obtained, where $a_1, a_2, \ldots, a_K$ are the eigenvectors corresponding to the K largest eigenvalues of the above problem.
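Step 4 can be illustrated with the following sketch. It assumes one common way of combining the two objectives, namely maximizing the out-of-class scatter while normalizing the in-class scatter, which leads to the generalized eigenproblem S_out a = lambda S_in a with the K largest eigenvalues retained. The exact constraint used in this paper may differ, and the small regularizer `reg` is an addition for numerical stability.

```python
import numpy as np

def train_lpdp(ivectors, W_in, W_out, out_dim, reg=1e-6):
    """Return a (D, out_dim) projection matrix A; rows of ivectors are samples."""
    X = np.asarray(ivectors, dtype=float)

    def laplacian(W):
        W = (W + W.T) / 2                        # symmetrise the directed graph
        return np.diag(W.sum(axis=1)) - W

    S_in = X.T @ laplacian(W_in) @ X + reg * np.eye(X.shape[1])
    S_out = X.T @ laplacian(W_out) @ X
    # Reduce S_out a = lam * S_in a to a symmetric problem via Cholesky whitening.
    L = np.linalg.cholesky(S_in)
    L_inv = np.linalg.inv(L)
    C = L_inv @ S_out @ L_inv.T
    lam, U = np.linalg.eigh((C + C.T) / 2)       # ascending eigenvalues
    top = np.argsort(lam)[::-1][:out_dim]        # keep the largest ones
    return L_inv.T @ U[:, top]

# Hypothetical toy set: speakers differ along axis 0, nuisance spread on axis 1.
X = np.array([[0.0, -5.0], [0.2, 0.0], [0.0, 5.0],
              [3.0, -5.0], [3.2, 0.0], [3.0, 5.0]])
spk = np.array([0, 0, 0, 1, 1, 1])
same = spk[:, None] == spk[None, :]
W_in = (same & ~np.eye(6, dtype=bool)).astype(float)
W_out = (~same).astype(float)
A = train_lpdp(X, W_in, W_out, out_dim=1)
Y = X @ A                                        # projected vectors A^T w_i
```

In this toy case the learned direction is dominated by axis 0, the axis that separates the two speakers, while the within-speaker spread along axis 1 is suppressed.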
4. Experiments
4.1. Experimental Setup
Experiments were carried out on the core test set of the NIST SRE 2010 telephone-training, telephone-testing condition. The equal error rate (EER) and the minimum detection cost function (minDCF) were used as metrics for system evaluation.
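As a reminder of how the EER is obtained from trial scores, the following sketch sweeps a decision threshold over the observed scores and returns the operating point where the miss rate and the false-alarm rate are closest. It is a simple illustration, not the official NIST scoring tool, and the scores shown are made up.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the point where miss and false-alarm rates meet."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # A trial is accepted when its score is at or above the threshold.
    miss = np.array([(target < th).mean() for th in thresholds])
    false_alarm = np.array([(nontarget >= th).mean() for th in thresholds])
    idx = np.argmin(np.abs(miss - false_alarm))
    return (miss[idx] + false_alarm[idx]) / 2

# Hypothetical scores: perfectly separated trials give an EER of zero.
eer_separated = equal_error_rate([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
# Identical score distributions give an EER of one half.
eer_overlap = equal_error_rate([0.0, 1.0], [0.0, 1.0])
```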
In the experiments, 36-dimensional features consisting of 18 Mel-frequency cepstral coefficients (MFCCs) and their first-order derivatives were used. Each frame of a speech utterance was extracted with a 20 ms Hamming window and a 10 ms shift. To mitigate channel effects, feature warping, cepstral mean normalization (CMN), and cepstral variance normalization (CVN) were applied to the features.
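The assembly of the 36-dimensional features and the mean and variance normalization can be sketched as follows. The derivative is approximated here with a simple central difference, which is an assumption, since the delta window of the actual front end is not stated. Feature warping is not shown, and the input is random data standing in for real MFCCs.

```python
import numpy as np

def add_deltas(static):
    """Append first-order derivatives (central differences along time)."""
    delta = np.gradient(static, axis=0)
    return np.hstack([static, delta])

def cmvn(features, eps=1e-8):
    """Per-utterance cepstral mean and variance normalisation."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((200, 18)) * 3.0 + 1.0   # 200 frames, 18 coefficients
feats = cmvn(add_deltas(mfcc))                      # shape (200, 36)
```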
Two gender-dependent universal background models (UBMs) with 1024 Gaussian components each were trained using the NIST SRE 2004 1-side dataset. The gender-dependent total variability matrix T, LPP matrix, LPDP matrix, WCCN matrix, and LDA matrix were trained on the NIST SRE 2004, 2005, and 2006 corpora. The background data for the SVM were also selected from the NIST SRE 2004, 2005, and 2006 datasets. The SVM Light toolkit was used for SVM modeling.
4.2. Experimental Results
To verify the performance of the proposed LPDP algorithm, we experimentally compared it with the traditional total variability factor analysis and LPP algorithms.
Table 1 compares the performance of the three algorithms without channel compensation. Applying LPP or LPDP to the i-vector combines total variability factor analysis with a locality-preserving projection, so both the global and the local neighborhood structures of the speech data are maintained. Compared to total variability factor analysis, which preserves only the global structure of the speech data, LPP and LPDP significantly improve system performance. LPDP also makes effective use of the speaker label information of the speech data and, through optimization, maintains the intrinsic local manifold structure of the same speaker's speech data. In addition, LPDP enlarges the distance between the speech data of different speakers, which enhances the discriminative ability of the embedding space and further improves system performance. Compared with LPP, LPDP yields a relative improvement of 16.36% in EER and 13.04% in minDCF on the male test set, and 29.33% in EER and 8.67% in minDCF on the female test set.
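The relative improvements quoted in this section are assumed to follow the usual definition, (baseline - proposed) / baseline, expressed as a percentage. The short sketch below applies it to made-up error rates that are not taken from the tables.

```python
def relative_improvement(baseline, proposed):
    """Relative reduction of an error metric, in percent."""
    return (baseline - proposed) / baseline * 100.0

# Hypothetical values: an EER falling from 5.0% to 4.0% is a 20% relative gain.
gain = relative_improvement(5.0, 4.0)
print(f"{gain:.2f}%")
```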
Table 1. Comparison of EER and minDCF of LPDP, LPP, and total variability factor analysis (without channel compensation).
Table 2 shows the experimental results of the three algorithms with LDA intersession compensation. With LDA channel compensation, LPDP performs better than LPP: on the male and female test sets, the EER of the LPDP system improved by 23.78% and 26.67% relative, respectively, and the minDCF improved by 11.18% and 5.95% relative, respectively.
Table 3 shows the experimental results of the three algorithms with WCCN intersession compensation. With WCCN channel compensation, the LPDP system again outperforms the LPP system, yielding a relative improvement of 11.81% in EER and 6.85% in minDCF on the male test set, and 8.2% in EER and 5.19% in minDCF on the female test set.
Table 4 shows the experimental results of the three algorithms after applying both LDA and WCCN. LPDP still outperforms LPP when channel compensation is provided by both LDA and WCCN. Compared to LPP, the LPDP system gives additional relative gains of 9.16% in EER and 11.47% in minDCF on the male test set, and 10.94% in EER and 10.40% in minDCF on the female test set.
Table 2. Comparison of EER and minDCF of LPDP, LPP, and total variability factor analysis (LDA channel compensation).
Table 3. Comparison of EER and minDCF of LPDP, LPP, and total variability factor analysis (WCCN channel compensation).
Table 4. Comparison of EER and minDCF of LPDP, LPP, and total variability factor analysis (LDA + WCCN channel compensation).
5. Conclusion
On the basis of LPP, this paper introduced LPDP to speaker verification. LPDP makes full use of the speaker label information of the speech data to categorize and differentiate the neighborhood. It overcomes the shortcomings of the total variability factor analysis method by maintaining the intrinsic local neighborhood relationship of in-class (same speaker) speech data, and thus reflects the global and local structure of the speech data more comprehensively. It also addresses the inadequacy of LPP by maximizing the distance between out-of-class (different speakers) speech data, which yields more discriminative feature vectors, enhances the discriminative ability of the projection space, and thereby improves the recognition performance of the system. Our future work will be devoted to further enhancing the discriminative ability of the embedding space and improving the recognition performance of the system.
This work was supported by the National Natural Science Foundation of China (No.11704229).
References
 Naika, R. (2018) An Overview of Automatic Speaker Verification System. In: Intelligent Computing and Information and Communication, Springer, Singapore, 603-610.
 Chen, C. and Han, J.Q. (2018) Partial Least Squares Based Total Variability Space Modeling for I-Vector Speaker Verification. Chinese Journal of Electronics, 27, 1229-1233.
 Zhang, X., Zou, X., Sun, M., Zheng, T.F., Jia, C. and Wang, Y. (2019) Noise Robust Speaker Recognition Based on Adaptive Frame Weighting in GMM for I-Vector Extraction. IEEE Access, 7, 27874-27882.
 Ibrahim, N.S. and Ramli, D.A. (2018) I-Vector Extraction for Speaker Recognition Based on Dimensionality Reduction. Procedia Computer Science, 126, 1534-1540.
 Mak, M., Pang, X. and Chien, J. (2016) Mixture of PLDA for Noise Robust I-Vector Speaker Verification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24, 130-142.
 Yang, J., Liang, C., Yang, L., Suo, H., Wang, J. and Yan, Y. (2012) Factor Analysis of Laplacian Approach for Speaker Recognition. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, 25-30 March 2012, 4221-4224.
 Liang, C.Y., Yang, L., Zhao, Q.W. and Yang, Y.H. (2012) Factor Analysis of Neighborhood-Preserving Embedding for Speaker Verification. IEICE Transactions on Information & Systems, 95, 2572-2576.
 Chien, J. and Hsu, C. (2017) Variational Manifold Learning for Speaker Recognition. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, 5-9 March 2017, 4935-4939.
 Wu, D. (2015) Speaker Recognition Based on I-Vector and Improved Local Preserving Projection. In: Proceedings of the 2015 Chinese Intelligent Automation Conference, Springer, Heidelberg, 115-121.
 He, X.F., Yan, S.C., Hu, Y.X., Niyogi, P. and Zhang, H.-J. (2005) Face Recognition Using Laplacianfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 328-340.
 Haeb-Umbach, R. and Ney, H. (1992) Linear Discriminant Analysis for Improved Large Vocabulary Continuous Speech Recognition. 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, 23-26 March 1992, 13-16.
 Zhao, Z.H. and Hao, X.H. (2014) Linear Locality Preserving and Discriminating Projection for Face Recognition. Journal of Electronics & Information Technology, 35, 463-467.
 Wang, J.F. and Gao, Q. (2015) Discriminant Neighborhood Structure Embedding Using Trace Ratio Criterion for Image Recognition. Journal of Computer & Communications, 3, 64-70.
 Scheffer, N., Ferrer, L., Graciarena, M., Kajarekar, S., Shriberg, E. and Stolcke, A. (2011) The SRI NIST 2010 Speaker Recognition Evaluation System. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing, Prague, 22-27 May 2011, 5292-5295.