APM  Vol.8 No.4 , April 2018
Low-Rank Sparse Representation with Pre-Learned Dictionaries and Side Information for Singing Voice Separation
ABSTRACT
Although speech separation has achieved fruitful results, the separation of singing voice and accompaniment is still far from ideal. Based on low-rank and sparse optimization theory, we propose a new singing voice separation algorithm called Low-rank, Sparse Representation with pre-learned dictionaries and side Information (LSRi). The algorithm models the vocal and instrumental spectrograms as a sparse matrix and a low-rank matrix, respectively, and combines pre-learned dictionaries with the voice spectrogram reconstructed from the annotation. Evaluations on the iKala dataset show that the proposed method is effective and efficient for singing voice separation.

1. Introduction

Separating the singing voice from a music recording is useful in many applications, such as music information retrieval, singer identification, and lyrics recognition and alignment [1]. Although the human auditory system can easily distinguish the vocal and instrumental parts of a music recording, the task is extremely difficult for computer systems. In this context, researchers are increasingly concerned with the mining of music information, and many algorithms have been proposed to separate the singing voice from music recordings.

Robust Principal Component Analysis (RPCA) is a matrix factorization algorithm that recovers underlying low-rank and sparse matrices [2]. Suppose we are given a large data matrix $X$ and know that it may be decomposed as $X = A + E$, where $A$ is a low-rank matrix and $E$ is a sparse matrix. Based on RPCA, Huang et al. [3] separated the singing voice from the music accompaniment. They assumed that the repetitive music accompaniment lies in a low-rank subspace, while the singing voice can be regarded as sparse within a song. The main drawback of this approach is that it is completely unsupervised: the decomposition is guided only by the particular properties of each individual component. Later, Yu et al. [4] exploited universal voice and music dictionaries pre-learned from isolated singing voice and background music training data, and proposed Low-rank and Sparse representation with Pre-learned Dictionaries (LSPD) for singing voice separation. Chan et al. [5] proposed a modified RPCA algorithm, one of the first attempts to incorporate vocal activity information into RPCA; vocal activity detection itself has been widely studied [6] [7]. Chan and Yang [8] proposed to separate the singing voice by group-sparse representation informed by pitch annotations.
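For reference, the decomposition $X = A + E$ is usually obtained in RPCA [2] from a convex program of the following form. The weight $\lambda$ is not introduced in the paragraph above and is used here only for illustration; $\|\cdot\|_*$ denotes the nuclear norm and $\|\cdot\|_1$ the entrywise $\ell_1$ norm.

```latex
\min_{A,\,E}\ \|A\|_* + \lambda \|E\|_1
\quad \text{s.t.} \quad X = A + E
```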

In this paper, we present a model named Low-rank, Sparse Representation with pre-learned dictionaries and side Information (LSRi), solved under the ADMM framework. First, we pre-learn voice and music dictionaries from isolated singing voice and background music training data, respectively. Then, we use a sparse spectrogram and a low-rank spectrogram to model the singing voice and the background music, respectively. In addition, a residual term is added to capture the components that are not well modeled by either the sparse or the low-rank term. Finally, we incorporate the voice spectrogram reconstructed from the vocal annotation. Evaluations on the iKala dataset [9] show that LSRi performs better than the compared methods.

The rest of this paper is organized as follows. Section 2 presents the proposed model and its optimization. Section 3 describes the experimental setup and reports the results. Section 4 concludes this work.

2. The Proposed Method

Before introducing our method, we review the Low-rank and Sparse representation with Pre-learned Dictionaries (LSPD) method [4]:

$$\min_{Z_1, Z_2} \|Z_1\|_* + \lambda_1 \|Z_2\|_1 + \lambda_2 \|E\|_1 \quad \text{s.t.} \quad X = D_1 Z_1 + D_2 Z_2 + E \tag{1}$$

where $X$ is the input spectrogram, $D_1 \in \mathbb{R}^{m \times k_1}$ is a pre-learned dictionary of the music accompaniment, $D_2 \in \mathbb{R}^{m \times k_2}$ is a pre-learned dictionary of the singing voice, $D_1 Z_1$ is the separated accompaniment, $D_2 Z_2$ is the separated voice, and $E$ denotes the residual part. $\lambda_1$ and $\lambda_2$ are two weighting parameters that balance the regularization terms in this model.

Compared with the unsupervised RPCA algorithm, the LSPD algorithm adds pre-learned dictionary information and improves the separation quality. To further improve the separation of the singing voice and the music accompaniment, we propose Low-rank, Sparse Representation with pre-learned dictionaries and side Information (LSRi).

Our model takes into account additional prior information, namely the voice spectrogram reconstructed from the annotation. The model is as follows:

$$\min_{Z_1, Z_2} \|Z_1\|_* + \lambda_1 \|Z_2\|_1 + \lambda_2 \|E\|_1 + \frac{\gamma}{2} \|D_2 Z_2 - E_0\|_F^2 \quad \text{s.t.} \quad X = D_1 Z_1 + D_2 Z_2 + E \tag{2}$$

Here all parameters in model (2) are the same as in model (1), $E_0$ denotes the voice spectrogram reconstructed from the annotation, $\gamma$ weights the side-information term, and $\|\cdot\|_F$ denotes the Frobenius norm. We use the ADMM algorithm [10] to solve this optimization problem by introducing two auxiliary variables $J_1$ and $J_2$, which leads to three equality constraints:

$$\min_{Z_1, Z_2, J_1, J_2} \|J_1\|_* + \lambda_1 \|J_2\|_1 + \lambda_2 \|E\|_1 + \frac{\gamma}{2} \|D_2 Z_2 - E_0\|_F^2 \quad \text{s.t.} \quad X = D_1 Z_1 + D_2 Z_2 + E,\ Z_1 = J_1,\ Z_2 = J_2 \tag{3}$$

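The unconstrained augmented Lagrangian $\mathcal{L}$ is given by

$$\begin{aligned} \mathcal{L} = {} & \|J_1\|_* + \lambda_1 \|J_2\|_1 + \lambda_2 \|E\|_1 + \frac{\gamma}{2} \|D_2 Z_2 - E_0\|_F^2 \\ & + \langle Y_1, X - D_1 Z_1 - D_2 Z_2 - E \rangle + \langle Y_2, Z_1 - J_1 \rangle + \langle Y_3, Z_2 - J_2 \rangle \\ & + \frac{\mu}{2} \left( \|X - D_1 Z_1 - D_2 Z_2 - E\|_F^2 + \|Z_1 - J_1\|_F^2 + \|Z_2 - J_2\|_F^2 \right) \end{aligned} \tag{4}$$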

where $Y_1$, $Y_2$ and $Y_3$ are the Lagrange multipliers and $\mu > 0$ is the penalty parameter. We then iteratively update $J_1$, $Z_1$, $J_2$, $Z_2$ and $E$.

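1) Update $J_1$:

$$J_1 = \arg\min_{J_1} \|J_1\|_* + \frac{\mu}{2} \left\| J_1 - (Z_1 + \mu^{-1} Y_2) \right\|_F^2 = U \, S_{1/\mu}[\Sigma] \, V^T \tag{5}$$

where $U \Sigma V^T = \mathrm{svd}(Z_1 + \mu^{-1} Y_2)$ and $S_\tau[x] = \mathrm{sign}(x) \max(|x| - \tau, 0)$ is the soft-thresholding operator, applied entrywise.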
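The singular value thresholding step of Eq. (5) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation; the function names `soft_threshold` and `update_J1` are hypothetical.

```python
import numpy as np

def soft_threshold(x, tau):
    """Entrywise soft-thresholding: sign(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def update_J1(Z1, Y2, mu):
    """Eq. (5): shrink the singular values of Z1 + Y2 / mu by 1 / mu."""
    U, s, Vt = np.linalg.svd(Z1 + Y2 / mu, full_matrices=False)
    # Scaling the columns of U by the shrunk singular values equals U @ diag(s) @ Vt.
    return (U * soft_threshold(s, 1.0 / mu)) @ Vt
```

Singular values below $1/\mu$ are set to zero, which is what lowers the rank of $J_1$.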

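2) Update $Z_1$:

$$\frac{\partial \mathcal{L}}{\partial Z_1} = -D_1^T Y_1 + Y_2 - \mu D_1^T (X - D_1 Z_1 - D_2 Z_2 - E) + \mu (Z_1 - J_1) \tag{6}$$

Setting $\partial \mathcal{L} / \partial Z_1 = 0$, we have

$$Z_1 = (D_1^T D_1 + I)^{-1} \left( D_1^T (X - D_2 Z_2 - E + \mu^{-1} Y_1) - \mu^{-1} Y_2 + J_1 \right) \tag{7}$$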

3) Update $J_2$:

$$J_2 = \arg\min_{J_2} \lambda_1 \|J_2\|_1 + \frac{\mu}{2} \left\| J_2 - (Z_2 + \mu^{-1} Y_3) \right\|_F^2 \tag{8}$$

which can be solved by the soft-thresholding operator:

$$J_2 = S_{\lambda_1 / \mu} (Z_2 + \mu^{-1} Y_3) \tag{9}$$

Since the spectrogram is non-negative, we use

$$J_2 = \max \{ S_{\lambda_1 / \mu} (Z_2 + \mu^{-1} Y_3), 0 \} \tag{10}$$

where $0$ is an all-zero matrix of the same size as $J_2$.
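The non-negative soft-thresholding step of Eq. (10) is a one-liner in NumPy. The sketch below is illustrative only, and the function name `update_J2` is hypothetical.

```python
import numpy as np

def update_J2(Z2, Y3, lam1, mu):
    """Eq. (10): soft-threshold Z2 + Y3 / mu by lam1 / mu, then clip at zero."""
    V = Z2 + Y3 / mu
    shrunk = np.sign(V) * np.maximum(np.abs(V) - lam1 / mu, 0.0)
    return np.maximum(shrunk, 0.0)
```

Because negative entries are clipped afterwards, the result is the same as `np.maximum(V - lam1 / mu, 0.0)`; the two-step form is kept here to mirror Eqs. (9) and (10).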

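4) Update $Z_2$:

$$\frac{\partial \mathcal{L}}{\partial Z_2} = \gamma D_2^T (D_2 Z_2 - E_0) - D_2^T Y_1 + Y_3 - \mu D_2^T (X - D_1 Z_1 - D_2 Z_2 - E) + \mu (Z_2 - J_2) \tag{11}$$

Setting $\partial \mathcal{L} / \partial Z_2 = 0$, we have

$$\begin{aligned} Z_2 &= \left( (\gamma + \mu) D_2^T D_2 + \mu I \right)^{-1} \left( \gamma D_2^T E_0 + D_2^T Y_1 - Y_3 + \mu D_2^T (X - D_1 Z_1 - E) + \mu J_2 \right) \\ &= \left( \left( \frac{\gamma}{\mu} + 1 \right) D_2^T D_2 + I \right)^{-1} \left( D_2^T \left( X - D_1 Z_1 - E + \frac{\gamma}{\mu} E_0 + \frac{1}{\mu} Y_1 \right) - \frac{1}{\mu} Y_3 + J_2 \right) \end{aligned} \tag{12}$$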
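The closed-form update of Eq. (12) amounts to solving a small $k_2 \times k_2$ linear system. The following sketch (the function name `update_Z2` is hypothetical) uses `np.linalg.solve` instead of forming the inverse explicitly, which is a numerical design choice and not something stated in the text.

```python
import numpy as np

def update_Z2(X, D1, D2, Z1, E, E0, J2, Y1, Y3, mu, gamma):
    """Eq. (12): solve ((gamma/mu + 1) D2^T D2 + I) Z2 = rhs for Z2."""
    k2 = D2.shape[1]
    lhs = (gamma / mu + 1.0) * (D2.T @ D2) + np.eye(k2)
    rhs = D2.T @ (X - D1 @ Z1 - E + (gamma / mu) * E0 + Y1 / mu) - Y3 / mu + J2
    return np.linalg.solve(lhs, rhs)
```

The system matrix is symmetric positive definite, so the solve is well posed for any dictionary $D_2$. A quick sanity check is that the gradient in Eq. (11) vanishes at the returned $Z_2$.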

5) Update $E$:

$$E = \arg\min_{E} \lambda_2 \|E\|_1 + \frac{\mu}{2} \left\| E - (X - D_1 Z_1 - D_2 Z_2 + \mu^{-1} Y_1) \right\|_F^2 \tag{13}$$

Similarly to $J_2$,

$$E = \max \{ S_{\lambda_2 / \mu} (X - D_1 Z_1 - D_2 Z_2 + \mu^{-1} Y_1), 0 \} \tag{14}$$

Finally, we update the Lagrange multipliers as in [11] .
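Putting the five updates together, one possible ADMM loop is sketched below. Several details are assumptions made only for this illustration and are not taken from the paper: the multipliers are updated with the standard dual-ascent rule $Y \leftarrow Y + \mu \cdot (\text{constraint residual})$, which may differ from the rule in [11]; $\mu$ is kept fixed; all variables start at zero; and a fixed iteration count replaces a convergence test. The function name `lsri_admm` is hypothetical.

```python
import numpy as np

def shrink(V, tau):
    """Entrywise soft-thresholding operator S_tau."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def lsri_admm(X, D1, D2, E0, lam1, lam2, gamma, mu=1.0, n_iter=50):
    """Sketch of the LSRi updates, Eqs. (5), (7), (10), (12) and (14)."""
    m, n = X.shape
    k1, k2 = D1.shape[1], D2.shape[1]
    Z1, J1, Y2 = (np.zeros((k1, n)) for _ in range(3))
    Z2, J2, Y3 = (np.zeros((k2, n)) for _ in range(3))
    E, Y1 = np.zeros((m, n)), np.zeros((m, n))
    # With a fixed mu the two system matrices do not change across iterations.
    A1 = D1.T @ D1 + np.eye(k1)
    A2 = (gamma / mu + 1.0) * (D2.T @ D2) + np.eye(k2)
    for _ in range(n_iter):
        # 1) J1: singular value thresholding, Eq. (5)
        U, s, Vt = np.linalg.svd(Z1 + Y2 / mu, full_matrices=False)
        J1 = (U * shrink(s, 1.0 / mu)) @ Vt
        # 2) Z1: linear solve, Eq. (7)
        Z1 = np.linalg.solve(
            A1, D1.T @ (X - D2 @ Z2 - E + Y1 / mu) - Y2 / mu + J1)
        # 3) J2: non-negative soft-thresholding, Eq. (10)
        J2 = np.maximum(shrink(Z2 + Y3 / mu, lam1 / mu), 0.0)
        # 4) Z2: linear solve with the side-information term, Eq. (12)
        Z2 = np.linalg.solve(
            A2,
            D2.T @ (X - D1 @ Z1 - E + (gamma / mu) * E0 + Y1 / mu)
            - Y3 / mu + J2)
        # 5) E: non-negative soft-thresholding, Eq. (14)
        R = X - D1 @ Z1 - D2 @ Z2
        E = np.maximum(shrink(R + Y1 / mu, lam2 / mu), 0.0)
        # Multiplier updates (assumed dual-ascent form, see lead-in)
        Y1 = Y1 + mu * (R - E)
        Y2 = Y2 + mu * (Z1 - J1)
        Y3 = Y3 + mu * (Z2 - J2)
    # Separated accompaniment, separated voice, residual
    return D1 @ Z1, D2 @ Z2, E
```

The function returns $D_1 Z_1$ and $D_2 Z_2$, which the model interprets as the separated accompaniment and voice spectrograms, together with the residual $E$.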

3. Experiment

3.1. Dataset

Our experiments were conducted on the iKala dataset [9], which contains 252 30-second clips of Chinese popular songs in CD quality. We randomly select 44 songs for training (i.e., learning the dictionaries $D_1$ and $D_2$), leaving 208 songs for testing the separation performance. To reduce the computational cost and the memory footprint of the proposed algorithm, we downsample all audio recordings from 44,100 Hz to 22,050 Hz. We then compute the STFT of each recording by sliding a Hamming window of 1411 samples with 75% overlap to obtain the spectrogram.
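The spectrogram computation can be sketched with NumPy alone. The block below assumes the signal has already been resampled to 22,050 Hz, uses an FFT length equal to the window length, applies no padding at the signal edges, and rounds the hop to the nearest sample; none of these details are specified in the text. The function name `magnitude_spectrogram` is hypothetical.

```python
import numpy as np

def magnitude_spectrogram(x, win_len=1411, overlap=0.75):
    """Magnitude STFT with a sliding Hamming window (frequency x time)."""
    hop = int(round(win_len * (1.0 - overlap)))  # 353 samples for 1411 and 75%
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop:i * hop + win_len] * window for i in range(n_frames)],
        axis=1)
    # One-sided FFT of each windowed frame; rows are frequency bins.
    return np.abs(np.fft.rfft(frames, axis=0))
```

For a one-second signal at 22,050 Hz this yields a 706 x 59 matrix, whose columns play the role of the frames of $X$.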

3.2. Dictionary and E0

Our implementation of Online Dictionary Learning for Sparse Coding (ODL) [12] is based on the SPAMS toolbox. Given $N$ signals $x_i \in \mathbb{R}^m$, ODL learns a dictionary $D$ by solving the following joint optimization problem:

$$\min_{D \geq 0,\, \alpha} \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{2} \|x_i - D \alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \right) \quad \text{s.t.} \quad d_j^T d_j \leq 1,\ \alpha_i \geq 0 \tag{15}$$

where $\|\cdot\|_2$ denotes the Euclidean norm, $d_j$ is the $j$-th atom (column) of $D$, $\alpha_i$ is the code of $x_i$, and $\lambda$ is a regularization parameter. The input frames are extracted from the training set after the short-time Fourier transform (STFT). Following [8], we set the dictionary size to 100 atoms.
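To make the cost in Eq. (15) concrete, the sketch below evaluates the objective and checks the constraints for a given dictionary and set of codes. It is not the SPAMS solver and does not learn anything; the function names `odl_objective` and `is_feasible` are hypothetical, and signals and codes are assumed to be stored as matrix columns.

```python
import numpy as np

def odl_objective(X, D, A, lam):
    """Value of the cost in Eq. (15).

    X: (m, N) signals as columns, D: (m, k) dictionary, A: (k, N) codes.
    """
    residual = X - D @ A
    per_signal = (0.5 * np.sum(residual ** 2, axis=0)
                  + lam * np.sum(np.abs(A), axis=0))
    return float(per_signal.mean())

def is_feasible(D, A, tol=1e-12):
    """Check D >= 0, alpha_i >= 0 and d_j^T d_j <= 1 for every atom."""
    atoms_ok = np.all(np.sum(D ** 2, axis=0) <= 1.0 + tol)
    return bool(atoms_ok and np.all(D >= 0) and np.all(A >= 0))
```

The norm constraint on the atoms prevents the trivial solution in which $D$ grows without bound while the codes shrink toward zero.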

To obtain the voice spectrogram reconstructed from the annotation ($E_0$), we first transform the human-labeled vocal pitch contours into a time-frequency binary mask. The authors of [13] proposed a harmonic mask similar to that of [14], which passes only integer multiples of the vocal fundamental frequency [15] [16]:

$$M(f, t) = \begin{cases} 1, & \text{if } |f - n F_0(t)| < w/2,\ \exists\, n \in \mathbb{N}^+ \\ 0, & \text{otherwise.} \end{cases} \tag{16}$$

Here $F_0(t)$ is the vocal fundamental frequency at time $t$, $n$ is the order of the harmonic, and $w$ is the width of the mask. We then simply define the vocal annotation as $E_0 = X \circ M$, where $\circ$ denotes the Hadamard product.
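The harmonic mask of Eq. (16) can be sketched as follows. The convention that an F0 value of 0 marks an unvoiced frame is an assumption made for this example, as are the function name `harmonic_mask` and the argument layout.

```python
import numpy as np

def harmonic_mask(f0, freqs, width):
    """Binary mask M(f, t) of Eq. (16).

    f0:    (T,) fundamental frequency per frame in Hz, 0 if unvoiced (assumed)
    freqs: (F,) centre frequency of each spectrogram bin in Hz
    width: mask width w in Hz
    """
    mask = np.zeros((len(freqs), len(f0)))
    for t, pitch in enumerate(f0):
        if pitch <= 0:
            continue  # unvoiced frame: nothing passes
        # The closest harmonic order n >= 1 minimises |f - n * F0|.
        n = np.maximum(np.round(freqs / pitch), 1.0)
        mask[:, t] = np.abs(freqs - n * pitch) < width / 2.0
    return mask
```

Given a magnitude spectrogram `X` of matching shape, the side information is then `E0 = X * harmonic_mask(f0, freqs, width)`, i.e. the Hadamard product with the mask.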

3.3. Evaluation

Separation performance is measured with the BSS EVAL toolbox version 3.0 (http://bass-db.gforge.inria.fr/). We use the source-to-interference ratio (SIR), source-to-artifacts ratio (SAR) and source-to-distortion ratio (SDR) provided by this commonly used toolbox. Denoting the separated singing voice by $\hat{v}$ and the original clean singing voice by $v$, the SDR [17] is computed as follows:

$$\mathrm{SDR}(\hat{v}, v) = 10 \log_{10} \left[ \frac{\langle \hat{v}, v \rangle^2}{\|\hat{v}\|^2 \|v\|^2 - \langle \hat{v}, v \rangle^2} \right]. \tag{17}$$

Normalized SDR (NSDR) is the improvement in SDR from the original mixture $x$ to the separated singing voice $\hat{v}$ [18] [19], and is commonly used to measure the separation performance for each mixture:

$$\mathrm{NSDR}(\hat{v}, v, x) = \mathrm{SDR}(\hat{v}, v) - \mathrm{SDR}(x, v). \tag{18}$$
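Eqs. (17) and (18) translate directly into a few lines of NumPy. The sketch below is a literal transcription of these two formulas for one-dimensional signals, not a substitute for the BSS EVAL toolbox; the function names `sdr` and `nsdr` are hypothetical.

```python
import numpy as np

def sdr(v_hat, v):
    """Eq. (17): source-to-distortion ratio in dB for 1-D signals."""
    num = np.dot(v_hat, v) ** 2
    den = np.dot(v_hat, v_hat) * np.dot(v, v) - num
    return 10.0 * np.log10(num / den)

def nsdr(v_hat, v, x):
    """Eq. (18): SDR improvement of the estimate over the raw mixture x."""
    return sdr(v_hat, v) - sdr(x, v)
```

Note that the denominator vanishes when the estimate is an exact scaled copy of the reference, in which case the ratio is unbounded.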

For overall performance evaluation, the global NSDR (GNSDR) is calculated as

$$\mathrm{GNSDR} = \frac{\sum_{i=1}^{N} w_i \, \mathrm{NSDR}(\hat{v}_i, v_i, x_i)}{\sum_{i=1}^{N} w_i}, \tag{19}$$

where $N$ is the total number of songs and $w_i$ is the length of the $i$-th song. Higher values of SIR, SAR, SDR, GSIR, GSAR, GSDR and GNSDR represent better separation quality.
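Eq. (19) is a length-weighted average of the per-song NSDR values. A minimal sketch, with the hypothetical function name `gnsdr`:

```python
import numpy as np

def gnsdr(nsdr_values, lengths):
    """Eq. (19): average of per-song NSDR weighted by song length."""
    nsdr_values = np.asarray(nsdr_values, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    return float(np.sum(lengths * nsdr_values) / np.sum(lengths))
```

For example, two songs with NSDR values of 2 dB and 4 dB and lengths 1 and 3 give a GNSDR of 3.5 dB, so longer clips contribute proportionally more.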

3.4. Parameter Selection

During parameter selection, we use the global normalized source-to-distortion ratio (GNSDR) as the evaluation index: the higher the value, the better the separation quality. In our algorithm, we set $\lambda_1 = \lambda_2 = 1 / \sqrt{\max(m, n)}$ for each $X \in \mathbb{R}^{m \times n}$, similar to [9], and adjust only $\gamma$.

Figure 1 presents the GNSDR for the separated singing voice and background music obtained with LSRi. For the vocal part, the GNSDR first increases with $\gamma$ and then gradually decreases; LSRi achieves the highest overall GNSDR at $\gamma = 5$. For the accompaniment part, the GNSDR first increases and then remains steady beyond $\gamma = 5$. Therefore, we set $\gamma = 5$.

3.5. Comparison Results

We compare three low-rank and sparse algorithms on the iKala dataset:

・ RPCA: the unsupervised method proposed by Huang et al. [3], with the default parameter value $\lambda = 1 / \sqrt{\max(m, n)}$.

・ LSPD: the supervised method proposed by Yu et al. [4], with the default parameter values $\lambda_1 = \lambda_2 = 1 / \sqrt{\max(m, n)}$.

・ LSRi: the proposed method with low-rank representation and the voice spectrogram reconstructed from the annotation, with $\lambda_1 = \lambda_2 = 1 / \sqrt{\max(m, n)}$ and $\gamma = 5$.

Figure 1. Separation performance measured by GNSDR for the singing voice (left) and background music (right), using our proposed method LSRi.

Table 1. Separation quality for the singing voice and music for the iKala dataset of RPCA, LSPD and LSRi.

As shown in Table 1, our method achieves a higher global normalized source-to-distortion ratio (GNSDR) for both the singing voice and the accompaniment, which suggests that LSRi performs well in overall separation quality and that introducing prior knowledge improves the separation performance. In the vocal part, our algorithm achieves a higher GSIR than RPCA and LSPD, which shows that LSRi is better at removing the instrumental sounds. In the background music part, our algorithm also achieves a higher GSIR, which suggests that LSRi is better at removing the singing voice. The GSAR values, however, did not improve significantly; since SAR measures the artifacts introduced during separation, this indicates that the algorithm still needs improvement in that respect.

4. Conclusion

In this paper, we have presented a time-frequency based source separation algorithm for music signals. LSRi models the vocal and instrumental spectrograms as a sparse matrix and a low-rank matrix, respectively, and assigns the components captured by neither term to a residual term. The dictionaries for the singing voice and the background music are pre-learned from isolated singing voice and background music training data, respectively. Furthermore, LSRi incorporates vocal annotation information, through which prior knowledge of the voice and background music is introduced into the source separation process. Evaluations on the iKala dataset show that, by exploiting this additional information, the proposed method performs better for both the separated singing voice and the music accompaniment. In future studies, we will consider applying LSRi to the separation of complete songs.

Cite this paper
Yang, C. and Zhang, H. (2018) Low-Rank Sparse Representation with Pre-Learned Dictionaries and Side Information for Singing Voice Separation. Advances in Pure Mathematics, 8, 419-427. doi: 10.4236/apm.2018.84024.
References
[1]   Li, Y. and Wang, D.L. (2007) Separation of Singing Voice from Music Accompaniment for Monaural Recordings. IEEE Transactions on Audio, Speech and Language Processing, 15, 1475-1487.
https://doi.org/10.1109/TASL.2006.889789

[2]   Candes, E.J., Li, X., Ma, Y. and Wright, J. (2011) Robust Principal Component Analysis? Journal of the ACM, 58, 1-37.
https://doi.org/10.1145/1970392.1970395

[3]   Huang, P.S., Chen, S.D., Smaragdis, P. and Johnson, M.H. (2012) Singing Voice Separation from Monaural Recordings Using Robust Principal Component Analysis. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, 25-30 March 2012, 57-60.
https://doi.org/10.1109/ICASSP.2012.6287816

[4]   Yu, S., Zhang, H. and Duan, Z. (2017) Singing Voice Separation by Low-Rank and Sparse Spectrogram Decomposition with Pre-Learned Dictionaries. Journal of the Audio Engineering Society, 65, 377-388.
https://doi.org/10.17743/jaes.2017.0009

[5]   Chan, T.S., Yeh, T.C., Fan, Z.C., Chen, H.W., Su, L., Yang, Y.H. and Jang, R. (2015) Vocal Activity Informed Singing Voice Separation with the iKala Dataset. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, 19-24 April 2015, 718-722.
https://doi.org/10.1109/ICASSP.2015.7178063

[6]   Lehner, B., Widmer, G. and Sonnleitner, R. (2014) On the Reduction of False Positives in Singing Voice Detection. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, 4-9 May 2014, 7480-7484.
https://doi.org/10.1109/ICASSP.2014.6855054

[7]   Yoshii, K., Fujihara, H., Nakano, T. and Goto, M. (2014) Cultivating Vocal Activity Detection for Music Audio Signals in a Circulation Type Crowd Sourcing Ecosystem. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, 4-9 May 2014, 624-628.
https://doi.org/10.1109/ICASSP.2014.6853671

[8]   Chan, T.S. and Yang, Y.H. (2017) Informed Group-Sparse Representation for Singing Voice Separation. IEEE Signal Processing Letters, 24, 156-160.

[9]   Chan, T.S., Yeh, T.C., Fan, Z.C., Chen, H.W., Su, L., Yang, Y.H. and Jang, R. (2015) Vocal Activity Informed Singing Voice Separation with the iKala Dataset. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, 19-24 April 2015, 718-722.
https://doi.org/10.1109/ICASSP.2015.7178063

[10]   Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. (2011) Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 3, 1-122.
https://doi.org/10.1561/2200000016

[11]   Ma, S. (2016) Alternating Proximal Gradient Method for Convex Minimization. Journal of Scientific Computing, 68, 546-572.
https://doi.org/10.1007/s10915-015-0150-0

[12]   Mairal, J., Bach, F., Ponce, J. and Sapiro, G. (2009) Online Dictionary Learning for Sparse Coding. Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, 14-18 June 2009, 689-696.
https://doi.org/10.1145/1553374.1553463

[13]   Ikemiya, Y., Yoshii, K. and Itoyama, K. (2015) Singing Voice Analysis and Editing Based on Mutually Dependent F0 Estimation and Source Separation. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, 19-24 April 2015, 574-578.
https://doi.org/10.1109/ICASSP.2015.7178034

[14]   Virtanen, T., Mesaros, A. and Ryynanen, M. (2008) Combining Pitch-Based Inference and Non-Negative Spectrogram Factorization in Separating Vocals from Polyphonic Music. ITRW on Statistical and Perceptual Audio Processing, Brisbane, 21 September 2008, 17-22.

[15]   Durrieu, J.L., David, B. and Richard, G. (2011) A Musically Motivated Midlevel Representation for Pitch Estimation and Musical Audio Source Separation. IEEE Journal of Selected Topics in Signal Processing, 5, 1180-1191.
https://doi.org/10.1109/JSTSP.2011.2158801

[16]   Ryynanen, M., Virtanen, T., Paulus, J. and Klapuri, A. (2008) Accompaniment Separation and Karaoke Application Based on Automatic Melody Transcription. 2008 IEEE International Conference on Multimedia and Expo, Hannover, 23-26 June 2008, 1417-1420.

[17]   Gribonval, R., Benaroya, L., Vincent, E. and Févotte, C. (2003) Proposals for Performance Measurement in Source Separation. 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), Nara, April 2003, 763-768.

[18]   Ozerov, A., Philippe, P., Gribonval, R. and Bimbot, F. (2005) One Microphone Singing Voice Separation Using Source-Adapted Models. 2005 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, 16 October 2005, 90-93.
https://doi.org/10.1109/ASPAA.2005.1540176

[19]   Ozerov, A., Philippe, P., Bimbot, F. and Gribonval, R. (2007) Adaptation of Bayesian Models for Single Channel Source Separation and Its Application to Voice/Music Separation in Popular Songs. IEEE Transactions on Audio, Speech and Language Processing, 15, 1564-1578.
https://doi.org/10.1109/TASL.2007.899291

 
 