WJET Vol.2 No.1, February 2014
Wake-Up-Word Feature Extraction on FPGA
Abstract: The Wake-Up-Word Speech Recognition (WUW-SR) task is computationally very demanding, particularly the feature extraction stage, whose output is decoded with corresponding Hidden Markov Models (HMMs) in the back-end stage of the WUW-SR system. The state-of-the-art WUW-SR system is based on three sets of features: Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding Coefficients (LPC), and Enhanced Mel-Frequency Cepstral Coefficients (ENH_MFCC). In "Front-end of Wake-Up-Word Speech Recognition System Design on FPGA" [1], we presented an experimental FPGA design and implementation of a novel architecture for a real-time spectrogram extraction processor that generates MFCC, LPC, and ENH_MFCC spectrograms simultaneously. This paper presents the details of converting the three sets of spectrograms (MFCC, LPC, and ENH_MFCC) to their equivalent features. In the WUW-SR system, the recognizer's front-end is located at the terminal, which is typically connected over a data network to a remote back-end recognizer (e.g., a server). The WUW-SR system is shown in Figure 1. The three sets of speech features are extracted at the front-end. The extracted features are then compressed and transmitted over a dedicated channel to the server, where they are decoded.
Cite this paper: V. Këpuska, M. Eljhani and B. Hight, "Wake-Up-Word Feature Extraction on FPGA," World Journal of Engineering and Technology, Vol. 2 No. 1, 2014, pp. 1-12. doi: 10.4236/wjet.2014.21001.
References

[1]   Këpuska, V.Z., Eljhani, M.M. and Hight, B.H. (2013) Front-end of wake-up-word speech recognition system design on FPGA. Journal of Telecommunications System & Management, 2, 108.

[2]   Këpuska, V.Z. and Klein, T.B. (2009) A novel wake-up-word speech recognition system, wake-up-word recognition task, technology and evaluation. Nonlinear Analysis, Theory, Methods & Applications, 71, e2772-e2789.

[3]   Tuzun, O.B., Demirekler, M. and Bora, K. (1994) Comparison of parametric and non-parametric representations of speech for recognition. 7th Mediterranean Electro-technical Conference, Antalya, 12-14 April 1994, 65-68.

[4]   Openshaw, J.P., Sun, Z.P. and Mason, J.S. (1993) A comparison of composite features under degraded speech in speaker recognition. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2, 371-374.
http://dx.doi.org/10.1109/ICASSP.1993.319316

[5]   Vergin, R., O’Shaughnessy, D. and Gupta, V. (1996) Compensated mel frequency cepstrum coefficients. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, 7-10 May 1996, 323-326.

[6]   Davis, S. and Mermelstein, P. (1980) Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28, 357-366.
http://dx.doi.org/10.1109/TASSP.1980.1163420

[7]   Combrinck, H. and Botha, E. (1996) On the mel-scaled cepstrum.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.1382&rep=rep1&type=pdf

[8]   Schroeder, M.R. (1982) Linear prediction, extremal entropy and prior information in speech signal analysis and synthesis. Speech Communication, 1, 9-20.
http://dx.doi.org/10.1016/0167-6393(82)90004-8

[9]   Paliwal, K.K. and Kleijn, W.B. (1995) Quantization of LPC parameters. In: Kleijn, W.B. and Paliwal, K.K., Eds., Speech Coding and Synthesis, Elsevier Science, Amsterdam, 433-466.

 
 