IIM Vol.13 No.4, July 2021
Speech Signal Detection Based on Bayesian Estimation by Observing Air-Conducted Speech under Existence of Surrounding Noise with the Aid of Bone-Conducted Speech
Abstract: To apply speech recognition systems to actual circumstances where handwriting is difficult, such as inspection and maintenance operations in industrial factories and recording and reporting routines at construction sites, countermeasures against surrounding noise are indispensable. In this study, a signal detection method that removes noise from actual speech signals is proposed by using Bayesian estimation with the aid of bone-conducted speech. More specifically, by introducing Bayes' theorem based on the observation of air-conducted speech contaminated by surrounding background noise, a new type of noise removal algorithm is theoretically derived. In the proposed speech detection method, bone-conducted speech is utilized in order to obtain a precise estimate of the speech signal. The effectiveness of the proposed method is experimentally confirmed by applying it to air- and bone-conducted speech measured in a real environment in the presence of surrounding background noise.

1. Introduction

Many kinds of speech recognition systems have been developed with the progress of digital information technology. For example, these systems are applied to inspection and maintenance operations in industrial factories and to recording and reporting routines at construction sites. For speech recognition in such actual circumstances, countermeasures against surrounding noise are indispensable.

Previously reported methods for noise reduction in speech recognition can be classified into two categories. One is based on a single microphone [1] [2] and the other uses a microphone array [3]. Since the latter requires prior information on the number of noise sources, and the number of microphones needed is larger than that of the noise sources in the case of multi-noise sources, this category demands large scale systems. Therefore, the former based on a single microphone is more advantageous than the latter [4] [5].

In such a noise reduction task for speech signals based on a single microphone, many algorithms applying the Kalman filter have been proposed by assuming Gaussian white noise [6] [7] [8]. However, actual noises show complex fluctuation forms with non-Gaussian and non-white properties. From this viewpoint, in our previously reported study, a noise suppression algorithm for actual speech signals that does not require the assumption of Gaussian white noise was proposed [9].

Furthermore, in our previous study, a signal processing method to remove the noise from actual speech signals was proposed by jointly using the measured data of bone- and air-conducted speech [10]. However, the algorithm of the previous method was highly complicated because it utilized lower and higher order correlations among the original speech signal, the bone-conducted speech and the air-conducted speech. Therefore, a long computation time was required in the application to real speech signal data. Furthermore, a time transition model (i.e., system equation) of the speech signals was needed for recursive estimation, and it had to be established for each speech signal in advance.

In this study, a method to detect speech signals is proposed by applying Bayesian estimation based on a posterior probability with observation data of air-conducted speech contaminated by surrounding background noise. In the proposed algorithm, by regarding the probability distribution with parameters based on the measurement of bone-conducted speech as a prior probability distribution, precise estimation of the speech signals can be achieved. Though the bone-conducted speech is a kind of solid propagation sound that is less affected by the surrounding noise, the high-frequency components of the signal are damped through the propagation process [11]. On the other hand, the air-conducted speech contains all frequency components, though the signal is strongly affected by the surrounding noise. Therefore, by jointly using both air- and bone-conducted speech, more accurate estimates of the speech signals can be expected while recovering the high-frequency components of the speech signals even in very noisy circumstances.

The algorithm derived in this study does not require any time transition model for speech signals, and can be applied to speech signals with arbitrary fluctuation forms. Furthermore, since only the correlation information between the speech signals and the observation of air-conducted speech is utilized in the proposed method, the estimation algorithm can be simplified, and online processing can be expected owing to the large reduction in computation time. The effectiveness of the proposed method is confirmed by applying it to air- and bone-conducted speech measured in an anechoic room at Hiroshima Prefectural Technology Research Institute, in cooperation with Prefectural University of Hiroshima, in the presence of surrounding background noise.

2. Detection Method for Air- and Bone-Conducted Speeches

2.1. Stochastic Model for Air- and Bone-Conducted Speeches

In the actual environment with surrounding noise, let $x_k$, $y_k$ and $z_k$ be the original speech signal and the observations of the air- and bone-conducted speech signals, respectively, at a discrete time $k$. The observation $y_k$ is contaminated by a surrounding background noise $v_k$. According to the additive property of sound pressure, the following relationship can be established:

$$y_k = x_k + v_k, \tag{1}$$

where the statistics of $v_k$ are assumed to be known.

In order to express the relationship between the original speech signal and the bone-conducted speech, the correlation information between $x_k$ and $z_k$ is necessary in general. However, it is difficult to find this information in advance because $x_k$ is an unknown signal to be estimated. In this study, a conditional probability distribution function in an orthogonal expansion series is adopted as the relationship between $x_k$ and $z_k$:

$$P(x_k \mid z_k) = \frac{P(x_k, z_k)}{P(z_k)} = P(x_k) \sum_{r=0}^{\infty} \sum_{s=0}^{\infty} A_{rs}\, \theta_r^{(1)}(x_k)\, \theta_s^{(2)}(z_k) \tag{2}$$

with

$$A_{rs} \equiv \left\langle \theta_r^{(1)}(x_k)\, \theta_s^{(2)}(z_k) \right\rangle, \tag{3}$$

where $\langle \cdot \rangle$ denotes the averaging operation on variables. The linear and nonlinear correlations between $x_k$ and $z_k$ are reflected hierarchically in each expansion coefficient $A_{rs}$. From the definition (3), the expansion coefficients satisfy the following conditions:

$$A_{00} = 1, \quad A_{r0} = A_{0s} = 0, \quad (r, s \ge 1). \tag{4}$$

Functions $\theta_r^{(1)}(x_k)$ and $\theta_s^{(2)}(z_k)$ are orthonormal polynomials having weighting functions $P(x_k)$ and $P(z_k)$ respectively, and can be composed as follows:

$$\theta_r^{(1)}(x_k) = \sum_{i=0}^{r} \lambda_{ri}^{(1)} x_k^{i}, \quad \theta_s^{(2)}(z_k) = \sum_{i=0}^{s} \lambda_{si}^{(2)} z_k^{i}, \tag{5}$$

where $\lambda_{ri}^{(1)}$ and $\lambda_{si}^{(2)}$ are coefficients calculated by using Schmidt's orthogonalization algorithm [12]. The expansion coefficients $A_{rs}$ with orders $r \le R$, $s \le S$ can be obtained from the correlation information between the speech signal $x_k$ and the bone-conducted speech $z_k$. Since the speech signal is unknown in the presence of noise, these coefficients have to be estimated on the basis of the observation $y_k$. Let us regard the expansion coefficients $A_{rs}$ as an unknown parameter vector $\mathbf{a}$:

$$
\begin{aligned}
\mathbf{a} &\equiv (a_{11}, \ldots, a_{R1}, a_{12}, \ldots, a_{R2}, \ldots, a_{1S}, \ldots, a_{RS})', \\
a_{rs} &\equiv A_{rs}, \quad (r = 1, 2, \ldots, R;\ s = 1, 2, \ldots, S),
\end{aligned}
\tag{6}
$$

where $'$ denotes the transpose of a matrix, and $RS$ is the number of unknown parameters to be estimated. Then a simple dynamical model

$$\mathbf{a}_{k+1} = \mathbf{a}_k \tag{7}$$

is introduced for the simultaneous estimation of the parameter and the clean speech signal $x_k$.
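The construction in (3)-(6) can be illustrated numerically. The following is a minimal sketch, assuming paired samples of $x_k$ and $z_k$ are available (in the paper $x_k$ is unobservable, so the coefficients are estimated recursively instead). The function names `orthonormal_poly_coeffs` and `expansion_coeffs` and the synthetic signals are hypothetical, and sample averages stand in for $\langle \cdot \rangle$.

```python
import numpy as np

def orthonormal_poly_coeffs(samples, max_order):
    """Gram-Schmidt on the monomials 1, x, x^2, ... under the sample average.

    Returns lam with lam[r, i] playing the role of lambda_{ri} in (5), so that
    theta_r(x) = sum_i lam[r, i] * x**i and <theta_r theta_q> = delta_{rq}.
    """
    samples = np.asarray(samples, dtype=float)
    powers = np.vander(samples, max_order + 1, increasing=True)  # column i holds x**i
    lam = np.zeros((max_order + 1, max_order + 1))
    for r in range(max_order + 1):
        coeff = np.zeros(max_order + 1)
        coeff[r] = 1.0                        # start from the monomial x**r
        for q in range(r):                    # remove components along lower orders
            proj = np.mean((powers @ coeff) * (powers @ lam[q]))
            coeff = coeff - proj * lam[q]
        norm = np.sqrt(np.mean((powers @ coeff) ** 2))
        lam[r] = coeff / norm
    return lam

def expansion_coeffs(x, z, R, S):
    """Sample estimate of A[r, s] = <theta_r(x) theta_s(z)> as in (3)."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    theta1 = np.vander(x, R + 1, increasing=True) @ orthonormal_poly_coeffs(x, R).T
    theta2 = np.vander(z, S + 1, increasing=True) @ orthonormal_poly_coeffs(z, S).T
    return theta1.T @ theta2 / len(x)

rng = np.random.default_rng(0)
x = rng.standard_normal(20_000)                   # synthetic stand-in for clean speech
z = 0.8 * x + 0.3 * rng.standard_normal(20_000)   # synthetic stand-in for bone-conducted speech
A = expansion_coeffs(x, z, R=2, S=2)
a = A[1:, 1:].flatten(order="F")                  # (a11, a21, a12, a22): ordering of (6)
```

In this sketch the first row and column of `A` reproduce the conditions (4) up to rounding, and `A[1, 1]` is the ordinary correlation coefficient between the two signals.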

2.2. Derivation of Speech Signal Detection Algorithm Based on Bayesian Estimation

To derive an estimation algorithm for the speech signal $x_k$, we place our basis on Bayes' theorem for the conditional probability distribution [13]. Since the parameter $\mathbf{a}_k$ is also unknown, the conditional probability distribution of $x_k$ and $\mathbf{a}_k$ is expressed by

$$P(x_k, \mathbf{a}_k \mid Y_k) = \frac{P(x_k, \mathbf{a}_k, y_k \mid Y_{k-1})}{P(y_k \mid Y_{k-1})}, \tag{8}$$

where $Y_k\ (= \{y_1, y_2, \ldots, y_k\})$ is a set of air-conducted speech data up to time $k$. By expanding the conditional joint probability distribution $P(x_k, \mathbf{a}_k, y_k \mid Y_{k-1})$ in a statistical orthogonal expansion series on the basis of well-known standard probability distributions, which describe the dominant part of the actual fluctuation, the following expression is derived:

$$P(x_k, \mathbf{a}_k \mid Y_k) = \frac{P_0(x_k \mid Y_{k-1})\, P_0(\mathbf{a}_k \mid Y_{k-1}) \sum_{l=0}^{\infty} \sum_{\mathbf{m}=0}^{\infty} \sum_{n=0}^{\infty} B_{l\mathbf{m}n}\, \varphi_l^{(1)}(x_k)\, \varphi_{\mathbf{m}}^{(2)}(\mathbf{a}_k)\, \varphi_n^{(3)}(y_k)}{\sum_{n=0}^{\infty} B_{00n}\, \varphi_n^{(3)}(y_k)} \tag{9}$$

$$\left( \sum_{\mathbf{m}=0}^{\infty} \equiv \sum_{m_{11}=0}^{\infty} \cdots \sum_{m_{RS}=0}^{\infty}, \quad \mathbf{m} \equiv (m_{11}, \ldots, m_{RS}) \right)$$

with

$$B_{l\mathbf{m}n} \equiv \left\langle \varphi_l^{(1)}(x_k)\, \varphi_{\mathbf{m}}^{(2)}(\mathbf{a}_k)\, \varphi_n^{(3)}(y_k) \,\middle|\, Y_{k-1} \right\rangle. \tag{10}$$

The above three functions $\varphi_l^{(1)}(x_k)$, $\varphi_{\mathbf{m}}^{(2)}(\mathbf{a}_k)$ and $\varphi_n^{(3)}(y_k)$ are orthonormal polynomials of degrees $l$, $\mathbf{m}$ and $n$ with weighting functions $P_0(x_k \mid Y_{k-1})$, $P_0(\mathbf{a}_k \mid Y_{k-1})$ and $P_0(y_k \mid Y_{k-1})$.

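As the standard probability distributions for the speech signal, the parameters and the observation of the air-conducted speech, we adopt Gaussian distributions:

$$
\begin{aligned}
P_0(x_k \mid Y_{k-1}) &= N(x_k;\, x_k^*,\, \Gamma_{x_k}), \\
P_0(\mathbf{a}_k \mid Y_{k-1}) &= \prod_{r=1}^{R} \prod_{s=1}^{S} N(a_{rs,k};\, a_{rs,k}^*,\, \Gamma_{a_{rs,k}}), \\
P_0(y_k \mid Y_{k-1}) &= N(y_k;\, y_k^*,\, \Omega_k)
\end{aligned}
\tag{11}
$$

with

$$
\begin{aligned}
N(x;\, \mu,\, \sigma^2) &\equiv \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}, \\
x_k^* &\equiv \langle x_k \mid Y_{k-1} \rangle, \quad \Gamma_{x_k} \equiv \left\langle (x_k - x_k^*)^2 \mid Y_{k-1} \right\rangle, \\
a_{rs,k}^* &\equiv \langle a_{rs,k} \mid Y_{k-1} \rangle, \quad \Gamma_{a_{rs,k}} \equiv \left\langle (a_{rs,k} - a_{rs,k}^*)^2 \mid Y_{k-1} \right\rangle, \\
y_k^* &\equiv \langle y_k \mid Y_{k-1} \rangle, \quad \Omega_k \equiv \left\langle (y_k - y_k^*)^2 \mid Y_{k-1} \right\rangle.
\end{aligned}
\tag{12}
$$

The orthonormal polynomials with the three weighting probability distributions in (11) are then specified as

$$
\begin{aligned}
\varphi_l^{(1)}(x_k) &= \frac{1}{\sqrt{l!}}\, H_l\!\left( \frac{x_k - x_k^*}{\sqrt{\Gamma_{x_k}}} \right), \\
\varphi_{\mathbf{m}}^{(2)}(\mathbf{a}_k) &= \prod_{r=1}^{R} \prod_{s=1}^{S} \frac{1}{\sqrt{m_{rs}!}}\, H_{m_{rs}}\!\left( \frac{a_{rs,k} - a_{rs,k}^*}{\sqrt{\Gamma_{a_{rs,k}}}} \right), \\
\varphi_n^{(3)}(y_k) &= \frac{1}{\sqrt{n!}}\, H_n\!\left( \frac{y_k - y_k^*}{\sqrt{\Omega_k}} \right),
\end{aligned}
\tag{13}
$$

where $H_l(\cdot)$ denotes the $l$th-order Hermite polynomial [14]. The non-Gaussian properties of the speech signal and of the observations of the air-conducted speech are reflected in each expansion coefficient $B_{l\mathbf{m}n}$.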
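The normalized polynomials in (13) can be evaluated with standard library routines. The sketch below assumes that $H_l$ is the probabilists' Hermite polynomial (`HermiteE` in NumPy), which is the family made orthonormal by the factor $1/\sqrt{l!}$ under a Gaussian weight. The helper name `phi` is hypothetical.

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

def phi(n, value, mean, var):
    """Normalized Hermite polynomial of order n for a Gaussian weight N(mean, var)."""
    u = (np.asarray(value, dtype=float) - mean) / np.sqrt(var)
    basis = np.zeros(n + 1)
    basis[n] = 1.0                          # select He_n in the HermiteE series
    return He.hermeval(u, basis) / np.sqrt(factorial(n))

# Orthonormality check by Gauss-HermiteE quadrature (weight exp(-u^2 / 2)).
nodes, weights = He.hermegauss(20)
weights = weights / np.sqrt(2.0 * np.pi)    # rescale the weight to the N(0, 1) density
gram = np.array([[np.sum(weights * phi(l, nodes, 0.0, 1.0) * phi(n, nodes, 0.0, 1.0))
                  for n in range(4)]
                 for l in range(4)])
```

Here `gram` approximates the matrix of inner products $\langle \varphi_l \varphi_n \rangle$ and should be close to the identity matrix.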

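Based on (9), the conditional-mean estimates of $x_k$ and $a_{rs,k}$ can be expressed as

$$\hat{x}_k \equiv \langle x_k \mid Y_k \rangle = \frac{\sum_{n=0}^{\infty} \left\{ B_{00n}\, C_{00}^{1,0} + B_{10n}\, C_{10}^{1,0} \right\} \varphi_n^{(3)}(y_k)}{\sum_{n=0}^{\infty} B_{00n}\, \varphi_n^{(3)}(y_k)}, \tag{14}$$

$$\hat{a}_{rs,k} \equiv \langle a_{rs,k} \mid Y_k \rangle = \frac{\sum_{n=0}^{\infty} \left\{ B_{00n}\, C_{00}^{0,1} + B_{01n}\, C_{01}^{0,1} \right\} \varphi_n^{(3)}(y_k)}{\sum_{n=0}^{\infty} B_{00n}\, \varphi_n^{(3)}(y_k)} \tag{15}$$

with

$$C_{00}^{1,0} = x_k^*, \quad C_{10}^{1,0} = \sqrt{\Gamma_{x_k}}, \quad C_{00}^{0,1} = a_{rs,k}^*, \quad C_{01}^{0,1} = \sqrt{\Gamma_{a_{rs,k}}}. \tag{16}$$

Furthermore, the estimate of the variance of $a_{rs,k}$ is derived as follows:

$$P_{a_{rs,k}} \equiv \left\langle (a_{rs,k} - \hat{a}_{rs,k})^2 \mid Y_k \right\rangle = \frac{\sum_{n=0}^{\infty} \left\{ B_{00n}\, C_{00}^{0,2} + B_{01n}\, C_{01}^{0,2} + B_{02n}\, C_{02}^{0,2} \right\} \varphi_n^{(3)}(y_k)}{\sum_{n=0}^{\infty} B_{00n}\, \varphi_n^{(3)}(y_k)} \tag{17}$$

with

$$C_{00}^{0,2} = \Gamma_{a_{rs,k}} + (a_{rs,k}^* - \hat{a}_{rs,k})^2, \quad C_{01}^{0,2} = 2\sqrt{\Gamma_{a_{rs,k}}}\,(a_{rs,k}^* - \hat{a}_{rs,k}), \quad C_{02}^{0,2} = \sqrt{2}\,\Gamma_{a_{rs,k}}. \tag{18}$$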
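The coefficients in (18) are the coordinates of $(a_{rs,k} - \hat{a}_{rs,k})^2$ on the orthonormal Hermite basis, namely $\Gamma + (a^* - \hat{a})^2$, $2\sqrt{\Gamma}\,(a^* - \hat{a})$ and $\sqrt{2}\,\Gamma$. As a sanity check, the following sketch recomputes them numerically, assuming probabilists' Hermite polynomials. The helper name `hermite_moment_coeffs` and the numerical values are hypothetical.

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

def hermite_moment_coeffs(poly_in_u):
    """Coordinates of a polynomial in u on the orthonormal basis He_n(u) / sqrt(n!)."""
    c = He.poly2herme(poly_in_u)            # coordinates on the unnormalized He_n
    return np.array([c[n] * np.sqrt(factorial(n)) for n in range(len(c))])

a_star, gamma_a, a_hat = 0.4, 0.09, 0.55    # hypothetical values, for illustration only
shift = a_star - a_hat

# (a - a_hat)^2 with a = a_star + sqrt(gamma_a) * u, written as a polynomial in u
poly = [shift**2, 2.0 * np.sqrt(gamma_a) * shift, gamma_a]
C = hermite_moment_coeffs(poly)

# Closed forms for the three coefficients of the variance expansion
expected = np.array([gamma_a + shift**2,
                     2.0 * np.sqrt(gamma_a) * shift,
                     np.sqrt(2.0) * gamma_a])
```

The arrays `C` and `expected` should agree to rounding error. The same routine applied to the polynomial `[a_star, np.sqrt(gamma_a)]` gives the two first-moment coefficients of the form used in (16).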

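Using the property of conditional expectation, (1), (2) and (7), the variables in (14) can be calculated as follows:

$$y_k^* = \langle x_k + v_k \mid Y_{k-1} \rangle = x_k^* + \bar{v}_k, \quad (\bar{v}_k \equiv \langle v_k \rangle), \tag{19}$$

$$\Omega_k = \left\langle (x_k + v_k - x_k^* - \bar{v}_k)^2 \mid Y_{k-1} \right\rangle = \Gamma_{x_k} + R_k, \quad \left( R_k \equiv \left\langle (v_k - \bar{v}_k)^2 \right\rangle \right), \tag{20}$$

$$x_k^* = \left\langle \int x_k\, P(x_k \mid z_k)\, dx_k \,\middle|\, Y_{k-1} \right\rangle = \left\langle \sum_{r=0}^{1} \sum_{s=0}^{\infty} d_{1r}\, A_{rs}\, \theta_s^{(2)}(z_k) \,\middle|\, Y_{k-1} \right\rangle = \sum_{r=0}^{1} \sum_{s=0}^{\infty} d_{1r}\, a_{rs,k}^*\, \theta_s^{(2)}(z_k), \tag{21}$$

$$\Gamma_{x_k} = \left\langle \int (x_k - x_k^*)^2\, P(x_k \mid z_k)\, dx_k \,\middle|\, Y_{k-1} \right\rangle = \left\langle \sum_{r=0}^{2} \sum_{s=0}^{\infty} d_{2r}\, A_{rs}\, \theta_s^{(2)}(z_k) \,\middle|\, Y_{k-1} \right\rangle = \sum_{r=0}^{2} \sum_{s=0}^{\infty} d_{2r}\, a_{rs,k}^*\, \theta_s^{(2)}(z_k), \tag{22}$$

$$a_{rs,k}^* = \langle a_{rs,k-1} \mid Y_{k-1} \rangle = \hat{a}_{rs,k-1}, \tag{23}$$

$$\Gamma_{a_{rs,k}} = \left\langle (a_{rs,k-1} - \hat{a}_{rs,k-1})^2 \mid Y_{k-1} \right\rangle = P_{a_{rs,k-1}}. \tag{24}$$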
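To make the roles of $y_k^*$ and $\Omega_k$ concrete, the sketch below implements (19) and (20) together with the estimate (14) truncated at $n \le 1$. It assumes that $x_k$ and $v_k$ are uncorrelated, so that $B_{000} = 1$, $B_{001} = B_{100} = 0$ and $B_{101} = \sqrt{\Gamma_{x_k}/\Omega_k}$, which reduces (14) to $\hat{x}_k = x_k^* + (\Gamma_{x_k}/\Omega_k)(y_k - y_k^*)$. This is only the lowest-order special case, not the full algorithm with higher-order terms, and the function names are hypothetical.

```python
def predict_observation(x_star, gamma_x, v_mean, v_var):
    """Equations (19) and (20): predicted mean y* and variance Omega of y_k."""
    return x_star + v_mean, gamma_x + v_var

def first_order_estimate(y, x_star, gamma_x, v_mean, v_var):
    """Estimate (14) truncated at n <= 1, assuming x_k and v_k are uncorrelated."""
    y_star, omega = predict_observation(x_star, gamma_x, v_mean, v_var)
    return x_star + (gamma_x / omega) * (y - y_star)

# Hypothetical numbers: prior mean 1.0 and variance 4.0, zero-mean noise of variance 1.0
x_hat = first_order_estimate(y=2.0, x_star=1.0, gamma_x=4.0, v_mean=0.0, v_var=1.0)
```

With these numbers the estimate moves 80% of the way from the prior mean towards the observation, since $\Gamma_{x_k}/\Omega_k = 4/5$. The higher-order terms of (14) correct this linear update for non-Gaussian fluctuations.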

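The coefficients $d_{1r}$ and $d_{2r}$ in (21) and (22) are determined in advance by expanding $x_k$ and $(x_k - x_k^*)^2$ in the orthogonal series of $\theta_r^{(1)}(x_k)$, as follows:

$$
\begin{aligned}
d_{10} &= -\lambda_{10}^{(1)} / \lambda_{11}^{(1)}, \quad d_{11} = 1 / \lambda_{11}^{(1)}, \\
d_{20} &= x_k^{*2} + \left( \lambda_{10}^{(1)} / \lambda_{11}^{(1)} \right) \left( 2 x_k^* + \lambda_{21}^{(1)} / \lambda_{22}^{(1)} \right) - \lambda_{20}^{(1)} / \lambda_{22}^{(1)}, \\
d_{21} &= -\left( 1 / \lambda_{11}^{(1)} \right) \left( 2 x_k^* + \lambda_{21}^{(1)} / \lambda_{22}^{(1)} \right), \quad d_{22} = 1 / \lambda_{22}^{(1)}.
\end{aligned}
\tag{25}
$$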
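The coefficients of (25) follow from inverting $\theta_1^{(1)}$ and $\theta_2^{(1)}$ for $x_k$ and $x_k^2$, with $\theta_0^{(1)} = 1$. A minimal sketch is given below. The function name `d_coefficients` and the array layout `lam[r][i]` for $\lambda_{ri}^{(1)}$ are assumptions, and the example uses the orthonormal polynomials of a standard Gaussian weight purely for illustration.

```python
import numpy as np

def d_coefficients(lam, x_star):
    """Coefficients d_{1r} and d_{2r}; lam[r][i] stands for lambda_{ri}^{(1)}."""
    d1 = np.array([-lam[1][0] / lam[1][1], 1.0 / lam[1][1]])
    slope = 2.0 * x_star + lam[2][1] / lam[2][2]
    d2 = np.array([
        x_star**2 + (lam[1][0] / lam[1][1]) * slope - lam[2][0] / lam[2][2],
        -slope / lam[1][1],
        1.0 / lam[2][2],
    ])
    return d1, d2

# Orthonormal polynomials of a N(0, 1) weight: 1, x, (x^2 - 1) / sqrt(2)
lam = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [-1.0 / np.sqrt(2.0), 0.0, 1.0 / np.sqrt(2.0)]])
d1, d2 = d_coefficients(lam, x_star=0.5)

# Reconstruct x and (x - x_star)^2 from the polynomials on a few test points
xs = np.linspace(-2.0, 3.0, 7)
theta = np.vander(xs, 3, increasing=True) @ lam.T   # theta[:, r] = theta_r(xs)
x_rebuilt = theta[:, :2] @ d1
sq_rebuilt = theta @ d2
```

In this example `x_rebuilt` equals `xs` and `sq_rebuilt` equals `(xs - 0.5)**2`, confirming the signs of the coefficients.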

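Furthermore, substituting (1) into (13) and using the addition theorem for Hermite polynomials:

$$\frac{(\xi_1^2 + \xi_2^2 + \cdots + \xi_\varsigma^2)^{n/2}}{n!}\, H_n\!\left( \frac{\xi_1 X_1 + \xi_2 X_2 + \cdots + \xi_\varsigma X_\varsigma}{\sqrt{\xi_1^2 + \xi_2^2 + \cdots + \xi_\varsigma^2}} \right) = \sum_{\eta_1 + \eta_2 + \cdots + \eta_\varsigma = n}\ \prod_{i=1}^{\varsigma} \frac{\xi_i^{\eta_i}}{\eta_i!}\, H_{\eta_i}(X_i), \tag{26}$$

the orthonormal polynomial $\varphi_n^{(3)}(y_k)$ can be expressed as follows:

$$
\begin{aligned}
\varphi_n^{(3)}(y_k) &= \frac{1}{\sqrt{n!}}\, H_n\!\left( \frac{\sqrt{\Gamma_{x_k}} \left( \dfrac{x_k - x_k^*}{\sqrt{\Gamma_{x_k}}} \right) + \sqrt{R_k} \left( \dfrac{v_k - \bar{v}_k}{\sqrt{R_k}} \right)}{\sqrt{\Gamma_{x_k} + R_k}} \right) \\
&= \frac{1}{\sqrt{n!}} \sum_{i=0}^{n} \binom{n}{i} \left( \frac{\Gamma_{x_k}}{\Omega_k} \right)^{\frac{n-i}{2}} \left( \frac{R_k}{\Omega_k} \right)^{\frac{i}{2}} H_{n-i}\!\left( \frac{x_k - x_k^*}{\sqrt{\Gamma_{x_k}}} \right) H_i\!\left( \frac{v_k - \bar{v}_k}{\sqrt{R_k}} \right).
\end{aligned}
\tag{27}
$$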
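The two-variable case of the addition theorem used in (27) is easy to verify numerically. The sketch below assumes probabilists' Hermite polynomials and compares both sides for a few orders. The helper names and test values are hypothetical.

```python
import numpy as np
from math import comb
from numpy.polynomial import hermite_e as He

def hermite(n, u):
    """Probabilists' Hermite polynomial He_n evaluated at u."""
    basis = np.zeros(n + 1)
    basis[n] = 1.0
    return He.hermeval(u, basis)

def split_hermite(n, xu, vu, gamma_x, noise_var):
    """Binomial sum on the right-hand side of (27), without the 1/sqrt(n!) factor."""
    omega = gamma_x + noise_var
    return sum(comb(n, i)
               * (gamma_x / omega) ** ((n - i) / 2)
               * (noise_var / omega) ** (i / 2)
               * hermite(n - i, xu) * hermite(i, vu)
               for i in range(n + 1))

gamma_x, noise_var = 2.0, 0.5       # hypothetical variances of x_k and v_k
xu, vu = 0.3, -1.2                  # normalized deviations of x_k and v_k (arbitrary)
yu = (np.sqrt(gamma_x) * xu + np.sqrt(noise_var) * vu) / np.sqrt(gamma_x + noise_var)

# Largest mismatch between H_n(yu) and the binomial sum for n = 0..5
max_gap = max(abs(hermite(n, yu) - split_hermite(n, xu, vu, gamma_x, noise_var))
              for n in range(6))
```

The value of `max_gap` should be at the level of rounding error. The practical benefit of the split is that the averages over the speech signal and over the noise can then be taken separately.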

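Therefore, using (2) and (27), the expansion coefficient $B_{l\mathbf{m}n}$ defined by (10) can be calculated as follows:

$$
\begin{aligned}
B_{l\mathbf{m}n} &= \frac{1}{\sqrt{n!}} \sum_{i=0}^{n} \binom{n}{i} \left( \frac{\Gamma_{x_k}}{\Omega_k} \right)^{\frac{n-i}{2}} \left( \frac{R_k}{\Omega_k} \right)^{\frac{i}{2}} \left\langle \varphi_{\mathbf{m}}^{(2)}(\mathbf{a}_k) \int \varphi_l^{(1)}(x_k)\, H_{n-i}\!\left( \frac{x_k - x_k^*}{\sqrt{\Gamma_{x_k}}} \right) P(x_k \mid z_k)\, dx_k \,\middle|\, Y_{k-1} \right\rangle \left\langle H_i\!\left( \frac{v_k - \bar{v}_k}{\sqrt{R_k}} \right) \right\rangle \\
&= \frac{1}{\sqrt{n!}} \sum_{i=0}^{n} \binom{n}{i} \left( \frac{\Gamma_{x_k}}{\Omega_k} \right)^{\frac{n-i}{2}} \left( \frac{R_k}{\Omega_k} \right)^{\frac{i}{2}} \left\langle \varphi_{\mathbf{m}}^{(2)}(\mathbf{a}_k) \sum_{r=0}^{l+n-i} \sum_{s=0}^{\infty} d_{l+n-i,r}\, A_{rs}\, \theta_s^{(2)}(z_k) \,\middle|\, Y_{k-1} \right\rangle \left\langle H_i\!\left( \frac{v_k - \bar{v}_k}{\sqrt{R_k}} \right) \right\rangle \\
&= \sum_{i=0}^{n} \frac{\sqrt{n!}}{i!\,(n-i)!} \left( \frac{\Gamma_{x_k}}{\Omega_k} \right)^{\frac{n-i}{2}} \left( \frac{R_k}{\Omega_k} \right)^{\frac{i}{2}} \sum_{r=0}^{l+n-i} \sum_{s=0}^{\infty} d_{l+n-i,r} \left\langle \varphi_{\mathbf{m}}^{(2)}(\mathbf{a}_k)\, a_{rs,k} \,\middle|\, Y_{k-1} \right\rangle \left\langle H_i\!\left( \frac{v_k - \bar{v}_k}{\sqrt{R_k}} \right) \right\rangle \theta_s^{(2)}(z_k),
\end{aligned}
\tag{28}
$$

where $d_{l+n-i,r}$ is the appropriate coefficient that satisfies the following equality:

$$\varphi_l^{(1)}(x_k)\, H_{n-i}\!\left( \frac{x_k - x_k^*}{\sqrt{\Gamma_{x_k}}} \right) = \sum_{j=0}^{l+n-i} d_{l+n-i,j}\, \theta_j^{(1)}(x_k). \tag{29}$$

From (19)-(22) and (28), the variables $y_k^*$, $\Omega_k$ and the expansion coefficient $B_{l\mathbf{m}n}$ in the estimation algorithms (14)-(18) are given by the measurement data of the bone-conducted speech $z_k$, the estimates of the parameter $a_{rs}$ at the discrete time $k-1$ and the statistics of the surrounding noise $v_k$. Therefore, the estimation of the speech signal can be performed in a recursive way by observing the air-conducted speech $y_k$.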

The flow chart of the proposed speech signal detection algorithm is illustrated in Figure 1. As compared with the previously reported algorithm [10], time transition model for the speech signal is not required in the proposed algorithm and the calculation process of the algorithm can be fairly simplified.

Figure 1. Flow chart of the proposed signal detection algorithm.

3. Application to Real Speech Signal

In order to confirm the effectiveness of the proposed signal detection algorithm, it was applied to real speech signals. The speech signal data were measured in the anechoic chamber in the acoustic laboratory building of the West Region Industrial Research Centre, Hiroshima Prefectural Technology Research Institute. For male and female speech signals digitized with a sampling frequency of 10 kHz and 16-bit quantization, we estimated the speech signal based on the observation corrupted by additive noise. More specifically, we created noisy air-conducted speech on a computer by mixing the original air-conducted speech signal measured in a noise-free environment with machine noise recorded in advance, as an example of actual surrounding noise. By setting the amplitude (i.e., the mean squared value of the instantaneous signal) of the noise to 1, 2, 3, 4, 5 and 10 times that of the noise-free speech signals, we applied the proposed algorithm to extremely difficult situations with low SNR. The bone-conducted speech was measured simultaneously with the air-conducted speech by using an acceleration sensor. The noise-free air-conducted male speech signal and the noisy air-conducted speech observation created by using machine noise with the same amplitude as the noise-free speech signal are shown in Figure 2 and Figure 3, and the observed wave of the bone-conducted speech is shown in Figure 4. Similarly, for the female speech signal, the noise-free air-conducted speech signal, the noisy air-conducted speech observation and the bone-conducted speech are shown in Figures 5-7, respectively.
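The mixing step described above can be sketched as follows, assuming the noise is scaled so that its mean squared value is a given multiple of that of the clean speech. The function name `mix_noise` is hypothetical, and the sinusoid and white noise are synthetic stand-ins rather than the recorded speech and machine noise.

```python
import numpy as np

def mix_noise(speech, noise, power_ratio):
    """Add noise scaled so that mean-square(noise) = power_ratio * mean-square(speech).

    The noise record is assumed to be at least as long as the speech record.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    gain = np.sqrt(power_ratio * np.mean(speech**2) / np.mean(noise**2))
    return speech + gain * noise

rng = np.random.default_rng(0)
fs = 10_000                                   # sampling frequency from the text, in Hz
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 200 * t)          # synthetic stand-in, not real speech
noise = rng.standard_normal(fs)               # synthetic stand-in for machine noise

noisy = {ratio: mix_noise(speech, noise, ratio) for ratio in (1, 2, 3, 4, 5, 10)}
snr_db = {ratio: -10.0 * np.log10(ratio) for ratio in noisy}
```

Under this reading of the mixing rule, the six settings correspond to SNRs between 0 dB (ratio 1) and -10 dB (ratio 10).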

Figure 2. Noise-free male speech signal.

Figure 3. Noisy air-conducted speech observation by using machine noise with the same amplitude as the noise-free male speech signal.

Figure 4. The observed wave of the bone-conducted male speech.

Figure 5. Noise-free female speech signal.

Figure 6. Noisy air-conducted speech observation by using machine noise with the same amplitude as the noise-free female speech signal.

Figure 7. The observed wave of the bone-conducted female speech.

The estimated results by using the algorithm based on (14)-(18) are shown in Figure 8 for the male speech signal and in Figure 9 for the female speech signal. For comparison, the estimated results of the male and female speech signals by using the estimation algorithm based on only the observation of air-conducted speech are shown in Figure 10 and Figure 11.

Furthermore, the estimated results by the previously reported method [10] are shown in Figure 12 for the male speech signal and in Figure 13 for the female speech signal.

By comparing Figure 8, Figure 10 and Figure 12 with the original male speech signal shown in Figure 2, and comparing Figure 9, Figure 11, Figure 13 with Figure 5, it is obvious that the proposed method can suppress the effects of real machine noise better than the method based on observation of only air-conducted speech and the previously reported method.

Figure 8. Estimated male speech signal by use of the proposed method.

Figure 9. Estimated female speech signal by use of the proposed method.

Figure 10. Estimated male speech signal by use of the method based on only the observation of air-conducted speech.

Figure 11. Estimated female speech signal by use of the method based on only the observation of air-conducted speech.

Figure 12. Estimated male speech signal by use of the previous method.

Figure 13. Estimated female speech signal by use of the previous method.

Table 1. Performance comparisons for a male speech signal contaminated by machine noise.

Table 2. Performance comparisons for a female speech signal contaminated by machine noise.

The estimation RMS (root mean square) error and the PEI (performance evaluation index) defined by

$$\text{RMSError} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} (x_k - \hat{x}_k)^2}, \tag{30}$$

$$\text{PEI} = 10 \log_{10} \left( \frac{\sum_{k=1}^{N} x_k^2}{\sum_{k=1}^{N} (x_k - \hat{x}_k)^2} \right) \ [\text{dB}] \tag{31}$$

are shown in Table 1 (the male speech signal) and Table 2 (the female speech signal).
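The two indices can be computed directly from the clean signal and its estimate. The sketch below assumes the RMS error is the square root of the mean squared error and the PEI is the ratio of signal energy to error energy in decibels. The function names and the toy data are hypothetical.

```python
import numpy as np

def rms_error(x, x_hat):
    """Root mean square of the estimation error x - x_hat."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return np.sqrt(np.mean((x - x_hat) ** 2))

def pei_db(x, x_hat):
    """Ratio of signal energy to error energy, in dB; larger is better."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return 10.0 * np.log10(np.sum(x**2) / np.sum((x - x_hat) ** 2))

# Toy data: every sample is off by 0.1 in magnitude
x_true = [1.0, -1.0, 1.0, -1.0]
x_est = [0.9, -1.1, 1.1, -0.9]
err = rms_error(x_true, x_est)
pei = pei_db(x_true, x_est)
```

For the toy data the error energy is 1% of the signal energy, so the RMS error is 0.1 and the PEI is 20 dB.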

Furthermore, the computation time of the proposed method was reduced by 39.3% relative to that of the previous method. These results show that the proposed method, which uses a simplified algorithm with the aid of bone-conducted speech, is more effective than both the compared method based on the observation of only air-conducted speech and the previous method, which uses a more complicated algorithm.

4. Conclusions

4.1. Novel Contribution

In this study, a new method to detect speech signals in the presence of surrounding noise has been proposed from the viewpoint of Bayesian estimation, by observing air-conducted speech with the aid of measurements of bone-conducted speech. Furthermore, experiments have shown that, in removing surrounding noise in a real noise environment, the proposed method is more effective than both the method based on the observation of only air-conducted speech and the previous, more complicated method.

4.2. Future Research

The proposed approach is quite different from the traditional standard techniques. However, it is still at an early stage of development, and a number of practical problems remain to be investigated. These include: 1) application to a diverse range of speech signals in actual noise environments; 2) extension to cases with multiple noise sources; 3) finding an optimal number of expansion terms for the expansion-based probability expression adopted; and 4) improvement of estimation precision by considering higher order statistics of the surrounding noise.

Acknowledgements

The authors are grateful to Mr. Daishi Takagi for his help during this study. This work was supported in part by the Grant-in-Aid for Scientific Research No. 19K04428 from the Ministry of Education, Culture, Sports, Science and Technology, Japan.

Cite this paper: Orimoto, H. , Ikuta, A. and Hasegawa, K. (2021) Speech Signal Detection Based on Bayesian Estimation by Observing Air-Conducted Speech under Existence of Surrounding Noise with the Aid of Bone-Conducted Speech. Intelligent Information Management, 13, 199-213. doi: 10.4236/iim.2021.134011.
References

[1]   Boll, S.F. (1979) Suppression of Acoustic Noise in Speech Using Spectral Subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27, 113-120.
https://doi.org/10.1109/TASSP.1979.1163209

[2]   Virag, N. (1999) Single Channel Speech Enhancement Based on Masking Properties of the Human Auditory System. IEEE Transactions on Speech and Audio Processing, 7, 126-137.
https://doi.org/10.1109/89.748118

[3]   Kaneda, Y. and Ohga, J. (1986) Adaptive Microphone-Array System for Noise Reduction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34, 1391-1400.
https://doi.org/10.1109/TASSP.1986.1164975

[4]   Kawamura, A., Fujii, K., Itoh, Y. and Fukui, Y. (2002) A Noise Reduction Method Based on Linear Prediction Analysis. IEICE Transactions on Fundamentals, J85-A, 415-423.

[5]   Kawamura, A., Fujii, K. and Itoh, Y. (2005) A Noise Reduction Method Based on Linear Prediction with Variable Step-Size. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E88-A, 855-861.
https://doi.org/10.1093/ietfec/e88-a.4.855

[6]   Gabrea, M., Grivel, E. and Najim, M. (1999) A Single Microphone Kalman Filter-Based Noise Canceller. IEEE Signal Processing Letters, 6, 55-57.
https://doi.org/10.1109/97.744623

[7]   Kim, W. and Ko, H. (2001) Noise Variance Estimation for Kalman Filtering of Noisy Speech. IEICE Transactions on Information and Systems, E84-D, 155-160.

[8]   Tanabe, N., Furukawa, T. and Tsuji, S. (2008) Robust Noise Suppression Algorithm with the Kalman Filter Theory for White and Colored Disturbance. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E91-A, 818-829.
https://doi.org/10.1093/ietfec/e91-a.3.818

[9]   Ikuta, A. and Orimoto, H. (2011) Adaptive Noise Suppression Algorithm for Speech Signal Based on Stochastic System Theory. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E94-A, 1618-1627.
https://doi.org/10.1587/transfun.E94.A.1618

[10]   Ikuta, A., Orimoto, H. and Gallagher, G. (2018) Noise Suppression Method by Jointly Using Bone- and Air-Conducted Speech Signals. Noise Control Engineering Journal, 66, 472-488.
https://doi.org/10.3397/1/376640

[11]   Shin, H.S., Kang, H.G. and Fingscheidt, T. (2012) Survey of Speech Enhancement Supported by a Bone Conduction Microphone. Proceedings of 10th ITG Conference on Speech Communication, Braunschweig, January 2012, 47-50.

[12]   Ohta, M. and Yamada. H. (1984) New Methodological Trials of Dynamical State Estimation for the Noise and Vibration Environmental System—Establishment of General Theory and Its Application to Urban Noise Problems. Acta Acustica United with Acustica, 55, 199-212.
https://www.ingentaconnect.com/content/dav/aaua/1984/00000055/00000004/art00003

[13]   Ikuta, A., Tokhi, M.O. and Ohta, M. (2011) A Cancellation Method of Background Noise for a Sound Environment System with Unknown Structure. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E84-A, 457-466.

[14]   Ohta, M. and Koizumi, T. (1968) General Statistical Treatment of the Response of a Nonlinear Rectifying Device to a Stationary Random Input (Corresp.). IEEE Transactions on Information Theory, 14, 595-598.
https://doi.org/10.1109/TIT.1968.1054178

 
 