It is often hard to convey how difficult speech signal processing really is. This is thanks to the widespread use of speech recognition in the form of Alexa and Google Home, where most of us ask for very limited information ("call my mother", "play the top 50 international hits" or "switch off the lights"), which the speech recognition engine captures quite well with the help of contextual knowledge (it knows where you are, it knows your calendar, it knows your parents' phone number, it knows your preferences, it knows your Facebook likes, and so on).
Same Same - Different Different: You speak X = /My voice is my password/ and I speak Y = /My voice is my password/. In speech recognition both our speech samples (X and Y) need to be recognized as "My voice is my password", while in speaker biometrics X has to be attributed to you and Y has to be attributed to me!
In this blog post we try to show visually what it means to process speech.
Speech spoken by the same person, but at different times of the day and on different days, is very different. So a speech processing technique should be able to recognize all the different instances of the same word (in this case /clean speech/). The technique is tested further when the same word is spoken by different people with different accents, etc., as shown below.
Environmental noise is common, be it in the office (keyboards, telephone calls), at home (the TV is on) or electrical noise. So any speech processing should also work in noisy conditions. The figure below captures what it means to visualize an instance of spoken speech in the presence of varying noise. Note that the noise (relative to the speech signal) increases as we go from row 1 to row 3 in the figure below. Clearly, recognizing "clean speech" in the third row of the figure below is much tougher than recognizing "clean speech" as seen in the first row. Recognizing the same utterance spoken by different people (figure above) in the presence of noise is tougher still. Visualize the superimposition of the noise (figure below) on the variation that arises because different speakers are speaking (figure above).
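The phrase "noise relative to the speech signal" is usually quantified as a signal-to-noise ratio (SNR) in decibels. The following is a minimal sketch, not the method used to produce the figures, of how a noisy version of an utterance could be generated at a chosen SNR. The function names `mix_at_snr` and `measured_snr_db`, the 200 Hz tone standing in for speech, the white noise standing in for the environment, and the SNR values of 20, 10 and 0 dB are all assumptions made for illustration.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so the mixture has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # The target noise power follows from SNR_dB = 10 * log10(P_speech / P_noise).
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise


def measured_snr_db(clean, noisy):
    """Estimate the SNR in dB by treating (noisy - clean) as the noise."""
    residual = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))


# Stand-in signals (assumptions): a 200 Hz tone for "speech", white noise for the room.
rate = 16000
t = np.arange(rate) / rate
speech = np.sin(2 * np.pi * 200 * t)
noise = np.random.default_rng(0).normal(size=rate)

# Three "rows" with increasing noise, that is, decreasing SNR.
rows = {snr: mix_at_snr(speech, noise, snr) for snr in (20, 10, 0)}
for snr, noisy in rows.items():
    print(snr, round(measured_snr_db(speech, noisy), 2))
```

At 0 dB the noise carries as much power as the speech itself, which is roughly the situation a recognizer faces in the hardest row.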
The same word (in this case "clean speech") is different when spoken by different people. The first row in the figure below captures what it might look like when processing speech from speakers of different genders and from children. Notably, the speech signal changes significantly for the elderly and for people with pathological conditions (for example, people affected by Parkinson's or any other neurological condition!).
Speech also changes both in intensity (captured by the size of "clean speech" in row 2 in the figure below) and with the emotional condition of the speaker. A representation of the same word ("clean speech") spoken with different emotions is captured in row 2 in the figure below. Speaking rate (the number of words we speak per minute) is another aspect of speech that poses challenges for speech processing (the third row in the figure below captures speaking rate variation). A good speech processing algorithm is able to recognize the word "clean speech" in all these different scenarios! Tough.
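Two of these variations, intensity and speaking rate, are easy to imitate on a sampled signal. The sketch below is only an illustration under stated assumptions: `change_intensity` and `change_rate_naive` are hypothetical helper names, the 200 Hz tone is a stand-in for a spoken word, and the rate change is done by plain linear-interpolation resampling. Real changes in speaking rate keep the pitch roughly the same, whereas this naive resampling shifts it, so treat it as a rough approximation.

```python
import numpy as np


def change_intensity(signal, gain_db):
    """Scale the amplitude by a gain in dB (positive is louder, negative is softer)."""
    return np.asarray(signal, dtype=float) * 10.0 ** (gain_db / 20.0)


def change_rate_naive(signal, rate_factor):
    """Resample by linear interpolation; rate_factor > 1 gives a shorter, faster signal.

    Note: this also shifts the pitch, unlike a genuine change in speaking rate.
    """
    signal = np.asarray(signal, dtype=float)
    new_len = int(round(len(signal) / rate_factor))
    old_positions = np.arange(len(signal))
    new_positions = np.linspace(0, len(signal) - 1, new_len)
    return np.interp(new_positions, old_positions, signal)


# Stand-in for one spoken word (assumption): one second of a 200 Hz tone at 16 kHz.
rate = 16000
t = np.arange(rate) / rate
word = np.sin(2 * np.pi * 200 * t)

louder = change_intensity(word, 6.0)    # about twice the amplitude
faster = change_rate_naive(word, 2.0)   # half the duration
slower = change_rate_naive(word, 0.5)   # twice the duration
print(len(word), len(faster), len(slower))
```

All three outputs still contain the same "word", yet they differ in length and amplitude, which is exactly what a recognizer has to be robust to.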
One really testing condition under which your speech processing needs to work is what is called cocktail party noise or babble noise (visually captured in the first row of the figure below). This condition often occurs at a party, when there is discussion in the air and one needs to fix attention on one speaker. The task of speech recognition would be to recognize "clean speech" in the midst of all the other people's "babble".
Another important kind of noise is reverberation (depicted in the second row of the figure above). The speech signal is contaminated by reflections of the spoken speech from the walls of the room. The task of a speech engine is to recognize "clean speech" in the midst of this noise, as seen in row two of the figure above.
These visual examples should give an idea of what it takes to recognize the word "clean speech" from a spoken utterance in different scenarios. One aspect is variability in the speech itself, due to different speakers, emotional states and gender. The second aspect is noise, be it environmental, babble or reverberation.
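Reverberation is commonly modelled as the dry speech signal convolved with the impulse response of the room. The sketch below assumes a toy impulse response, built from a direct path followed by exponentially decaying random reflections; it is not a measured room response. The names `synthetic_room_response` and `add_reverb`, the 0.5 second decay time, the 0.3 reflection scale and the short tone burst standing in for speech are all assumptions for illustration.

```python
import numpy as np


def synthetic_room_response(sample_rate, decay_seconds, seed=0):
    """Toy impulse response: direct path plus exponentially decaying random reflections."""
    n = int(sample_rate * decay_seconds)
    t = np.arange(n) / sample_rate
    rng = np.random.default_rng(seed)
    # Amplitude envelope falls by 60 dB (a factor of 1000) over decay_seconds.
    envelope = 10.0 ** (-3.0 * t / decay_seconds)
    response = 0.3 * rng.normal(size=n) * envelope
    response[0] = 1.0  # the direct sound reaches the microphone first
    return response


def add_reverb(speech, response):
    """Convolve the dry signal with the room response; the output gains a reverberant tail."""
    return np.convolve(np.asarray(speech, dtype=float), response)


# Stand-in for speech (assumption): a 50 ms tone burst followed by silence.
rate = 16000
dry = np.zeros(rate // 4)
burst = np.arange(800) / rate
dry[:800] = np.sin(2 * np.pi * 200 * burst)

response = synthetic_room_response(rate, decay_seconds=0.5)
wet = add_reverb(dry, response)

# The dry signal is silent after the burst, while the reverberant one is not.
print(np.sum(dry[800:] ** 2), np.sum(wet[800:] ** 2) > 0)
```

The energy that lingers after the burst is what smears one sound into the next and makes the reverberant signal harder to recognize.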