At best, a speech signal can be described as
indiaisthelargestdemocracywelcometoindia
and in reality it is
either
indiaesthelarzestdemocracyvelcometwondia or indiaes thelarzestdemo cracyvel cometw ondia
Those of you who have paid attention to the 40-character signal will be able to see that it is actually
India is the largest democracy. Welcome to India.
This, essentially, is the difference between the inputs seen in a text processing pipeline and those seen in a speech processing pipeline. For this reason, a common ask of the speech processing community is (Speech Recognition):
Can you give us "India is the largest democracy. Welcome to India." from the speech signal "indiaes thelarzestdemo cracyvel cometw ondia"?
so that we can go on with our text processing pipeline and do all the glittery stuff in Natural Language Processing.
A speech processing researcher is looking at "indiaes thelarzestdemo cracyvel cometw ondia" or "indiaesthelarzestdemocracyvelcometwondia" while a text processing researcher is looking at "India is the largest democracy. Welcome to India."
The differences (capital letters, punctuation, word separators, ...) are a boon, served on a silver platter, to a text processing engineer, while a speech engineer has to decipher them, with significant effort, from the speech signal.
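As a small illustration, here is a minimal Python sketch (the two strings from above stand in for the text and speech signals) of how much the word separators alone give away for free:

```python
text_signal = "India is the largest democracy. Welcome to India."
speech_like_signal = "indiaes thelarzestdemo cracyvel cometw ondia"

# Whitespace hands the text engineer real words for free...
print(text_signal.split())
# ['India', 'is', 'the', 'largest', 'democracy.', 'Welcome', 'to', 'India.']

# ...while the same split on the speech-like string yields non-words,
# because the "separators" fall in the wrong places.
print(speech_like_signal.split())
# ['indiaes', 'thelarzestdemo', 'cracyvel', 'cometw', 'ondia']
```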
As an example, just imagine how easy it is to count the number of times "India" occurs in a text signal versus trying to identify "India" in the speech signal "indiaes thelarzestdemo cracyvel cometw ondia" (keyword spotting).
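To make that concrete, a naive keyword-counting sketch in Python (again using the strings above as stand-ins for the two signals) shows why the text side is trivial and the speech side is not:

```python
text_signal = "India is the largest democracy. Welcome to India."
speech_like_signal = "indiaes thelarzestdemo cracyvel cometw ondia"

# On text, counting occurrences of "India" is a one-liner.
print(text_signal.count("India"))  # 2

# On the speech-like string, a naive search finds nothing...
print(speech_like_signal.count("India"))  # 0

# ...and even ignoring case finds only the "india" inside "indiaes";
# the second, garbled occurrence ("ondia") is missed entirely.
print(speech_like_signal.lower().count("india"))  # 1
```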
But a speech signal comes with a lot more information than just the linguistic content. It carries information about the speaker's gender (male, female), age (adult, child), identity (Sunil, ...), state (happy, neutral, stressed, ...), accent (Indian, Australian, ...), dialect, health (dysarthric, ...), ... which most often does not accompany the text signal "India is the largest democracy. Welcome to India."
However, for any transaction to happen between a machine and a human, the crucial step is extracting the linguistic content from the speech. For example, a robot in an airport answering questions posed by travelers requires the robot to convert speech into text perfectly. For this reason, the priority and focus in the speech community have been on exploiting only the linguistic content in the speech signal (Speech Recognition), and not so much on the other information that is abundantly and uniquely present only in the speech signal.
Consider a robot that answers questions about the Bangalore airport in India. The first requirement is the ability to understand what the human is "asking", namely to convert the speech signal into text.
For this reason, it is natural that concepts and algorithms that pass the test of performance on text strings (signals) are adopted (given a try!) in speech signal processing.
For those of you who are curious, and for a visual feel of why the speech signal is complex, please see complexity of speech.