Ascertaining Person Behind the Voice Seamlessly

Background

When I transact online today, I always need to authenticate myself, be it to read my mail, check my account balance or order a pair of shoes for my son. In the brick-and-mortar days, you visited an enterprise during fixed working hours, at a specified location, and interacted in person with a set of people representing the enterprise. The only formal authentication would be your signature or your thumb impression; informally, however, the authentication happened at a deeper level. For one, the person representing the enterprise actually got to look at your face and your body language and could spot if anything was amiss. The signature or the thumb impression on a piece of paper was more of a formal bookkeeping exercise, just in case a transaction had to be verified in the future.

When enterprises broke this barrier and allowed their customers to do anything at any time, everyone welcomed it with open arms. It was convenient: one did not have to travel or plan a visit during certain hours to do a transaction. But this came with one major problem: how is the person on the other end to be authenticated? On the web it started with choosing whatever you wanted as your password, but very soon several restrictions were placed on what your password could be: it cannot be a dictionary word, it must use at least one number, it must use a capital letter plus a number, it must be at least 8 characters long. All this was to make sure that no one else could impersonate you and transact as you. When people in turn asked whether they were actually transacting with the enterprise they intended to transact with, that gave rise to two-factor authentication and so on. Strong, unambiguous, robust authentication was required because physical presence, one of the main aspects of a brick-and-mortar transaction, was missing.

As the digital space widened, people's presence in it increased, and people had to remember several passwords. They had two options: keep the same password everywhere and run the risk that one lost password would unlock their entire digital space, or keep different passwords and run the risk of forgetting them. What if one used something very specific to oneself as the password? Then there would be no forgotten password, because one would not need to remember it anyway. Step in biometrics. Biometrics became popular: when in doubt, the system popped up a question related to you, like your date of birth, your place of birth or your mother's maiden name. This kind of information was, and still is, used effectively to retrieve forgotten passwords.

I like to Talk

Several digital channels, such as telephone, email, web FAQ and SMS, allow a customer to interact with an enterprise. Some studies indicate that a majority of customers choose the telephone channel for customer care support. In particular, customers like to speak to a human agent (see Figure 1).

Figure 1. Customers Like to Speak to the Agent!

Further, it has been observed that customer satisfaction is higher for phone-based interaction (see Figure 2) than for the other existing interaction channels. While there could be several reasons for this, it is not hard to guess the likely ones:

The telephone channel is more personal.
The telephone channel enables live, real-time interaction with the agent at the other end.
Speech is the most natural mode of communication for sorting out problems.
It is a synchronous transaction.
It is the closest to a face-to-face interaction.

Figure 2. Customer Satisfaction

When you transact on the telephone, two things happen: you are doing your transaction live and in real time, and at the same time you are not physically present with the person serving you. So the person representing the enterprise has to ascertain that you are indeed the person you claim to be without actually getting to see you, and authentication becomes crucial. On the telephone channel, your voice is the only cue the agent can sense while interacting with you.

One of the easiest things to do, which several enterprises did and continue to do, was to ask for your date of birth, which you either spoke or entered as DTMF tones through the telephone keypad. The person on the other side heard your biometric information and compared it with the data available with the enterprise to ascertain that you were indeed who you claimed to be. While this was good, it was not sufficiently robust: what if someone got to know your date of birth and used it to impersonate you and gain access to your information? Additionally, enterprises started looking at technology that allowed their customers to help themselves rather than speak to a human agent. The availability of technology and the cost implications were probably the two main reasons for going the self-help way.

Help Yourself

Self help allowed you to interact with a machine, except that the machine drove the interaction. These were popularly called IVR-based self-help systems. The system said “Please choose an option. Press “1” for ... Press “2” for ...” and responded to key presses; when it needed to authenticate you, it asked you to use the telephone keypad to enter your date of birth in a fixed format such as DDMMYY.

However, this interaction felt unreal without a human touch; I guess we missed our natural instinct to speak to resolve a problem! Eventually you found that there was a Press “9” or a Press “*” to talk to a customer representative.

What started off, from the enterprise perspective, as a way to let you help yourself and thereby lower the cost of having an actual human talk to you failed to take off. When people had to use self-service because of a long wait in the queue to speak to an agent, they did so grudgingly, and customer satisfaction plummeted.

Although the self-help channel still existed, the conversation with the call center agent was more popular. Enterprises ramped up their infrastructure and tried to find ways to reduce transaction time so that more customers could be serviced in the same amount of time. Could speech recognition technology help?

Talk and Help Yourself

Speech recognition technology made it possible for you to speak to do your transaction. The machine interacting with you still drove the interaction, but it allowed you to speak rather than press DTMF keys on the keypad. So instead of “Press 1 for ... Press 2 for ...” you were now asked “Speak /cheque/ to order a fresh cheque book. Speak /account balance/ to know the balance in your account”. This was far better than the Press “1” for ... kind of self help.

One very important prerequisite for this kind of voice-based self help was the ability to authenticate the speaker based on his or her voice. Enter speech biometrics.

What is in What I Speak

There are essentially three components embedded in what we speak. When I say /good morning you are reading my article/, there are three things that the listener can sense. The first is the content of the speech, namely the text that I spoke (“Good Morning. You are reading my article”).

This is the linguistic information, which is primarily what is said. It also reveals the language in which it was spoken (English in this case) and gives an idea of what was said (the equivalent of speaking a written text).

In addition, speech carries non-linguistic information such as who spoke (me in this case!), how I spoke (hopefully pleasantly!), my emotional state when I spoke, the gender of the speaker, the accent and the clarity of articulation. Some of these are added purposely by the speaker and hence are controllable, while some are inherent to the speaker and cannot be controlled at will.

Speech biometrics exploits, to a large extent, all the non-linguistic information in spoken speech to ascertain the identity of the speaker. There are essentially two broad ways in which speech biometrics is used: speaker verification and speaker identification.

Speaker verification is the process where the speaker claims to be X and the system uses an a priori model of the claimed person to verify the genuineness of the claim. The system measures the closeness of the speaker's speech to the model of speech associated with the claimed identity; if the speech is close enough (based on a preset threshold) it validates the identity, otherwise it denies validation (treating the attempt as a forgery). The decision in a speaker verification system is therefore binary, YES or NO, and the choice of threshold is determined by the type of application. There are two conflicting measures that have to be kept low for a speaker verification system to succeed, namely the false acceptance ratio (FAR) and the false rejection ratio (FRR). FAR is the percentage of forged attempts that the system validates, while FRR is the percentage of genuine attempts that the system fails to validate. The choice of threshold has opposite effects on FAR and FRR: a threshold can minimize FAR or FRR but not both simultaneously, so a threshold that lowers FAR raises FRR and vice versa. A very secure system would choose a threshold with zero tolerance for false acceptance, at the cost of an increased FRR (rejecting more genuine speakers).
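
The trade-off between FAR and FRR can be seen with a few lines of Python. This is only a minimal sketch: the scores are made up, the function name far_frr is hypothetical, and it assumes the verification engine returns a similarity score where higher means closer to the claimed speaker's model.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) as fractions for one acceptance threshold.

    A claim is accepted when its score is >= threshold.
    """
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


# Made-up similarity scores in [0, 1], for illustration only.
genuine = [0.91, 0.84, 0.77, 0.69, 0.88, 0.58]   # true speaker attempts
impostor = [0.35, 0.52, 0.61, 0.44, 0.72, 0.29]  # forged attempts

for threshold in (0.5, 0.65, 0.8):
    far, frr = far_frr(genuine, impostor, threshold)
    print(f"threshold={threshold:.2f}  FAR={far:.2f}  FRR={frr:.2f}")
```

With these sample scores, raising the threshold drives FAR down while FRR climbs, which is the tension an application has to settle when it picks its operating point.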

A quick browse of Wikipedia tells us that speaker verification systems come in various forms, based on the amount of security expected from the system. Broadly, we can group them into four categories: (a) a Fixed Phrase system, where a pre-determined phrase is used for verification; (b) a Fixed Vocabulary system, which is more flexible and practical: training and testing material for a speaker is generated from the words of a fixed vocabulary, and the user speaks words or phrases from this vocabulary both at enrollment (or training) and at testing; (c) a Flexible Vocabulary system, which uses a general set of sub-word phone models created during speaker model training; and (d) a Text-Independent system, where the user is not constrained to say fixed or prompted phrases but is free to speak anything, and the system works by extracting the speaker characteristics from the spoken utterance. Clearly, both complexity and security increase as we go from fixed phrase to text-independent.

On the other hand, speaker identification is the process of identifying the person behind the voice without the person specifying who (s)he is. For example, in a banking scenario, if the bank's customer base were N unique people, then when a user calls, a speaker identification platform would first have to determine whether the speaker is one of the N people it knows and, if so, identify which of the N is speaking. This is essentially a problem of choosing one among N+1 possible identities: N for the known people and 1 for “not among the N people”!
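
The N+1 decision can be sketched as follows. This is an illustrative assumption rather than a description of any particular product: the function identify_speaker, the names and the scores are hypothetical, and it presumes that the caller's speech has already been scored against each of the N enrolled speaker models.

```python
def identify_speaker(scores, threshold):
    """Choose one of N+1 identities.

    scores maps each enrolled speaker to a similarity score (higher is closer).
    Returns the best-matching speaker, or None for "not among the N people".
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


# Made-up scores of one caller against N = 3 enrolled customers.
caller_scores = {"asha": 0.42, "ravi": 0.81, "meera": 0.57}

print(identify_speaker(caller_scores, threshold=0.7))  # best match clears the bar
print(identify_speaker(caller_scores, threshold=0.9))  # nobody is close enough
```

The extra identity is what makes the problem harder than verification: the system must both rank the N candidates and decide whether even the best of them is close enough.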

Clearly, speaker identification is technologically more challenging than speaker verification; as a result, most forward-looking enterprises are considering only speaker verification systems.

Using Speech Biometrics

The services industry has grown significantly in recent times. With this growth has come the need to service a large number of clients, which is very important on the one hand but a significant cost to the industry on the other. Many service providers have tried mechanisms that minimize the cost of providing a channel for customers to interact with them without denting customer satisfaction. Automated self-help solutions are the front runners, and the telephone (IVR) channel is the most sought-after medium of interaction.

Enterprises can provide personalized and accurate information if they know whom they are servicing, and one non-intrusive method of identifying the person on the telephone channel is through spoken speech. In this context speech biometrics becomes very important.

The speech signal is gaining significance as a biometric in recent times. Speech is probably the most non-intrusive and natural biometric for verifying the identity of a person. However, the use of speech biometrics, even in the speaker verification mode, is plagued by the well-known fact that it is not a hundred percent right a hundred percent of the time (specifically when speech is the only cue). So there has been a certain reluctance to use it in practice. There are nevertheless several commercial speech biometric solutions in the market that claim usable verification accuracies; in almost all cases they use more than the speech cue to verify the identity of the speaker. For example, one important additional cue is the requirement that the user authenticate himself only from his registered mobile phone number.

Irrespective of the accuracy of speaker verification based on the speech cue, most implementations today speak of replacing the existing authentication step with a speech biometric system. For example, today when I call a service provider, after the initial welcome message (“Welcome to ..., How may I help you today?”) the human agent poses a simple-looking question (“Am I speaking to ...? Can I know your date of birth please?”), and my answer is what is used to verify my identity.

In a self-help scenario, the authentication is much more direct. Sensing my registered mobile number, the system knows who I am supposed to be and blurts out “Please speak your password”. I speak my password, most probably one that I registered with my service provider. If I am verified, I am allowed to do the rest of the transaction; otherwise I am asked to leave or, if the service provider is courteous, I am put in a queue to be helped by a human agent.

While there is nothing wrong with using speech biometrics as a replacement for authentication by a human, it clearly causes customer dissatisfaction: imagine a machine refusing to identify me and then putting me in a queue to speak to an agent. If this happens several times (two, to be exact!), I will not opt for anything other than speaking to the agent!

The question is: can I use speech biometrics, with its known problem of not being accurate all the time, in a way that is transparent to the user, so that (s)he does not feel that (s)he is being verified at all?

Invisible Speech Biometrics

Several things happen when one uses speech biometrics in the form in which it is usually envisioned. It disturbs the flow of the interaction, and the authentication step stands out as a process of its own (eating away precious time just to authenticate the speaker). Additionally, (a) it is a point authentication, meaning the authentication happens at one given point in time (what if the phone changes hands after the initial authentication?), and (b) there is customer dissatisfaction if the speech biometric system fails to recognize a genuine user. Can we do better?

The author, in his ignorance, anticipates a use of speech biometrics that handles all the drawbacks of a regular speech biometric system (see Figure 3). The idea is continuous authentication from the beginning to the end of the call, thereby eliminating an explicit “authentication process”. See the “Time Jump” marked in Figure 3, which indicates that there is no need for an explicit authentication phase in the call. Continuous speaker verification in the background means that the process of authentication is invisible to the customer, and hence the chance of dissatisfaction does not arise. The remaining issue, the phone changing hands in the midst of the transaction, can also be detected.
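
As a rough illustration of the idea, and not the author's implementation, the sketch below assumes that some speaker verification engine produces a similarity score for each segment of the caller's speech during the call. The function name running_confidence, the smoothing factor, the 0.5 floor and the scores are all made up for the example.

```python
def running_confidence(segment_scores, smoothing=0.5):
    """Yield a smoothed confidence after each speech segment of a call.

    Each new score is blended with the confidence so far (exponential
    smoothing), so one noisy segment does not flip the decision at once.
    """
    confidence = None
    for score in segment_scores:
        if confidence is None:
            confidence = score
        else:
            confidence = smoothing * score + (1 - smoothing) * confidence
        yield confidence


# Made-up per-segment scores: the voice stops matching after the third segment,
# as it might if the phone changed hands in the middle of the call.
scores = [0.8, 0.9, 0.85, 0.2, 0.1]

for segment, confidence in enumerate(running_confidence(scores), start=1):
    status = "ok" if confidence >= 0.5 else "re-verify"
    print(f"segment {segment}: confidence={confidence:.3f} ({status})")
```

Because the confidence is updated throughout the call, there is no separate authentication step for the caller to notice, and a sustained drop in the score can flag a change of speaker.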

Figure 3. Continuous Authentication

Additionally, this invisible, continuous speaker verification means that the enterprise can still give me access to certain things even if it verifies me with low confidence. For example, if my transaction is just to check on a change of address that I had requested, the enterprise can give me the information even if it has not verified me with high confidence; but when I want to transfer some money, it might want to verify my identity with higher confidence. This gives the enterprise the scope not only to verify me just when it is required, but also to decide what degree of confidence is required for what kind of information.
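
One way to picture this is a table of required confidence per transaction type. The sketch below is purely illustrative: the transaction names, the threshold values and the function is_allowed are assumptions, and a real enterprise would set these according to its own risk policy.

```python
# Hypothetical minimum verification confidence for each kind of transaction.
REQUIRED_CONFIDENCE = {
    "check_address_change": 0.5,  # low-risk information
    "account_balance": 0.7,
    "transfer_money": 0.9,        # high-risk action
}


def is_allowed(transaction, confidence):
    """Return True if the current confidence is enough for the transaction."""
    required = REQUIRED_CONFIDENCE.get(transaction)
    if required is None:
        return False  # unknown transaction types are refused in this sketch
    return confidence >= required


current_confidence = 0.75  # made-up value from background speaker verification
for transaction in REQUIRED_CONFIDENCE:
    print(transaction, is_allowed(transaction, current_confidence))
```

In such a design, a caller verified with moderate confidence could still complete low-risk requests, while a money transfer would wait until the confidence rises or would fall back to some other check.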

Conclusion

Speech biometrics can really make inroads into actual use if it is deployed with an awareness of its known limitations and of the positive or negative impact it can have on the user. The speech and natural language team in TCS Innovation Labs – Mumbai is exploring how best to introduce speech biometrics into an already existing setup, so that the point authentication process can be avoided by continuously verifying the speaker. We believe continuous rather than point authentication is the way forward because it addresses the challenges that come with the current thought process in implementing speech biometrics in a call center or elsewhere.
