Background
When I transact online
today, there is always a need for me to authenticate myself, be it to
see my mail, check my account balance, or order a pair of shoes for my
son. In the brick-and-mortar days, you visited an enterprise during
fixed working hours, at a specified location, and interacted in person
with a set of people representing the enterprise. The only formal
authentication would be your signature or your thumb impression;
informally, however, the authentication happened at a deeper level. For
one, the person representing the enterprise actually got to look at
your face and your body language and could spot if anything was amiss.
The signature or the thumb impression on a piece of paper was more of a
formal bookkeeping exercise, just in case it was required in the future
to verify a transaction.
When enterprises broke
these barriers and allowed their customers to do anything, anytime, it
was welcomed with open arms by everyone. It was convenient; one did not
have to travel, and did not have to plan a visit during certain hours
to do a transaction. But this came with one major problem: how is the
person on the other end to be authenticated? On the web it started with
"choose whatever you want as your password", but very soon several
restrictions were placed on what your password could be: it cannot be a
dictionary word, use at least one number, use a capital letter plus a
number, at least 8 characters. All this to make sure no one else
impersonated you and tried to transact as you. When people questioned
whether they were actually transacting with the enterprise they
intended to transact with, it gave rise to two-factor authentication
and so on. Strong, unambiguous, robust authentication was required
because there was no physical presence, which had been one of the main
aspects of a brick-and-mortar transaction.
As the digital space
widened, people's presence in the digital space increased. People had
to remember several passwords. They had two options: keep the same
password everywhere and accept the risk that one lost password would
unlock their entire digital space, or have a different password for
each service and stand the risk of forgetting passwords. What if one
used something very specific to oneself as the password? Then there
would be no such thing as a forgotten password, because one would not
need to remember it anyway. Step in biometrics. Biometrics became
popular; when in doubt, one was asked a question related to oneself,
like date of birth, place of birth, or mother's maiden name, and such
personal cues were and still are used effectively to retrieve our
forgotten passwords.
I Like to Talk
Though there are several
digital channels, like telephone, email, web FAQ, and SMS, that allow
a customer to interact with an enterprise, some studies indicate that
a majority of customers choose the telephone channel for customer care
support. In particular, customers like to speak to a human agent (see
Figure 1 below).
Figure 1. Customers Like to Speak to the Agent!
Further, it has been
observed that customer satisfaction is higher for phone-based
interactions (see Figure 2) compared to other interaction channels.
While there could be several reasons for this, it is not hard to guess
the likely ones:
- The telephone channel is more personal.
- The telephone channel enables live, real-time interaction with the agent at the other end.
- Speech is the most natural mode of communication for sorting out problems.
- It is a synchronous transaction.
- It is the closest thing to a face-to-face interaction.
Figure 2. Customer Satisfaction
When you transact on the
telephone, two things happen: you are doing your transaction live and
in real time, and at the same time you are not physically present with
the person serving you. So the person representing the enterprise has
to ascertain that you are indeed the person you claim to be without
actually getting to see you; authentication becomes crucial. The
telephone channel means that your voice is the only cue the agent can
sense while interacting with you.
One of the easiest things
to do, which was done by several enterprises and continues to be done,
was to ask for your date of birth, which you either spoke or entered
through DTMF on the telephone keypad. The person on the other side
heard this information and compared it with the data available with
the enterprise to ascertain that you are indeed who you claim to be.
While this was good, it was not sufficiently robust: what if people
somehow got to know your date of birth and used it to impersonate you
and gain access to your information? Additionally, enterprises started
looking at technology that allowed their customers to help themselves
rather than speak to a human agent. The availability of the technology
and the cost implications were probably the two main reasons to go the
self-help way.
Help Yourself
Self-help allowed you to
interact with a machine, except that the machine drove the
interaction; these are popularly called IVR-based self-help systems.
“Please choose an option. Press 1 for ... Press 2 for ...” The system
responded to key presses, and when there was a need to authenticate
you, it asked you to use the telephone keypad to enter your date of
birth in a certain format, say DDMMYY.
However, this interaction
felt unreal without a human touch; guess we missed our natural
instinct to speak in order to resolve a problem! Eventually you found
that there was a “Press 9” or a “Press *” to talk to a customer
representative.
What started off, from the
enterprise perspective, as a way to let you help yourself and thereby
lower the cost of having an actual human talk to you failed to take
off; when people had to use self-service because of the long wait in
the queue to speak to an agent, they did so grudgingly, and customer
satisfaction plummeted.
Although the self-help
channel still existed, it was the conversation with the call center
agent that remained more popular. Enterprises beefed up their
infrastructure and tried to find ways to reduce transaction time so
that more customers could be serviced in the same amount of time.
Could speech recognition technology help?
Talk and Help Yourself
Speech recognition
technology made it possible for you to speak to complete your
transaction. The idea was that the machine interacting with you still
drove the interaction, but it allowed you to speak rather than press
DTMF keys on the keypad. So instead of “Press 1 for ... Press 2 for
...” you were now asked to “Speak /cheque/ to order a fresh cheque
book. Speak /account balance/ to know the balance in your account”.
This was far better than the “Press 1 for ...” kind of self-help.
One very important
prerequisite in this kind of voice-based self-help was the need to
authenticate the speaker based on his or her voice. Enter speech
biometrics.
What is in What I Speak
There are
essentially three components embedded in what we speak: what is said,
how it is said, and who said it. When I speak /good morning, you are
reading my article/, all of these can be sensed by the listener.
The
first is the linguistic information, which is primarily what is said:
the content of the speech, namely the text that I spoke (“Good
morning. You are reading my article”), and the language in which it
was spoken (English in this case).
In
addition, speech carries non-linguistic information: who spoke (me, in
this case!), how I spoke (hopefully pleasantly!), my emotional state
when I spoke, the gender of the speaker, the accent, and the clarity
of articulation. Some of these are purposely added by the speaker and
hence are controllable, while some are inherent to the speaker and
cannot be purposely controlled.
Speech
biometrics exploits, to a large extent, this non-linguistic
information in spoken speech to ascertain the identity of the speaker.
Essentially, there are two broad ways in which speech biometrics is
used: one is speaker verification, the other is speaker
identification.
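To make this concrete, here is a minimal sketch (not the author's system) of how speaker-specific, non-linguistic characteristics are often captured: MFCC features extracted from enrollment speech and modelled with a per-speaker Gaussian mixture. The file names, the 8 kHz telephone sampling rate, and the use of librosa and scikit-learn are illustrative assumptions; modern deployments frequently use i-vectors or neural speaker embeddings instead, but the idea is the same.

```python
# Sketch: capture speaker-specific (non-linguistic) traits with MFCC features
# and a per-speaker Gaussian mixture model. File paths are hypothetical.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def speaker_features(wav_path, sr=8000):
    """MFCCs carry speaker characteristics more than the spoken text."""
    audio, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    return mfcc.T                                   # shape: (frames, 13)

def train_speaker_model(enrollment_wavs):
    """Fit a Gaussian mixture to all of a speaker's enrollment speech."""
    feats = np.vstack([speaker_features(w) for w in enrollment_wavs])
    return GaussianMixture(n_components=16, covariance_type="diag").fit(feats)

def score_against_speaker(model, test_wav):
    """Average log-likelihood of the test speech under the speaker's model."""
    return model.score(speaker_features(test_wav))
```

An enrollment call would feed `train_speaker_model`, and every later call would be scored with `score_against_speaker`; how that score is turned into a decision is exactly the verification problem discussed next.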
Speaker
verification is the process where the speaker claims to be X and the
system uses an a priori model of the claimed person to verify the
genuineness of the claim. The system measures the closeness of the
speaker's speech to the model associated with that speaker; if the
speech is close enough (based on a preset threshold), it validates the
identity, otherwise it denies validation (forgery). In a speaker
verification system the decision is binary: once the speech of a
speaker is compared to the speaker model, the decision is either YES
or NO. This decision is usually based on a threshold value, and the
choice of threshold is determined by the type of application. There
are two contradicting measures that have to be kept low for a speaker
verification system to succeed, namely the false acceptance rate (FAR)
and the false rejection rate (FRR). The FAR is the percentage of
forged attempts that the system has wrongly validated, while the FRR
measures how often the system fails to validate genuine users. The
choice of threshold has opposite effects on FAR and FRR: a threshold
that lowers FAR raises FRR, and vice versa; no single threshold
minimizes both simultaneously. A very secure system would choose a
threshold with near-zero tolerance for false acceptances, at the cost
of an increased FRR (rejecting genuine speakers).
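The sketch below illustrates this trade-off by sweeping a threshold over two hypothetical sets of verification scores, one from genuine attempts and one from impostor attempts; the score distributions are made up purely for illustration.

```python
# Sketch: how the decision threshold trades FAR against FRR.
import numpy as np

def far_frr(genuine_scores, impostor_scores, threshold):
    """FAR: impostors wrongly accepted. FRR: genuine users wrongly rejected."""
    far = np.mean(np.asarray(impostor_scores) >= threshold)
    frr = np.mean(np.asarray(genuine_scores) < threshold)
    return far, frr

def sweep(genuine_scores, impostor_scores, steps=10):
    """Raising the threshold lowers FAR but raises FRR, and vice versa."""
    lo, hi = np.min(impostor_scores), np.max(genuine_scores)
    for t in np.linspace(lo, hi, steps):
        far, frr = far_frr(genuine_scores, impostor_scores, t)
        print(f"threshold={t:6.2f}  FAR={far:6.2%}  FRR={frr:6.2%}")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    genuine = rng.normal(2.0, 1.0, 500)     # made-up genuine-attempt scores
    impostor = rng.normal(-1.0, 1.0, 500)   # made-up impostor-attempt scores
    sweep(genuine, impostor)
```

Where the two error curves cross is the equal error rate, a common single-number summary; a high-security deployment would instead pick a threshold well beyond that crossing, accepting the higher FRR.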
A
quick browse of Wikipedia tells us that speaker verification systems
come in various forms, depending on the amount of security expected
from the verification system. Broadly, we can categorize speaker
verification systems into four categories:
- Fixed Phrase systems, where a pre-determined phrase is used for verification.
- Fixed Vocabulary systems, which are more flexible and practical; training and testing material for a speaker is generated from the words of a fixed vocabulary, and the user speaks words or phrases from this vocabulary both at the time of enrollment (or training) and at the time of testing.
- Flexible Vocabulary systems, which use a general set of sub-word phone models created during speaker model training.
- Text-Independent systems, where the user is not constrained to say fixed or prompted phrases; (s)he has the freedom to speak anything, and the system works by extracting the speaker characteristics from the spoken utterance.
Clearly, both complexity and security increase as we go from fixed
phrase to text-independent.
On
the other hand, speaker identification is the process of identifying
the person behind the voice without the person specifying who (s)he
is. For example, in a banking scenario, if the bank's customer base is
N unique people, then when a user calls, a speaker identification
platform would first have to determine whether the speaker is one
among the N people it knows, and if so, identify which of the N the
speaker is. This is essentially a problem of choosing one among N+1
possible identities: N for the known people, plus 1 for “not among the
N people”!
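A minimal sketch of this (N+1)-way decision, reusing the hypothetical `score_against_speaker` helper from the earlier sketch: the call is scored against every known speaker's model, the best match is taken, and anything below a rejection threshold is treated as the extra “not among the N” identity.

```python
# Sketch: open-set speaker identification as an (N+1)-way choice.
# `speaker_models` maps customer names to per-speaker models (e.g. the GMMs
# sketched earlier); `score_against_speaker` is the earlier scoring helper.
def identify(test_wav, speaker_models, reject_threshold):
    """Return the best-matching known speaker, or None for 'not among the N'."""
    scores = {name: score_against_speaker(model, test_wav)
              for name, model in speaker_models.items()}
    best = max(scores, key=scores.get)
    if scores[best] < reject_threshold:      # the (N+1)-th outcome
        return None
    return best
```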
Clearly,
speaker identification is technologically more challenging than
speaker verification; as a result, most forward-looking enterprises
are looking at speaker verification only.
Using Speech Biometrics
Services
as an industry have seen significant growth in recent times. With this
growth has come the need to service a large number of clients, which
is on the one hand very important but on the other hand a significant
cost to the service industry. Many service industries have tried
mechanisms to minimize the cost of providing a channel for their
customers to interact with them, without denting customer
satisfaction. Automated self-help solutions are the front runners, and
the telephone (IVR) channel is the most sought-after medium of
interaction.
Enterprises
can provide personalized and accurate information if they know whom
they are servicing, and one of the non-intrusive ways to identify the
person on the telephone channel is through spoken speech. In this
context speech biometrics becomes very important.
The
speech signal as a biometric has been gaining significance in recent
times. Speech is probably the most non-intrusive and natural biometric
for verifying the identity of a person. However, the use of speech
biometrics, even in the speaker verification mode, is plagued by the
well-known fact that it is not cent percent right cent percent of the
time (specifically when speech is the only cue), so there has been a
certain reluctance to use it in practice. Nevertheless, several
commercial speech biometric solutions available in the market claim
usable verification accuracies; in almost all cases they use more than
the speech cue to verify the identity of the speaker. For example, one
important additional cue is the requirement that the user authenticate
himself only from his registered mobile phone number.
Irrespective
of the performance accuracies of speaker verification based on the
speech cue, most implementations today speak of replacing the existing
authentication process with a speech biometric system. For example,
when I call a service provider today, after the initial welcome
message (“Welcome to …, how may I help you today?”) I am posed a
simple-looking question by the human agent (“Am I speaking to …? Can I
know your date of birth please?”), which is what is used to verify my
identity.
In
a self-help kind of scenario, the authentication is much more direct.
Sensing my registered mobile number, the system knows who I am
supposed to be; it blurts out “Please speak your password”. I speak my
password, most probably a password that I registered with my service
provider. If I am verified, I am allowed to do the rest of the
transaction; else I am asked to leave or, if the service provider is
courteous, I am put in a queue to be helped by a human agent.
While
there is nothing wrong with using speech biometrics as a replacement
for the human authentication process, there is clearly scope for
customer dissatisfaction: imagine a machine refusing to identify me
and then putting me in a queue to speak to an agent. If this happens
several times (two, to be exact!) I will not opt for anything other
than speaking to the agent!
The
question is: can speech biometrics, with its known problem of not
being accurate all the time, be used in a way that is transparent to
the user, so that (s)he does not get the feeling of being verified at
all?
Invisible Speech Biometrics
A few
things happen when one uses speech biometrics in the form in which it
is envisioned to be used. It not only disturbs the flow of
interaction; the speech biometric authentication step stands as a
process in itself (eating away precious time just to authenticate the
speaker). Additionally, (a) it is a point authentication, meaning the
authentication happens at one given point of time (what if the phone
changes hands after the initial authentication?), and (b) there is
customer dissatisfaction if the speech biometric system fails to
recognize a genuine user. Can we do better?
The
author, in his ignorance, anticipates a use of speech biometrics that
handles all the drawbacks of a regular speech biometric system (see
Figure 3). The idea is that of continuous authentication from the
beginning till the end of the call, thereby eliminating an explicit
“authentication process”. See the marked “Time Jump” in Figure 3,
which indicates that there is no need for an explicit authentication
phase in the call interaction. Continuous, in-the-background speaker
verification means that the process of authentication is invisible to
the customer, and hence the chance of dissatisfaction does not arise.
The issue of the phone changing hands in the midst of the transaction
could also still be detected.
Figure 3. Continuous Authentication
Additionally,
this process of invisible, continuous speaker verification means that
the enterprise can still give me access to certain things even if it
verifies me with low confidence. For example, if my transaction is
just to check on the change of address that I had sought, then the
enterprise can give me that information even if it does not verify me
with high confidence; but when I want to transfer some money, it might
want to verify my identity with higher confidence. This gives the
enterprise the scope not only to verify me when it is required, but
also to decide what degree of confidence is required for what kind of
information.
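One possible shape for such a scheme (a sketch, not the team's implementation): keep a running identity confidence that is updated with every chunk of the caller's speech, and let each transaction type demand its own minimum confidence. The transaction names, the thresholds, and the assumption that each chunk's verification score arrives already normalized to (0, 1) are all illustrative.

```python
# Sketch: continuous, confidence-tiered verification over a whole call.
REQUIRED_CONFIDENCE = {
    "address_change_status": 0.3,   # low-risk query: low confidence suffices
    "account_balance":       0.6,
    "money_transfer":        0.9,   # high-risk transaction: near-certain identity
}

class ContinuousVerifier:
    """Maintains a running identity confidence throughout the call."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha          # weight given to the newest speech chunk
        self.confidence = 0.0       # no evidence yet at the start of the call

    def update(self, chunk_score):
        """Blend the newest chunk's verification score (0..1) into the running value."""
        self.confidence = (1 - self.alpha) * self.confidence + self.alpha * chunk_score
        return self.confidence

    def allowed(self, transaction):
        """Gate each transaction by the confidence it demands."""
        return self.confidence >= REQUIRED_CONFIDENCE.get(transaction, 1.0)
```

A sudden drop in the running confidence mid-call, say because the phone changed hands, would then surface naturally: low-risk queries might still be served, while higher-risk transactions would be denied or routed to an agent.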
Conclusion
Speech
biometrics can really make inroads into actual use if it is deployed
with its known limitations in mind and with an understanding of the
positive or negative impact it can have on the user. The speech and
natural language team at TCS Innovation Labs – Mumbai is exploring how
best to fit speech biometrics into an already existing setup, where
one could avoid the point authentication process by continuously
verifying the speaker. We believe continuous rather than point
authentication is going to be the way forward, because it addresses
the challenges that come with the current thought process in
implementing speech biometrics in a call center or elsewhere.