Wednesday, 6 June 2012

Voice recognition Human VS Computer

I am sure many will have seen the lovely Siri adverts: "Did mum send me that recipe?" The amazing, super-powered Siri shows the recipe and leaves you with the impression that he would probably get up and cook the thing for you if you asked him politely.

While it may appear to make humans redundant, we are certainly a long way from the Rise of the Machines.

Dear aunt, let's set so double the killer delete select all.

Yes the "demonstrated" voice recognition was clearly very clever, but an advert is a long way from reality.

The first problem is the lack of accurate voice recognition. I remember a Microsoft conference where the speaker was dictating a letter. While trying to type "Dear Mom," he managed to achieve "Dear aunt, let's set so double the killer delete select all." That was a few years before Siri on Apple and S-Control and S-Voice on Android, but the results are still a pathetic waste of time.

A quick test of "Call Lee" returned "no contacts of that name found". I have two contacts called Lee and one called Leigh; Siri failed to spot any of them in multiple tests in a quiet environment.

The slightly harder "Call Sielde" returned the delightful message "placing call to Thomas Merryfield". I tried about 14 times and only ever got Thomas Merryfield or no contacts found.

There is no way that Sielde sounds anything like Thomas Merryfield or that

Anything more complex really resulted in some delightful responses. "Event get up early Monday 11th 6am" was probably the closest with the following:

- Creating event I love am Monday 11th June 22:59 - 23:59
- Creating event Get up early 6am 11th February 2013

Now it takes the voice recognition 22 seconds to achieve a failure, whereas correct manual entry takes 18 seconds.

The only task that worked with any consistency, with a glorious 6 successes out of 8, was "play music Lacuna Coil". It felt like I had discovered the Konami code all on my own: finally, a voice command that works...

Now For the Science Bit

Human voice recognition is dramatically superior in two respects:

1. Understanding the actual words
2. Understanding the contextual use of these words

When listening to someone talk we use actions, gestures, recent context, and recognition of sentence and lip patterns to understand what they are saying. When you cannot see someone's lips it becomes much harder to understand them, and not knowing the context reduces our ability to understand words even further. With context and lip reading, however, we can achieve amazing things: even distorted sound can be distinguished, and even incorrectly pronounced sentences can be understood perfectly.

Take, for example, the situation where I have 300 people on my phone called John. If I had been talking with a friend about John's new baby and wondering how he was getting on, and I then asked a human to "Call John", they would know from context that it was the John with the new baby I wanted to call. Siri (if it got past the first hurdle and understood that you meant John) would say that you have 300 contacts with the name John.

But this is only one example of context. The second human benefit is adjusting the method based on feedback. For example, if a phone operator is struggling to understand a caller, they will ask them to spell out details: "Lima, Oscar, Mike..." This adjustment breaks the negative feedback loop that exists with computer voice recognition, where inadequate voice and context recognition leads to constant failure.

Now, achieving full human-level context would be hard for a phone. It would need to identify all interactions, constantly record perhaps the last 30 minutes of conversation and analyse it appropriately, and monitor your browsing, application use and location to attempt to determine who it should be contacting. However, there do appear to be some basic steps that it could take to make things better.

Let's Call Lee

A few steps to force context and feedback could have reduced my frustration hugely, and would almost always result in a successful voice recognition call:

1. Cut down the contact list to people who have previously called or been called, and perform a very fuzzy match (this would probably resolve 99% of my issues).
2. Cut the contact list to just those with phone numbers and perform a stricter voice search.
3. Alter context based on feedback:
- Sorry, I did not recognise the name, please try again
- Sorry, I cannot recognise the name, please spell it for me...
- Sorry, I still cannot understand. I will read a list of recently called contacts, please say stop when I read out the correct contact.
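The three steps above can be sketched roughly as follows. This is only an illustration, not how Siri or any real assistant works: the function names, the contact lists and the cutoff values (0.4 for the very fuzzy match, 0.8 for the stricter one) are all assumptions of mine, and plain text similarity from Python's `difflib` stands in for whatever acoustic matching a real phone would use.

```python
from difflib import SequenceMatcher

# Step 3 prompts, taken from the list above, in escalating order.
FALLBACK_PROMPTS = [
    "Sorry, I did not recognise the name, please try again",
    "Sorry, I cannot recognise the name, please spell it for me",
    "Sorry, I still cannot understand. I will read a list of recently "
    "called contacts, please say stop when I read out the correct contact.",
]


def similarity(heard, name):
    """Text similarity between 0.0 and 1.0 (a stand-in for acoustic matching)."""
    return SequenceMatcher(None, heard.lower(), name.lower()).ratio()


def best_matches(heard, names, cutoff):
    """Return names scoring at least `cutoff`, best match first."""
    scored = sorted(((similarity(heard, n), n) for n in names), reverse=True)
    return [name for score, name in scored if score >= cutoff]


def resolve_contact(heard, recent_contacts, contacts_with_numbers, failed_attempts=0):
    """Hypothetical resolver: returns ("call", name) or ("ask", prompt)."""
    # Step 1: very fuzzy match against people previously called or calling.
    matches = best_matches(heard, recent_contacts, cutoff=0.4)
    if matches:
        return ("call", matches[0])
    # Step 2: stricter match against every contact that has a phone number.
    matches = best_matches(heard, contacts_with_numbers, cutoff=0.8)
    if matches:
        return ("call", matches[0])
    # Step 3: change the method based on how many attempts have failed so far.
    index = min(failed_attempts, len(FALLBACK_PROMPTS) - 1)
    return ("ask", FALLBACK_PROMPTS[index])
```

With a recent-calls list containing "Leigh", hearing "Lee" resolves at step 1, because the small candidate list makes it safe to accept a loose match. An unrecognisable name falls through to the prompts, which escalate as `failed_attempts` grows.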

Breaking negative feedback loops is perhaps the best thing voice recognition can do. The next is learning. This is harder and potentially requires significant data storage, but by analysing correct responses the feedback loop can be reduced, so that "call Lee" always returns the correct response and does not have to reach stage 3 part 3 before you can initiate a phone call.
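In its simplest form, the learning idea could be a lookup of phrases the user has already confirmed, checked before any fuzzy matching is attempted. The sketch below is a minimal illustration under that assumption. The class name, its methods and the example contact are hypothetical, and a real system would presumably store something richer than the recognised text.

```python
class LearnedCommands:
    """Remembers which contact a recognised phrase turned out to mean."""

    def __init__(self):
        self._confirmed = {}

    @staticmethod
    def _normalise(heard):
        # Ignore case and surrounding whitespace so "Call Lee " matches "call lee".
        return heard.strip().lower()

    def remember(self, heard, contact):
        """Record that the user confirmed `contact` for the phrase `heard`."""
        self._confirmed[self._normalise(heard)] = contact

    def lookup(self, heard):
        """Return the previously confirmed contact, or None if never seen."""
        return self._confirmed.get(self._normalise(heard))


memory = LearnedCommands()
memory.remember("Call Lee", "Lee Example")  # confirmed once, after the prompts
print(memory.lookup("call lee"))            # answered directly the next time
```

The storage cost here is one entry per confirmed phrase, which is tiny. The significant storage mentioned above would only come in if the phone kept the audio itself.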

Still, even in Star Trek they didn't think the computer voice control could cope with navigation tasks. My phone certainly could not cope with "navigate to Bicester" (pronounced correctly), despite it being one of the nearest villages to Oxford. That leaves no chance of it correctly navigating through uncharted galaxies ;).