You can just say a few words into the air, and a nearby digital device can understand you and grant your wishes.
At Capturi we work with machine learning and speech recognition in order to design digital tools that make meetings more effective. One of our top in-house specialists is Morten, an expert in speech recognition who has worked with it since 2004. In this interview, we pick his brain to understand the magic behind voice technology and to get an idea of the possibilities it offers the business world.
This blog post is part 1 of 3 in a blog series.
What is speech recognition?
Most of us know speech recognition from Apple’s Siri, Amazon’s Alexa, Microsoft’s Cortana or Google Assistant. Every week the iPhone assistant Siri handles over 2 billion voice commands that range from “call mom” to “where’s the closest pizzeria?” But how does this “magic” work? Morten explains here:
“It’s the process of transforming speech to text. There is a computer or a device that listens to what you say. And inside that device, models and a speech recognizer try to match phoneme sequences to your speech patterns. The device puts the sequences together as words and then puts the words together as sentences.”
Okay, so the speech recognizer makes matches? Then it must know something about how the language is structured to be able to make qualified guesses? Morten elaborates and explains that he works with three information sources (three databases) that inform the speech recognizer about the structure of speech:
- THE ACOUSTIC MODEL: The model describes what the individual sounds of the language consist of acoustically, e.g. what does an S consist of and what does an A consist of? And how likely are these sounds in the language?
- THE PRONUNCIATION DICTIONARY: The speech recognizer is told what “soundbites” (phonemes) a word consists of. For example, the word “sky” will be translated to three phonemes: [s k aɪ].
- THE LANGUAGE MODEL: The last model is a list of the most common words and their probabilities. This model is therefore based on statistics that tell you: “What is the probability that these words will follow each other?”
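To make two of these information sources more concrete, here is a minimal Python sketch of a pronunciation dictionary and a word-pair (bigram) language model. It is an illustration only, not Morten's actual setup: the dictionary entries, the three-sentence corpus and the function names are all made up for the example, and the diphthong aɪ is treated as a single sound unit. The acoustic model is left out because it needs real audio data.

```python
from collections import Counter

# Hypothetical pronunciation dictionary: word -> sequence of sound units (IPA).
# The entries are illustrative; a real dictionary holds many thousands of words.
PRONUNCIATIONS = {
    "sky": ("s", "k", "aɪ"),
    "sea": ("s", "iː"),
    "blue": ("b", "l", "uː"),
    "the": ("ð", "ə"),
    "is": ("ɪ", "z"),
}

# Tiny made-up training corpus for the language model.
CORPUS = [
    "the sky is blue",
    "the sea is blue",
    "the sky is the sky",
]


def train_bigram_model(sentences):
    """Estimate P(next word | previous word) from raw counts."""
    pair_counts = Counter()
    prev_counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}


def words_for_phonemes(phonemes):
    """Return every dictionary word whose pronunciation matches exactly."""
    target = tuple(phonemes)
    return [word for word, pron in PRONUNCIATIONS.items() if pron == target]


bigram = train_bigram_model(CORPUS)
print(words_for_phonemes(["s", "k", "aɪ"]))  # dictionary lookup
print(bigram[("the", "sky")])                # how often "sky" follows "the"
```

With more sentences in the corpus, the counts behind each probability grow and the estimates become more reliable, which is the intuition behind feeding the models as much training material as possible.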
Every day Morten works with enormous amounts of data that he “feeds to the machine” in order to train the different models. The more training material, the more precise the speech recognizer will be.
Who is already taking advantage of the technology today?
Besides the personal assistant on your smartphone, plenty of industries have already found good uses for speech recognition. When lawyers, doctors and caseworkers dictate cases and journals, we are dealing with a classic use of speech recognition. Morten explains:
“You sit in a quiet room with a headset or a microphone close to your mouth, and then you articulate very clearly to a machine and try to make it understand you as well as possible. It is the easiest way to do speech recognition: the user adapts the way he speaks so that the speech recognizer understands what is said. You talk very clearly and not too fast, you remember all the word endings, and you make sure your surroundings are quiet.”
Speech recognition can also be used for reading and language training. For example, if you want to learn Spanish, you can use the app Duolingo, which shows you a picture and plays some speech, after which you must repeat the phrase or choose what is in the picture. The app then responds with “Good job” or suggests adjustments you can make.
Another interesting area is voice control. Today you can control a great number of things with a simple voice command: you can change the TV channel, make calls from your car, turn off the lights in your house etc. Morten points out that voice control is a particular advantage in “the industries where you need to use your hands for something else. E.g. surgeons who are operating and want something else shown on their display”.
The technology is used in a lot of other contexts, and international experts estimate that speech recognition is a fast-growing technology on its way to becoming a productive tool across a broad range of businesses. At Capturi, we have already implemented speech recognition in our newest meeting app.
But can we control everything by voice then?
The answer is no. Even though the technology is mature enough to make a lot of business processes more effective, there are still challenges, and there is a long way to go before machines can match human competencies and cognitive abilities. Progress is fast, however: Google's speech recognition technology now has a 4.9% word error rate. This means that Google transcribes only about every 20th word incorrectly, which is a big improvement over 2013, when almost every 4th word was incorrect. In other words, technological progress is speeding up, and Morten also sees great potential in the development:
“We hope to improve the system. Make it better, get more accurate recognitions and pull out more relevant words. There aren’t a lot of people who do exactly what we do. It is relatively new. Therefore, we want to make it work, and make it work better than anybody else.”
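The word error rate mentioned above is conventionally computed by aligning the recognizer's output with a correct reference transcript and dividing the number of substituted, deleted and inserted words by the number of words in the reference. The Python sketch below shows that calculation under simple assumptions: words are split on whitespace, and the two example sentences are invented for illustration. It is not a description of how Google measured its figure.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: a 20-word reference with one misrecognized word.
reference = ("please call mom and then tell me where the closest pizzeria is "
             "so we can order dinner for the family")
hypothesis = ("please call mom and then tell me where the closest pizza is "
              "so we can order dinner for the family")

print(word_error_rate(reference, hypothesis))  # 1 error in 20 words -> 0.05
```

One wrong word out of 20 gives 5%, which is why a 4.9% word error rate corresponds to roughly every 20th word being incorrect.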
The ambitions are high and the possibilities are great, because with speech recognition you don’t need to build a device from scratch. Morten’s team takes advantage of a number of open-source tools and works with a “patchwork” of other people’s work:
“It is a giant patchwork and there are thousands of man hours that have already gone into this. People have put all kinds of parts into this. We then try to improve aspects of it, and some things we code from scratch.”
Join us next time for the second part of our blog series, where we dive into the pros and cons of speech recognition.
Smarter meetings? Yes please!
If you liked this article from Capturi, you’re probably on a mission to have more effective meetings, just like us! You might want to try our platform. It’s a simple tool that manages all your meetings and tasks in ONE place. Collaborate, assign tasks, add due dates and keep track of your team’s progress.
How voice technology is transforming computing: http://www.economist.com/news/leaders/21713836-casting-magic-spell-it-lets-people-control-world-through-words-alone-how-voice
What’s now and next in analytics, AI, and automation: http://www.mckinsey.com/global-themes/digital-disruption/whats-now-and-next-in-analytics-ai-and-automation?cid=soc-web