Breaking Down Speech Recognition Powered by Intelligent Machines


Speech Recognition—sometimes referred to as automatic speech recognition (ASR), speech to text (STT), or computer speech recognition—is the task of converting spoken language into text. ASR processes raw audio signals and transcribes them.

ASR falls under the family of “conversational AI” applications. Conversational AI is the use of natural language to communicate with machines. It typically consists of three subsystems:

  1. Automatic speech recognition (ASR): transcribes the audio into text.
  2. Natural language processing (NLP): extracts meaning from the transcribed text. NLP is concerned with computers processing and analyzing natural language, i.e., any language that has developed naturally rather than artificially (as computer programming languages have).
  3. Text-to-speech (TTS), or speech synthesis: converts text into human-like speech.

Each of the three subsystems integrates multiple neural networks to create a seamless experience for the end-user. Neural networks are a programming paradigm that enables a computer to learn from observational data using a unique architecture of small functional units (the neurons) which are wired together in a manner mimicking a rudimentary model of the human brain.

The most prominent voice-driven applications that have become embedded in our day-to-day lives are voice assistants such as Apple's Siri, Amazon's Alexa, Google Assistant, and Microsoft's Cortana. These question-answering systems have been improving at a tremendous pace in the last few years due to advances in deep learning and big data. As an example of the forward momentum, Google recently announced Meena, a “2.6 billion parameter end-to-end trained neural conversational model” designed to handle wide-ranging conversations.

Our previous posts explored NLP techniques such as sentiment analysis and named entity recognition (NER). 

Here, we explore the first subsystem of a conversational workflow: automatic speech recognition. We explain how it works, explore some use-cases, and see how you can apply it in your business.

How does speech recognition work?

At its heart, speech recognition is a 3-step process:

  1. Feature extraction: we first sample the continuous raw audio signal into a discrete one. The discrete signal needs to be converted into a form that is digestible by a machine learning algorithm, such as a spectrogram. The spectrogram input to the algorithm can be thought of as a descriptive numerical vector at each timestep. It is obtained by applying a mathematical operation called a short-time Fourier transform on the discrete audio signal.
  2. Acoustic modeling: this takes the spectrogram and predicts, for each time step, the probability of each unit in a vocabulary (words or, in many systems, smaller units such as characters or phonemes).
  3. Language modeling: this adds contextual information about the language and is used to correct mistakes made by the acoustic model. It tries to determine what was spoken by combining what the acoustic model thinks it heard with what is a likely next word (based on its knowledge of the language).
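To make the feature extraction step more concrete, here is a minimal sketch of a short-time Fourier transform in plain Python. The function name `stft_magnitudes`, the frame length, the hop size, and the synthetic 1 kHz test tone are all assumptions chosen for illustration. A real pipeline would use an optimized FFT from a signal processing library and often applies further steps (such as a mel filterbank) that are not shown here.

```python
import cmath
import math


def stft_magnitudes(samples, frame_len=64, hop=32):
    """Return a magnitude spectrogram: one list of frequency-bin magnitudes per frame."""
    # A Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    spectrogram = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        # Naive discrete Fourier transform of the windowed frame.
        magnitudes = []
        for k in range(n_bins):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            magnitudes.append(abs(acc))
        spectrogram.append(magnitudes)
    return spectrogram


# Synthetic "audio": a 1 kHz tone sampled at 8 kHz for 0.1 seconds.
sample_rate = 8000
samples = [math.sin(2 * math.pi * 1000.0 * n / sample_rate) for n in range(800)]

spec = stft_magnitudes(samples)
first_frame = spec[0]
peak_bin = max(range(len(first_frame)), key=lambda k: first_frame[k])
peak_hz = peak_bin * sample_rate / 64  # each bin spans sample_rate / frame_len Hz
print(len(spec), len(first_frame), peak_hz)
```

Each row of `spec` is the "descriptive numerical vector" for one time step. For the test tone, the strongest bin in every frame sits at the tone's frequency, which is the kind of structure an acoustic model learns to read.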

In modern systems, both the acoustic model and the language model are typically neural networks.
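As a rough illustration of how a language model can correct an acoustic model, the sketch below rescores two similar-sounding candidate transcriptions. Everything in it is hypothetical: the candidate sentences, the probabilities, the toy bigram table, and the `lm_weight` parameter are made up for the example, and production systems combine scores inside a decoder rather than over a fixed list of two sentences.

```python
import math

# Hypothetical acoustic-model scores for two candidate transcriptions that
# sound almost identical. The numbers are invented for illustration.
acoustic_logp = {
    "turn write at the light": math.log(0.55),
    "turn right at the light": math.log(0.45),
}

# Hypothetical bigram language model: P(word | previous word).
bigram_prob = {
    ("turn", "right"): 0.2,
    ("turn", "write"): 0.001,
    ("right", "at"): 0.1,
    ("write", "at"): 0.01,
    ("at", "the"): 0.3,
    ("the", "light"): 0.05,
}
UNSEEN_PROB = 1e-6  # floor for word pairs the toy model has never seen


def language_logp(sentence):
    """Sum of log bigram probabilities over consecutive word pairs."""
    words = sentence.split()
    return sum(
        math.log(bigram_prob.get(pair, UNSEEN_PROB))
        for pair in zip(words, words[1:])
    )


def combined_score(sentence, lm_weight=0.5):
    """Acoustic log-probability plus a weighted language-model log-probability."""
    return acoustic_logp[sentence] + lm_weight * language_logp(sentence)


best_acoustic = max(acoustic_logp, key=acoustic_logp.get)
best_combined = max(acoustic_logp, key=combined_score)
print("acoustic only:", best_acoustic)
print("with language model:", best_combined)
```

On its own, the toy acoustic model slightly prefers "write". Once the language model's knowledge of likely word sequences is added, "turn right at the light" wins.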

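ASR models are evaluated using the word error rate (WER). This is the number of words the model substituted, deleted, or inserted relative to the actual transcription, divided by the number of words in the actual transcription, and it is usually expressed as a percentage. Lower is better, and because insertions count as errors, WER can exceed 100%.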
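The WER calculation can be sketched with a word-level edit distance. This is a minimal version that assumes simple whitespace tokenization and no text normalization (casing, punctuation), both of which evaluation tools usually handle. The function name `word_error_rate` and the example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")

    # dist[i][j] = minimum number of edits to turn the first i reference
    # words into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # deletion
                dist[i][j - 1] + 1,                      # insertion
                dist[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return dist[-1][-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion (the second "the")
# against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.1%}")
```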

Where does speech recognition struggle?

There are numerous challenges faced by ASR. Here are just a few:

  • Background noise
  • Different accents and dialects
  • Low-quality microphone/recording equipment
  • Missing non-verbal context: listening to what people say is more than just hearing the words they speak. We engage all our senses during a conversation, reading a person’s facial expressions, body language, and the inflections in their voice. Much of this information is lost when speech is processed as a raw audio signal.
  • Multiple speakers
  • Abbreviations and continuously evolving language
  • Overlapping speech
  • Homophones and near-homophones: words like 'there/their,' 'air/hair,' and 'right/write' are pronounced identically or very similarly but have different meanings.

The most reliable way forward, as is often the case in machine learning, is to generate more relevant labeled training data. By relevant, we mean training data that is more representative of the situations the model will encounter once deployed.

The more high-quality, relevant labeled data you train your model on, the better it becomes at handling noise, accents, and other variations in speech.

Speech recognition use cases

In line with its versatility, ASR has a wide range of use cases. With the advancements in conversational AI, it would not be surprising for speech to become the dominant user interface in the coming decade.

Some notable use cases include:

  1. Voice-driven intelligent virtual assistants (IVA): speaking is a more natural way to interact with an intelligent machine than typing text or pushing buttons. Smart virtual assistants have seen a tremendous adoption rate due to the ease of interaction, with the global market forecast to achieve a 19.8% compound annual growth rate through 2026.
  2. Home automation: with the move to smart homes and IoT devices, there is going to be rapid growth in voice-activated devices. In fact, 24% of US adults already own at least one smart speaker.
  3. Smart devices: our phones, computers, watches, TVs, refrigerators, and in-car systems will be transformed this decade by conversational AI applications. Currently, 54% of the U.S. population report having used voice-command technology at some point, and that figure is sure to rise.
  4. Generating transcripts of discussions at conferences, meetings, and classroom lectures. Such tools increase inclusivity for people with hearing and seeing difficulties.
  5. Automated call centers to reduce the staffing costs of human customer representatives in multiple industries, from banking to healthcare.
  6. Automatic captioning of videos using speech recognition: with some 500 hours’ worth of video content uploaded to YouTube every minute, manual video captioning is not a feasible option. YouTube uses ASR to generate subtitles for its videos.
  7. Seamless translation of languages
  8. Hands-free computing

How can I use speech recognition?

If you think that your business or project could benefit from ASR, it’s pretty easy to start. Kaldi and Nvidia's NeMo (standing for neural modules) are popular open-source toolkits for speech recognition.

But before you begin using one of these frameworks to build a model, you will need to produce a relevant labeled dataset to train the model.

You can provide us with your raw audio files and we’ll transcribe the audio for you, returning a high-quality training dataset that you can use to train and tailor your ASR model.

If you’re interested in learning more, or have a specialized use case, reach out to us. You can also stay tuned to our blog, where we’re continuing to run a series of posts covering different aspects of NLP.

Ralf Banisch
Data Scientist