Acoustic Source Separation Improves Speech Recognition

Overview

In noisy environments, acoustic source separation can improve speech enhancement and recognition accuracy for mobile phones, wearables, and other smart devices.

Voice or speech interfaces on mobile phones, tablets, computers, wearables, and other smart devices are increasingly common and important, because they allow interactions without keyboards or touchscreens. To provide accurate speech processing, systems must reliably recognize speech even under complex noise conditions.

Millions of people already rely on automatic speech recognition to transcribe spoken words into text. However, recognition quality depends on favorable conditions, such as relatively quiet environments and speakers whose voices closely match the training data. Even then, human editors are often needed to fix punctuation, grammar, and other transcription errors. Continued improvements in speech technology are necessary to raise recognition accuracy across mobile and embedded applications and in noisy environments such as automotive cabins.

Noise Suppression

Speech enhancement relies on acoustic source separation and noise suppression techniques. This article focuses on acoustic source separation; noise suppression is summarized briefly.

Noise suppression helps remove various types of background noise that interfere with speech recognition. Noise characteristics can be described in both time and frequency domains. Time-domain noise includes continuous, intermittent, and impulsive noise. Frequency-domain noise includes wideband and narrowband noise. Continuous noise, such as office or traffic sounds, machine operation noise, and hiss, changes slowly. Intermittent noise is repetitive, such as horns or bells. Impulsive noise is abrupt, like clicks or knocks. Wideband noise, such as hiss, spans many frequencies, while narrowband noise occurs within a specific frequency range and includes sine tones, hums, and mechanical noise.

Engineers have tried various filtering techniques, each effective against certain noise types. Because noise characteristics can change over time, adaptive algorithms are often needed to track those changes. Examples of noise removal techniques include spectral compensation, impulse filtering, adaptive wideband filtering, adaptive inverse filtering, and stereo filtering.
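As one illustration of the impulse filtering named above, the sketch below uses a sliding median filter to remove clicks. The median filter is an assumed choice for illustration; the article does not say which impulse filter is meant. The function name `median_filter`, the window width, and the test signal are all hypothetical.

```python
import math
from statistics import median

def median_filter(samples, width=5):
    """Replace each sample with the median of a sliding window (odd width)."""
    if width % 2 == 0:
        raise ValueError("width must be odd")
    half = width // 2
    out = []
    for i in range(len(samples)):
        lo = max(0, i - half)                 # window is truncated at the edges
        hi = min(len(samples), i + half + 1)
        out.append(median(samples[lo:hi]))
    return out

# Hypothetical test signal: a slow 50 Hz sine sampled at 8 kHz with two clicks.
rate = 8000
clean = [math.sin(2 * math.pi * 50 * n / rate) for n in range(400)]
noisy = list(clean)
noisy[100] += 5.0   # injected impulsive noise
noisy[250] -= 5.0

filtered = median_filter(noisy, width=5)
worst_before = max(abs(a - b) for a, b in zip(noisy, clean))
worst_after = max(abs(a - b) for a, b in zip(filtered, clean))
print("largest error before:", worst_before)
print("largest error after: ", worst_after)
```

A short window works here because a click lasts one sample while the underlying signal changes slowly. Longer impulses would need a wider window, at the cost of more smoothing of the speech itself.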

What Is Acoustic Source Separation

Acoustic source separation is an alternative approach to improving speech recognition. Rather than only masking and filtering noise, it identifies features specific to human speech so that valid speech passes through while background noise is rejected. This technique can substantially increase speech clarity and recognition accuracy in noisy environments. To reliably identify speech components, source separation systems combine acoustic and speech models. Two modeling approaches used in embedded designs are described here: deep neural networks and cochlear simulation, which models the human auditory system from the inner ear to the brain.

Deep Neural Network Approach

The deep neural network approach requires a large database containing hundreds of hours of speech and noise for training. Initially the network has no concept of speech; through extensive training it learns to identify different speech patterns. Separation quality also depends on the ability to determine where each sound originates. Capturing audio with two or more microphones improves performance, and networks can even be trained to identify who is speaking and when.

Training on the database produces compact, fast algorithms that are then deployed to target digital signal processors (DSPs) to monitor and classify speech. Together, the adaptive algorithms developed from the database are referred to as a neural network.

Neural networks decompose input audio and analyze segments to identify different speech patterns. They evaluate features such as frequency, harmonics, attack, and decay characteristics to distinguish speech from environmental noise. Networks trade off performance against audio sampling rate: lower sampling rates require less processing but are less precise, while higher rates are more accurate but computationally more complex.
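The sampling-rate trade-off can be made concrete with some rough arithmetic. The sketch below assumes 20 ms analysis frames with a 10 ms hop and a radix-2 FFT per frame. None of these figures come from the article, and the function name `frame_budget` is hypothetical. The butterfly count is the usual textbook estimate, not a measurement of any particular DSP.

```python
def frame_budget(sample_rate_hz, frame_ms=20, hop_ms=10):
    """Rough per-second analysis cost for one sampling rate."""
    frame_len = sample_rate_hz * frame_ms // 1000
    hop_len = sample_rate_hz * hop_ms // 1000
    fft_size = 1 << (frame_len - 1).bit_length()   # next power of two
    log2_n = fft_size.bit_length() - 1
    frames_per_second = sample_rate_hz // hop_len
    # Textbook radix-2 estimate: (N / 2) * log2(N) butterflies per FFT.
    butterflies = (fft_size // 2) * log2_n
    return {
        "frame_len": frame_len,
        "fft_size": fft_size,
        "max_freq_hz": sample_rate_hz / 2,          # Nyquist limit
        "butterflies_per_second": butterflies * frames_per_second,
    }

low = frame_budget(8000)
high = frame_budget(16000)
for name, budget in (("8 kHz", low), ("16 kHz", high)):
    print(name, budget)
```

Under these assumptions, moving from 8 kHz to 16 kHz doubles the highest frequency that can be represented (4 kHz to 8 kHz) while raising the estimated FFT work by a factor of 2.25.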

Various filtering algorithms are applied to identify desired waveforms while removing unwanted audio components. Combining multiple filters suppresses noise more effectively than a single filter and helps recover audio components that would otherwise be lost. During post-processing, algorithm parameters are tuned to optimize the audio either for human listening or for speech recognition systems. The distinction matters because humans and recognition systems use different parsing strategies.
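One way to picture a tunable post-processing parameter is a gain floor on a time-frequency mask. This is an illustrative assumption, not the specific tuning described in the article. The function `apply_mask`, the parameter `gain_floor`, and the numbers below are all hypothetical placeholders.

```python
def apply_mask(magnitudes, speech_mask, gain_floor=0.0):
    """Scale each spectral magnitude by a mask value clipped to [gain_floor, 1]."""
    out = []
    for mag, m in zip(magnitudes, speech_mask):
        gain = min(1.0, max(gain_floor, m))
        out.append(mag * gain)
    return out

# Hypothetical single frame: four frequency bins and a speech mask in [0, 1].
magnitudes = [0.9, 0.2, 0.7, 0.1]
speech_mask = [0.95, 0.05, 0.8, 0.0]

with_floor = apply_mask(magnitudes, speech_mask, gain_floor=0.1)
without_floor = apply_mask(magnitudes, speech_mask, gain_floor=0.0)
print("floor 0.1:", with_floor)
print("floor 0.0:", without_floor)
```

A nonzero floor leaves a little background in bins judged to be noise, while a zero floor removes them entirely. Which setting suits a listener or a given recognizer would have to be found by tuning against that target.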

Cochlear Simulation

This source separation method runs computational auditory scene analysis (CASA) algorithms on a DSP platform to model how the human auditory system extracts speech from noisy environments. The approach encodes audio information for grouping and analysis. There are dozens of grouping cues related to time and frequency, including pitch, spatial location, and onset/offset times.
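To make the pitch cue concrete, here is a minimal sketch that estimates the fundamental frequency of one frame from its autocorrelation peak. Autocorrelation is a common textbook method and is assumed here for illustration; the article does not specify how a CASA system extracts pitch. The function name `estimate_pitch`, the search range, and the synthetic test frame are hypothetical.

```python
import math

def estimate_pitch(frame, sample_rate_hz, fmin=60.0, fmax=400.0):
    """Return the pitch in Hz at the strongest autocorrelation peak."""
    min_lag = int(sample_rate_hz / fmax)
    max_lag = int(sample_rate_hz / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself delayed by `lag` samples.
        score = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate_hz / best_lag

# Hypothetical voiced frame: 200 Hz fundamental plus two weaker harmonics.
rate = 8000
frame = [
    math.sin(2 * math.pi * 200 * n / rate)
    + 0.5 * math.sin(2 * math.pi * 400 * n / rate)
    + 0.25 * math.sin(2 * math.pi * 600 * n / rate)
    for n in range(640)
]
pitch_hz = estimate_pitch(frame, rate)
print("estimated pitch:", pitch_hz, "Hz")
```

The harmonics at 400 Hz and 600 Hz reinforce the peak at the 200 Hz period rather than competing with it, which is why harmonic patterns are useful for tying components to a single source.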

Pitch is an important grouping cue because harmonic patterns help identify unique sound sources. With two or more microphones, systems can use spatial information to estimate the direction and distance of each sound source. CASA modeling enables the so-called cocktail party effect, allowing the system to focus on a particular sound source, such as a specific speaker, while suppressing background noise. Onset/offset grouping refers to the times when a sound component begins and ends; combined with frequency data, this helps determine whether components belong to the same source.
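The spatial cue can be sketched with two microphones by finding the delay that best aligns their signals and converting it to an arrival angle. This is a simplified far-field illustration that covers direction only, not distance, and it is an assumption rather than the method the article describes. The function names, the 0.1 m spacing, and the synthetic signals are hypothetical.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C

def delay_in_samples(mic_a, mic_b, max_lag):
    """Lag (in samples) by which mic_b trails mic_a, from the cross-correlation peak."""
    n = len(mic_a)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += mic_a[i] * mic_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def arrival_angle_deg(lag, sample_rate_hz, mic_spacing_m):
    """Far-field angle from broadside for a two-microphone pair."""
    path_diff_m = lag / sample_rate_hz * SPEED_OF_SOUND
    ratio = max(-1.0, min(1.0, path_diff_m / mic_spacing_m))
    return math.degrees(math.asin(ratio))

# Hypothetical capture: the same noise-like source reaches mic B 3 samples late.
rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(400)]
mic_a = source
mic_b = [0.0] * 3 + source[:-3]

rate, spacing = 16000, 0.1
max_lag = math.ceil(spacing / SPEED_OF_SOUND * rate)  # largest physically possible lag
lag = delay_in_samples(mic_a, mic_b, max_lag)
angle = arrival_angle_deg(lag, rate, spacing)
print("delay:", lag, "samples, angle:", round(angle, 1), "degrees")
```

Limiting the search to lags that the microphone spacing physically allows keeps the cost down and avoids spurious peaks at implausible delays.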

Sounds with similar attributes form a common audio stream, while sounds with different attributes form separate streams. The system uses these streams to identify persistent or repeating sound sources. After sufficient grouping, the separation process matches the identified sources and responds to the actual speaker. A reverse transformation then reconstructs the separated data into audio streams for playback.
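The grouping idea can be sketched as a small rule-based pass over detected sound components. This is a toy illustration that assumes only two cues (harmonic relation and shared onset/offset times) and a greedy assignment. The `Component` type, the tolerances, and the example values are hypothetical, and a real CASA system would combine many more cues.

```python
from dataclasses import dataclass

@dataclass
class Component:
    freq_hz: float
    onset_s: float
    offset_s: float

def is_harmonic(freq_hz, fundamental_hz, tolerance=0.03):
    """True if freq_hz lies within `tolerance` (relative) of an integer multiple."""
    ratio = freq_hz / fundamental_hz
    nearest = round(ratio)
    return nearest >= 1 and abs(ratio - nearest) <= tolerance * nearest

def group_streams(components, onset_tol_s=0.02, offset_tol_s=0.05):
    """Greedy grouping: the lowest-frequency member seeds each stream."""
    streams = []
    for comp in sorted(components, key=lambda c: c.freq_hz):
        for stream in streams:
            seed = stream[0]
            if (is_harmonic(comp.freq_hz, seed.freq_hz)
                    and abs(comp.onset_s - seed.onset_s) <= onset_tol_s
                    and abs(comp.offset_s - seed.offset_s) <= offset_tol_s):
                stream.append(comp)
                break
        else:
            streams.append([comp])   # no stream matched, so start a new one
    return streams

# Hypothetical scene: a steady 50 Hz hum and a voiced sound starting at 0.5 s.
scene = [
    Component(50, 0.0, 2.0), Component(100, 0.0, 2.0), Component(150, 0.0, 2.0),
    Component(120, 0.5, 0.9), Component(240, 0.5, 0.9), Component(360, 0.5, 0.9),
]
streams = group_streams(scene)
for stream in streams:
    print([c.freq_hz for c in stream])
```

In this example the 360 Hz component is close to a multiple of 50 Hz, so frequency alone would not separate it from the hum. Its onset at 0.5 s is what ties it to the voiced stream, which shows why onset/offset times are combined with frequency data.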

Figure 2: Various acoustic source separation methods enable the cocktail party effect, isolating a target source while suppressing background noise.

Considerations

Acoustic source separation is useful beyond high-quality speech recognition. For example, emergencies often unfold in noisy, chaotic environments where fast and accurate voice communication is critical, and clear speech recognition can help first responders locate people in distress. Compared with noise suppression alone, source separation provides a more effective mechanism for improving speech communication in uncontrolled environments.

Dedicated DSP voice processors can optimize performance while keeping power consumption low, which matters both for always-on voice applications and for modes that require manual activation. Always-on voice functionality consumes energy when the main system processor has to remain active. To save battery life, always-on voice applications can instead use a dedicated voice processor that supports sleep modes, maintains essential functionality, and provides both a low-power listening mode and a full-wake mode.
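The split between low-power listening and full wake can be sketched as a small state machine. The energy threshold and hangover logic below are assumptions chosen for simplicity; the article does not describe how a voice processor decides to wake, and real devices typically use more selective triggers. All names and values are hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of one audio frame."""
    return sum(x * x for x in frame) / len(frame)

def wake_states(frames, threshold, hangover_frames=2):
    """Return 'listen' or 'awake' per frame using an energy gate with hangover."""
    states = []
    awake = False
    quiet_run = 0
    for frame in frames:
        if frame_energy(frame) >= threshold:
            awake = True
            quiet_run = 0
        elif awake:
            quiet_run += 1
            if quiet_run > hangover_frames:
                awake = False        # enough quiet frames: drop back to listening
        states.append("awake" if awake else "listen")
    return states

# Hypothetical input: two quiet frames, two loud frames, then four quiet frames.
quiet = [0.001] * 160
loud = [0.5, -0.5] * 80
frames = [quiet, quiet, loud, loud, quiet, quiet, quiet, quiet]

states = wake_states(frames, threshold=0.01, hangover_frames=2)
print(states)
```

The hangover keeps the system awake across short pauses in speech, so it does not cycle between modes in the middle of an utterance.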

Voice features are no longer limited to handheld devices and smartphones; wearables benefit from voice interfaces that remove the need for keyboards or touchscreens. As voice capabilities mature, the interaction distance between users and devices increases. For example, some smart TVs already support voice commands from across a living room. Such deployments raise user privacy and security considerations and require robust implementation. As voice features mature, they are likely to appear in more traditional electronics.