Speech-To-Text: An Architectural Overview of Neural Networks for Converting Audio to Text

Introduction

In this article, I cover the technical side of Speech-To-Text (STT) systems, which are increasingly used by businesses. I give an overview of the neural network architecture that converts audio signals into text and highlight the changes made to the traditional transformer architecture so that it can handle audio data.

Audio Processing

Before audio is fed into a neural network, it must be preprocessed. This typically means converting the audio file into either a waveform or a mel spectrogram. A waveform represents the amplitude of the audio signal over time, whereas a mel spectrogram shows how the signal's energy is distributed across frequency bands (spaced on the perceptually motivated mel scale) and how that distribution evolves over time. Mel spectrograms are often preferred for STT tasks because they expose the frequency structure of speech directly.

Mel spectrograms are usually represented as tensors with dimensions [number of mel frequency bins, number of time steps]. For instance, a tensor of shape [128, 200] holds 128 frequency-bin values for each of the 200 time steps in the audio.
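
To make the [frequency bins, time steps] layout concrete, here is a minimal NumPy sketch of a log-mel spectrogram. The sample rate, FFT size, hop length, HTK-style mel formula, and the synthetic 440 Hz tone are all my assumptions for illustration; production pipelines typically rely on an audio library rather than hand-written code like this.

```python
import numpy as np


def hz_to_mel(hz):
    # HTK-style mel formula; other variants exist (this choice is an assumption).
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=np.float64) / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=np.float64) / 2595.0) - 1.0)


def mel_filterbank(sample_rate, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale: [n_mels, n_fft // 2 + 1]."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_spectrogram(waveform, sample_rate=16000, n_fft=1024, hop=160, n_mels=128):
    """Return a log-mel spectrogram of shape [n_mels, n_frames]."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack(
        [waveform[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2             # [n_frames, n_fft // 2 + 1]
    mel = mel_filterbank(sample_rate, n_fft, n_mels) @ power.T   # [n_mels, n_frames]
    return np.log(mel + 1e-10)                                   # small offset avoids log(0)


# A synthetic 440 Hz tone stands in for a real recording. Its length is chosen
# so that exactly 200 frames fit: n_fft + 199 * hop samples.
sr = 16000
n_samples = 1024 + 199 * 160
wave = 0.5 * np.sin(2 * np.pi * 440.0 * np.arange(n_samples) / sr)
spec = log_mel_spectrogram(wave, sample_rate=sr)
print(spec.shape)  # (128, 200)
```

Each column of `spec` describes one short window of audio, and each row tracks one mel band over time.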

Since audio files vary in length, the number of time steps can differ significantly from one spectrogram to the next. Padding with zeros, a common approach in NLP tasks, is not always suitable for audio data because it can introduce distortions. Instead, interpolation (for example, bilinear or bicubic) can be used to stretch or shrink the time axis of the spectrogram to a consistent length.
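
The resizing step can be sketched as follows. This version uses plain linear interpolation along the time axis only, which is what a bilinear resize reduces to when the frequency axis is left unchanged; the function name `resize_time_axis` and the target length of 200 are my own illustrative choices. Deep learning frameworks offer built-in equivalents, such as `torch.nn.functional.interpolate` in PyTorch.

```python
import numpy as np


def resize_time_axis(spec, target_steps):
    """Linearly interpolate a [n_mels, n_steps] spectrogram to [n_mels, target_steps]."""
    n_mels, n_steps = spec.shape
    old_pos = np.linspace(0.0, 1.0, n_steps)        # original frame positions
    new_pos = np.linspace(0.0, 1.0, target_steps)   # positions to sample at
    return np.stack([np.interp(new_pos, old_pos, row) for row in spec])


rng = np.random.default_rng(0)
short_clip = rng.random((128, 150))   # stand-ins for two spectrograms
long_clip = rng.random((128, 260))    # of different durations
batch = np.stack([resize_time_axis(s, 200) for s in (short_clip, long_clip)])
print(batch.shape)  # (2, 128, 200)
```

After resizing, spectrograms of different durations can be stacked into a single batch tensor.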

Encoder Architecture

The encoder in an STT system differs from the encoder used in traditional NLP tasks because its input is a two-dimensional spectrogram (frequency by time) rather than a one-dimensional sequence of tokens. Convolutional neural networks (CNNs) are commonly used at this stage to extract relevant features from the mel spectrograms.

CNNs operate by sliding a filter (kernel) of a specified size across the spectrogram. The filter weights are initialized randomly and adjusted during training. The filter's output is a smaller matrix, called a feature map, that captures the local patterns of the spectrogram the filter responds to.
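
The sliding-filter operation can be sketched in a few lines of NumPy. This is a deliberately naive single-filter version (stride 1, no padding) written for readability; the 3x3 kernel size and the random input are assumptions. Note that what deep learning libraries call "convolution" is technically cross-correlation: the kernel is not flipped.

```python
import numpy as np


def conv2d_valid(spec, kernel):
    """Slide `kernel` over `spec` (stride 1, no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = spec.shape[0] - kh + 1
    out_w = spec.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the kernel with the patch under it, then sum.
            out[i, j] = np.sum(spec[i : i + kh, j : j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
spec = rng.random((128, 200))        # stand-in for a mel spectrogram
kernel = rng.normal(size=(3, 3))     # random initial weights; training would update them
feature_map = conv2d_valid(spec, kernel)
print(feature_map.shape)  # (126, 198)
```

A real CNN layer applies many such filters in parallel, so its output has one feature map per filter.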

Multiple CNN layers can be stacked, and pooling layers may be interspersed to reduce dimensionality. The modified encoder typically ends with classic transformer encoder layers, which combine sinusoidal positional encoding, multi-head attention, and feed-forward (MLP) blocks. For more details on these components, refer to my previous article on transformer architectures.
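
As a rough sketch of how a pooled feature map becomes transformer input, the snippet below applies 2x2 max pooling, treats each remaining time step as one token, and adds sinusoidal positional encodings. The pooling size, `d_model = 64`, and the random projection matrix (standing in for a learned linear layer) are assumptions for illustration, not details taken from a specific model.

```python
import numpy as np


def max_pool2d(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[: h * size, : w * size].reshape(h, size, w, size).max(axis=(1, 3))


def sinusoidal_positional_encoding(n_positions, d_model):
    """Sine on even channels, cosine on odd channels; d_model is assumed to be even."""
    positions = np.arange(n_positions)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)
    enc = np.zeros((n_positions, d_model))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc


feature_map = np.random.default_rng(0).random((126, 198))   # stand-in CNN output
pooled = max_pool2d(feature_map)                            # (63, 99)
tokens = pooled.T                                           # one vector per time step: (99, 63)

# A learned linear layer would normally map features to d_model;
# a random matrix stands in for it here.
d_model = 64
projection = np.random.default_rng(1).normal(size=(tokens.shape[1], d_model)) * 0.1
encoder_input = tokens @ projection + sinusoidal_positional_encoding(tokens.shape[0], d_model)
print(encoder_input.shape)  # (99, 64)
```

The resulting matrix has one row per time step, which is the sequence layout that multi-head attention layers expect.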

Text Processing

In some STT systems, the text side is handled at the level of phonemes, which are the basic units of sound in a language. The original text is converted into phonemes, tokenized, and fed into the decoder. The decoder's processing and the overall principles of transformer architectures have been described in my previous articles and will not be repeated here.
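
The text-to-phoneme-to-token path can be illustrated with a toy example. The three-word lexicon, the ARPAbet-style symbols, and the special tokens below are hypothetical and exist only for this sketch; a real system would use a full pronunciation dictionary or a grapheme-to-phoneme model.

```python
# Hypothetical toy lexicon with ARPAbet-style phoneme symbols.
TOY_LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "to": ["T", "UW"],
    "text": ["T", "EH", "K", "S", "T"],
}
SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]


def build_vocab(lexicon):
    """Assign an integer ID to every special token and every phoneme in the lexicon."""
    phonemes = sorted({p for pron in lexicon.values() for p in pron})
    return {symbol: idx for idx, symbol in enumerate(SPECIALS + phonemes)}


def text_to_token_ids(text, lexicon, vocab):
    """Convert text to phonemes, then to token IDs wrapped in start/end markers."""
    ids = [vocab["<sos>"]]
    for word in text.lower().split():
        # Words missing from the lexicon map to a single unknown token.
        for phoneme in lexicon.get(word, ["<unk>"]):
            ids.append(vocab[phoneme])
    ids.append(vocab["<eos>"])
    return ids


vocab = build_vocab(TOY_LEXICON)
token_ids = text_to_token_ids("Speech to text", TOY_LEXICON, vocab)
print(token_ids)  # [1, 9, 8, 6, 4, 10, 11, 10, 5, 7, 9, 10, 2]
```

These integer IDs are what the decoder's embedding layer would consume.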

Conclusion

This article aimed to provide a technical understanding of STT systems, highlighting the modifications made to transformer architectures to handle audio data. For insights into the business benefits of STT technology, please refer to my article "Expanding Horizons: How Voice Call Analysis is Changing the CRM and ERP Landscape."

I hope this article was informative and engaging. Stay tuned for future discussions!