This is the third article in a series on audio deep learning, where the author aims to explain not just how techniques work but also why they work that way. This installment focuses on preparing and augmenting audio data for deep learning models.
To get the best performance from deep learning models on audio data, we need to tune the Mel Spectrograms for the specific problem. Doing that well requires understanding how spectrograms are constructed, using techniques like the Fast Fourier Transform (FFT) and the Short-time Fourier Transform (STFT).
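As a minimal sketch of that construction, here is how a log-scaled spectrogram can be built from raw audio with librosa's STFT. The file name and parameter values are illustrative placeholders, not settings from the article:

```python
import numpy as np
import librosa

# Load audio as a 1-D float waveform at a fixed sample rate
y, sr = librosa.load("speech_sample.wav", sr=22050)

# STFT: slide a 2048-sample window over the signal in 512-sample hops,
# taking an FFT of each window to capture frequency content over time
stft = librosa.stft(y, n_fft=2048, hop_length=512)

# Magnitudes converted to decibels give the familiar log-scaled spectrogram
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
print(spectrogram_db.shape)  # (frequency bins, time frames)
```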
The key hyperparameters for tuning Mel Spectrograms include the FFT window size (which sets frequency resolution), the hop length between successive windows (which sets time resolution), the number of Mel frequency bands, and the minimum and maximum frequencies to retain.
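A hedged sketch of generating a Mel Spectrogram with those hyperparameters exposed for tuning; the values shown are common defaults, not recommendations from the article:

```python
import numpy as np
import librosa

y, sr = librosa.load("speech_sample.wav", sr=22050)

mel = librosa.feature.melspectrogram(
    y=y,
    sr=sr,
    n_fft=2048,      # FFT window size: frequency resolution per frame
    hop_length=512,  # stride between frames: time resolution
    n_mels=64,       # number of Mel frequency bands
    fmin=0,
    fmax=8000,       # frequency range to keep, e.g. a speech-relevant band
)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale for model input
```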
For audio deep learning problems involving human speech, such as Automatic Speech Recognition (ASR), Mel Frequency Cepstral Coefficients (MFCC) can sometimes perform better than Mel Spectrograms. MFCC applies further processing to the Mel Spectrogram to extract a compressed set of coefficients that capture the frequency ranges most characteristic of human speech.
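A minimal sketch of extracting MFCCs with librosa, which internally applies a Mel filter bank, a log, and a discrete cosine transform to compress each frame into a small set of coefficients. The file name and coefficient count are illustrative:

```python
import librosa

y, sr = librosa.load("speech_sample.wav", sr=22050)

# 13-40 coefficients per frame is a typical range for ASR front ends
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (n_mfcc, time frames)
```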
Data augmentation is a technique to increase the diversity of the dataset by modifying existing data samples in small ways. For audio, augmentation can be applied either to the raw waveform (e.g. time shift, pitch shift, added noise) or to the generated spectrogram (e.g. masking bands of frequencies or spans of time steps), as sketched below.
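Hedged sketches of both kinds of augmentation: a time shift and pitch shift on the raw waveform, and SpecAugment-style frequency/time masking on the spectrogram. All parameter values (shift size, semitones, mask widths) are illustrative assumptions, not taken from the article:

```python
import numpy as np
import librosa

y, sr = librosa.load("speech_sample.wav", sr=22050)

# Raw-audio augmentation: shift the waveform by a random offset, wrapping around
shift = np.random.randint(sr // 10)  # up to 100 ms
y_shifted = np.roll(y, shift)

# Also common on raw audio: pitch shifting by a few semitones
y_pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# Spectrogram augmentation: mask a random band of frequencies and a random
# span of time frames (SpecAugment-style), filling with the mean value
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
f0 = np.random.randint(max(1, mel.shape[0] - 8))
t0 = np.random.randint(max(1, mel.shape[1] - 16))
mel_masked = mel.copy()
mel_masked[f0:f0 + 8, :] = mel.mean()   # frequency mask
mel_masked[:, t0:t0 + 16] = mel.mean()  # time mask
```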
This article covered essential techniques for preparing audio data for deep learning models: optimizing Mel Spectrograms through hyperparameter tuning, extracting MFCCs for speech problems, and augmenting data on both raw audio and spectrograms. With these techniques, the author aims to provide a foundation for the audio deep learning applications explored in subsequent articles.