Summary of Audio Deep Learning Made Simple (Part 3): Data Preparation and Augmentation

  • towardsdatascience.com

    Introduction to Audio Deep Learning

    This is the third article in a series on audio deep learning, where the author aims to explain not just how techniques work but also why they work that way. The series covers:

    • State-of-the-art audio deep learning techniques and the importance of spectrograms
    • Generating Mel Spectrograms for audio data preprocessing in Python
    • Enhancing Mel Spectrograms and data augmentation techniques (this article)
    • Sound classification with end-to-end examples and architectures
    • Automatic speech recognition using CTC loss and decoding for sequence alignment
    • Beam search algorithm used in speech-to-text and NLP applications

    Optimizing Mel Spectrograms with Hyperparameter Tuning

    To get the best performance from deep learning models on audio data, we need to optimize the Mel Spectrograms for the specific problem. This involves understanding how Spectrograms are constructed using techniques like the Fast Fourier Transform (FFT) and Short-time Fourier Transform (STFT).

    The key hyperparameters for tuning Mel Spectrograms include:

    • Frequency Bands: fmin (minimum frequency), fmax (maximum frequency), n_mels (number of Mel frequency bins)
    • Time Sections: n_fft (window length for each time section), hop_length (number of samples to slide the window)
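To make the time-section parameters concrete, here is a minimal numpy sketch of how `n_fft` and `hop_length` slice a signal into overlapping windows and turn each window into one spectrum. The sample rate and parameter values are illustrative choices, not recommendations; in practice a library call such as `librosa.feature.melspectrogram` exposes all of these (`sr`, `n_fft`, `hop_length`, `n_mels`, `fmin`, `fmax`) as arguments.

```python
import numpy as np

# Illustrative settings; real values depend on the problem being tuned.
sample_rate = 22050          # samples per second
n_fft = 1024                 # window length for each time section
hop_length = 256             # samples to slide the window between sections

signal = np.random.default_rng(0).standard_normal(sample_rate)  # 1 s of noise

# Slice the signal into overlapping windows (the "time sections" of the STFT).
n_frames = 1 + (len(signal) - n_fft) // hop_length
frames = np.stack([signal[i * hop_length : i * hop_length + n_fft]
                   for i in range(n_frames)])

# FFT of each windowed frame gives one spectrum per time section.
window = np.hanning(n_fft)
spectra = np.abs(np.fft.rfft(frames * window, axis=1))

print(spectra.shape)  # (time sections, frequency bins) = (n_frames, n_fft//2 + 1)
```

A larger `hop_length` produces fewer time sections (coarser time resolution); a larger `n_fft` produces more frequency bins per section (finer frequency resolution). The Mel step then maps those linear frequency bins down to `n_mels` bands between `fmin` and `fmax`.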

    MFCC for Human Speech

For audio deep learning problems involving human speech, such as Automatic Speech Recognition (ASR), Mel Frequency Cepstral Coefficients (MFCC) can sometimes perform better than Mel Spectrograms. MFCC applies a further transform to the Mel Spectrogram to extract a compact set of coefficients that capture the frequency ranges most relevant to human speech.
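The "further processing" step is essentially a Discrete Cosine Transform (type II) along the Mel axis, keeping only the first few coefficients. The sketch below shows just that compression step on a toy log-Mel matrix; the shapes and coefficient count are illustrative, and real pipelines would typically call something like `librosa.feature.mfcc` with an `n_mfcc` argument instead.

```python
import numpy as np

# Toy log-Mel spectrogram: (n_mels, time) — values are illustrative only.
rng = np.random.default_rng(1)
log_mel = rng.standard_normal((64, 100))    # 64 Mel bands, 100 time sections

# MFCCs take the DCT-II along the Mel axis and keep only the first few
# coefficients, compressing each column's spectral shape into a small vector.
n_mels, n_mfcc = log_mel.shape[0], 13
n = np.arange(n_mels)
basis = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
mfcc = basis @ log_mel                      # shape: (n_mfcc, time)

print(mfcc.shape)  # (13, 100)
```

Note that the 0th DCT basis row is all ones, so the first coefficient is simply the sum (overall energy) of each log-Mel column; the higher coefficients describe progressively finer spectral shape.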

    Data Augmentation for Audio

    Data augmentation is a technique to increase the diversity of the dataset by modifying existing data samples in small ways. For audio data, augmentation can be performed on the raw audio or the generated spectrogram.

    • Spectrogram Augmentation: SpecAugment blocks out sections of the spectrogram using frequency masks (horizontal bars) and time masks (vertical bars).
    • Raw Audio Augmentation:
      • Time Shift: Shift the audio left or right by a random amount, wrapping or padding the ends
      • Pitch Shift: Raise or lower the pitch of the sound by a random amount without changing its duration
      • Time Stretch: Randomly slow down or speed up the sound without changing its pitch
      • Add Noise: Add small random values (e.g. Gaussian noise) to the audio signal
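The simpler augmentations above can be sketched with numpy alone. Below are illustrative (not library-API) implementations of time shift and added noise on raw audio, plus SpecAugment-style frequency and time masks on a toy spectrogram; the mask widths, shift range, and noise level are arbitrary choices. Pitch shift and time stretch need resampling or a phase vocoder and are usually delegated to a library (e.g. `librosa.effects.pitch_shift` / `librosa.effects.time_stretch`), so they are omitted here.

```python
import numpy as np

rng = np.random.default_rng(42)

# --- Raw audio augmentation ---
audio = rng.standard_normal(16000)          # 1 s of toy audio at 16 kHz

shift = rng.integers(-1600, 1600)           # time shift: roll up to 0.1 s either way
shifted = np.roll(audio, shift)

noisy = audio + 0.005 * rng.standard_normal(audio.shape)  # add low-level noise

# --- Spectrogram augmentation (SpecAugment-style masks) ---
spec = rng.standard_normal((64, 100))       # toy (freq bins, time steps) spectrogram
masked = spec.copy()

f0 = rng.integers(0, 64 - 8)                # frequency mask: a horizontal bar
masked[f0 : f0 + 8, :] = 0.0

t0 = rng.integers(0, 100 - 10)              # time mask: a vertical bar
masked[:, t0 : t0 + 10] = 0.0

print(shifted.shape, noisy.shape, masked.shape)
```

Because augmentation is applied randomly per sample at training time, each epoch effectively sees a slightly different dataset, which is what discourages the model from memorizing incidental details of the training examples.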

    Conclusion

This article covered essential techniques for preparing audio data for deep learning models: tuning Mel Spectrogram hyperparameters, using MFCC features for human speech, and augmenting both raw audio and spectrograms. With these techniques, the author aims to provide a foundation for understanding and implementing various audio deep learning applications, which will be explored in subsequent articles.
