Summary of Dhwani Dataset – AI4BHĀRAT

  • ai4bharat.iitm.ac.in

    Overview of the YouTube Audio Dataset (Dhwani)

    Dhwani is a corpus of unlabeled audio for automatic speech recognition (ASR), collected from YouTube videos and News on AIR news bulletins. It contains raw audio recordings in 40 Indian languages, making it a useful resource for speech recognition and language modeling work on Indian languages.

    Dataset Details

    • The dataset comprises audio files from two sources:
      • YouTube videos
      • News on AIR (NewsOnAir.gov.in) news bulletins
    • The audio files are organized into separate folders for each language.
    • For YouTube audio, the filenames are the YouTube video IDs.
    • For News on AIR audio, the filenames are a concatenation of the region name and bulletin timing.
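
    The per-language layout described above can be inspected with a short script. This is a minimal sketch, not part of the dataset's tooling: the function name `count_audio_by_language` and the set of extensions are assumptions, and the root you pass in (for example the `YT` folder) depends on where the archive was unpacked.

```python
from collections import Counter
from pathlib import Path

# Assumption: audio files end in .wav or .mp3, as in the example trees.
AUDIO_SUFFIXES = {".wav", ".mp3"}


def count_audio_by_language(root):
    """Count audio files under each top-level language folder of `root`.

    Subfolders are searched recursively, so nested layouts are handled too.
    """
    counts = Counter()
    for lang_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[lang_dir.name] = sum(
            1
            for f in lang_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in AUDIO_SUFFIXES
        )
    return dict(counts)
```

    Calling `count_audio_by_language("YT")` would return a dictionary mapping each language folder name to its number of audio files.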

    Folder Structure

    The folder structure of the dataset is as follows:

    For YouTube

    YT
    ├── bengali
    │   ├── XXXXXXXXXXX.wav
    │   ├── XXXXXXXXXXX.wav
    │   ├── XXXXXXXXXXX.wav
    │   └── ...
    ├── gujarati
    ├── ...
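
    Because each YouTube filename is the video ID, the language and ID can be recovered from a file path alone. The sketch below assumes the `YT/<language>/<video id>.wav` layout shown above; the helper names are hypothetical, and the watch URL is only the standard YouTube link form, with no guarantee that a given video is still available.

```python
from pathlib import Path


def youtube_info_from_path(path):
    """Return (language, video_id) for a path like YT/bengali/<id>.wav."""
    p = Path(path)
    # The parent folder is the language; the filename stem is the video ID.
    return p.parent.name, p.stem


def watch_url(video_id):
    """Build the standard YouTube watch URL for a video ID."""
    return f"https://www.youtube.com/watch?v={video_id}"


language, video_id = youtube_info_from_path("YT/bengali/XXXXXXXXXXX.wav")
print(language, video_id)  # bengali XXXXXXXXXXX
```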
    

    For News on AIR

    NOA
    ├── Audio
    │   ├── assamese
    │   │   ├── audio
    │   │   │   ├── newsonair.nic.in
    │   │   │   │   ├── NSD-Assamese-Assamese-0705-0715-201810107486.mp3
    │   │   │   │   ├── NSD-Assamese-Assamese-0705-0715-20181011161537.mp3
    ├── gujarati
    ├── ...
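
    The News on AIR filenames can be split into their hyphen-separated parts. The field interpretation below is an assumption inferred only from the two example filenames above (a prefix, one or more name labels, two four-digit times, and a trailing numeric field whose meaning is not documented here); `parse_noa_filename` and `BulletinName` are hypothetical names, not part of any released tooling.

```python
from pathlib import Path
from typing import NamedTuple, Tuple


class BulletinName(NamedTuple):
    prefix: str              # first field, e.g. "NSD"
    labels: Tuple[str, ...]  # name fields, e.g. ("Assamese", "Assamese")
    start: str               # assumed bulletin start time, HHMM
    end: str                 # assumed bulletin end time, HHMM
    stamp: str               # trailing numeric field, kept as raw text


def parse_noa_filename(path):
    """Split a News on AIR filename into its assumed fields."""
    parts = Path(path).stem.split("-")
    if len(parts) < 5:
        raise ValueError(f"unexpected filename layout: {path}")
    prefix, *labels = parts[:-3]
    start, end, stamp = parts[-3:]
    # Reject names whose timing fields are not four digits each.
    for field in (start, end):
        if len(field) != 4 or not field.isdigit():
            raise ValueError(f"unexpected time field {field!r} in {path}")
    return BulletinName(prefix, tuple(labels), start, end, stamp)


name = parse_noa_filename("NSD-Assamese-Assamese-0705-0715-201810107486.mp3")
print(name.labels, name.start, name.end)  # ('Assamese', 'Assamese') 0705 0715
```

    Raising `ValueError` on unexpected layouts keeps files that do not follow the assumed pattern from being silently misparsed.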
    

    Downloads

    Citing the Dataset

    If you use the Dhwani dataset or any other resources from AI4Bharat, please cite the following article:

    @dataset{
    
    }
    

    License

    The Dhwani dataset, models, and code are released under the MIT License.

    Contributors

    • Tahir Javed (IITM, AI4Bharat)
    • Sumanth Doddapaneni (AI4Bharat, RBCDSAI)
    • Abhigyan Raman (AI4Bharat)
    • Kaushal Bhogale (AI4Bharat)
    • Gowtham Ramesh (AI4Bharat, RBCDSAI)
    • Anoop Kunchukuttan (Microsoft, AI4Bharat)
    • Pratyush Kumar (Microsoft, AI4Bharat)
    • Mitesh Khapra (IITM, AI4Bharat, RBCDSAI)

    Acknowledgments

    The contributors acknowledge the following organizations and entities for their support and contributions:

    • EkStep Foundation for their grant to set up the Centre for AI4Bharat at IIT Madras
    • The Ministry of Electronics and Information Technology (MeitY), through the National Language Translation Mission (NLTM), for its grant to support the creation of datasets and models for Indian languages under the Bhashini project
    • Centre for Development of Advanced Computing, India (C-DAC) for providing access to the Param Siddhi supercomputer for training models
    • Microsoft for its grant to create datasets, tools, and resources for Indian languages
