Dhwani is an unlabeled automatic speech recognition (ASR) corpus collected from YouTube videos and News on AIR news bulletins. It contains raw audio recordings in 40 Indian languages, which makes it a useful resource for speech recognition and language modeling work on those languages.
The folder structure of the dataset is as follows, with YouTube audio under YT and News on AIR audio under NOA:
YT
├── bengali
│ ├── XXXXXXXXXXX.wav
│ ├── XXXXXXXXXXX.wav
│ ├── XXXXXXXXXXX.wav
│ └── ...
├── gujarati
├── ...
NOA
├── Audio
│ ├── assamese
│ │ ├── audio
│ │ │ ├── newsonair.nic.in
│ │ │ │ ├── NSD-Assamese-Assamese-0705-0715-201810107486.mp3
│ │ │ │ ├── NSD-Assamese-Assamese-0705-0715-20181011161537.mp3
│ ├── gujarati
│ ├── ...
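As a quick sanity check after downloading, the layout above can be walked to count audio files per language. The following is a minimal sketch, not part of the Dhwani release: the function name `count_audio_by_language` and the root path `path/to/dhwani` are hypothetical, and it assumes that language folders sit directly under `YT` and under `NOA/Audio`. It searches each language folder recursively, so it does not depend on how deeply the News on AIR files are nested.

```python
from pathlib import Path

# File extensions that appear in the folder listing above.
AUDIO_SUFFIXES = {".wav", ".mp3"}


def count_audio_by_language(root):
    """Return {(source, language): number_of_audio_files}.

    Assumes (hypothetically) that language folders are the direct
    children of <root>/YT and <root>/NOA/Audio.
    """
    root = Path(root)
    bases = {
        "YT": root / "YT",
        "NOA": root / "NOA" / "Audio",
    }
    counts = {}
    for source, base in bases.items():
        if not base.is_dir():
            # Skip a source that has not been downloaded.
            continue
        for lang_dir in sorted(p for p in base.iterdir() if p.is_dir()):
            # Recursive search: works for flat YT folders and nested NOA folders.
            n_files = sum(
                1
                for f in lang_dir.rglob("*")
                if f.is_file() and f.suffix.lower() in AUDIO_SUFFIXES
            )
            counts[(source, lang_dir.name)] = n_files
    return counts


if __name__ == "__main__":
    # "path/to/dhwani" is a placeholder; point it at your local copy.
    for (source, language), n in count_audio_by_language("path/to/dhwani").items():
        print(f"{source}\t{language}\t{n}")
```

If a source folder is missing, it is skipped rather than raising an error, so the same call works on a partial download.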
If you use the Dhwani dataset or any other resources from AI4Bharat, please cite the following article:
@dataset{
}
The Dhwani dataset, models, and code are released under the MIT License.
The contributors acknowledge the following organizations and entities for their support and contributions: