Summary of Azure Neural TTS voices upgraded to 48kHz with HiFiNet2 vocoder

  • techcommunity.microsoft.com

    Azure Neural TTS: Hi-Fi Text-to-Speech with 48kHz Voices

    Azure Neural Text-to-Speech (Neural TTS) has introduced a new high-fidelity vocoder, HiFiNet2, which enables more than 400 voices to produce 48kHz audio. The result is synthetic speech with richer detail and a cleaner, more enjoyable sound, which is especially valuable for scenarios like video dubbing, gaming, and singing.
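
    As a hedged illustration of how a client might ask for 48kHz output, the sketch below builds (but does not send) a request to the Speech service REST endpoint. None of this comes from the article: the endpoint pattern, the header names, and the `riff-48khz-16bit-mono-pcm` format string are assumptions based on the public REST API, and the region, key, and voice name are placeholders.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr

# Placeholders, not values from the article.
REGION = "westus"                 # assumption: the region of your Speech resource
SUBSCRIPTION_KEY = "YOUR_KEY"     # assumption: your Speech resource key
VOICE_NAME = "en-US-JennyNeural"  # assumption: any neural voice name

# Assumed output-format string for 48kHz, 16-bit, mono PCM in a WAV container.
OUTPUT_FORMAT_48K = "riff-48khz-16bit-mono-pcm"


def build_tts_request(text, region=REGION, key=SUBSCRIPTION_KEY, voice=VOICE_NAME):
    """Build a POST request asking the service for 48kHz audio (not sent here)."""
    # SSML body: the text is XML-escaped and the voice name is safely quoted.
    ssml = (
        "<speak version='1.0' xml:lang='en-US'>"
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )
    return urllib.request.Request(
        url=f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
        data=ssml.encode("utf-8"),
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": OUTPUT_FORMAT_48K,
            "User-Agent": "tts-48khz-sketch",
        },
    )


request = build_tts_request("Hello from a 48kHz voice.")
print(request.full_url)
```

    With a valid key, passing the request to `urllib.request.urlopen` and reading the response would be expected to return WAV bytes; the sketch stops short of that so it runs without credentials.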

    Understanding HiFiNet2: The Key to Hi-Fi Text-to-Speech

    HiFiNet2 is a neural-network-based vocoder, the component that generates the synthesized speech audio. It takes acoustic features, an intermediate representation of the audio, and transforms them into an audible waveform. HiFiNet2 is a major upgrade over its predecessor, HiFiNet1, which was limited to 24kHz.
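
    To make the vocoder's job concrete, the arithmetic below shows how many waveform samples must be produced for each acoustic-feature frame at 24kHz versus 48kHz. The 12.5 ms frame shift is a hypothetical value chosen for illustration; the article does not state the frame rate HiFiNet2 uses.

```python
# Hypothetical frame shift; the article does not give HiFiNet2's actual value.
FRAME_SHIFT_MS = 12.5


def samples_per_frame(sample_rate_hz, frame_shift_ms=FRAME_SHIFT_MS):
    """Waveform samples the vocoder must emit for one acoustic-feature frame."""
    return round(sample_rate_hz * frame_shift_ms / 1000)


def samples_for_utterance(num_frames, sample_rate_hz, frame_shift_ms=FRAME_SHIFT_MS):
    """Total output samples for an utterance of `num_frames` feature frames."""
    return num_frames * samples_per_frame(sample_rate_hz, frame_shift_ms)


for rate in (24_000, 48_000):
    print(rate, "Hz:", samples_per_frame(rate), "samples per frame")
```

    Under this assumption the same sequence of frames yields twice as many output samples at 48kHz, which is one plausible reason a 48kHz voice costs somewhat more to synthesize than a 24kHz one.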

    Key Improvements Brought by HiFiNet2

    • Enhanced Audio Fidelity: HiFiNet2 produces 48kHz voices that are significantly more detailed and natural than 24kHz voices. The improvement is most noticeable in scenarios that demand high-quality audio, such as video dubbing and gaming.
    • Separated Bandwidth Design: HiFiNet2 uses a separated bandwidth design, processing different frequency ranges with specialized model structures. This improves voice quality and reduces inference cost.
    • Universal Voice Capability: The new vocoder is universal, meaning it can be applied to any speaker without additional training. This makes it easier and faster to create custom voices, particularly from limited data or less professional recordings.
    • Faster Inference Speed: HiFiNet2 offers faster inference than HiFiNet1, especially for 24kHz voices. Although 48kHz voices carry a slightly higher inference cost, their latency remains comparable to that of the current 24kHz voices.
    • Unified Framework for Diverse Applications: HiFiNet2 acts as a unified framework for various applications, including speech generation, singing, and bandwidth extension. This flexibility allows for more efficient resource management and faster development cycles.

    Benefits of 48kHz Audio in Text-to-Speech

    The higher sampling rate of 48kHz provides several advantages in text-to-speech:

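    • Richer Sound: A 48kHz sampling rate can represent frequencies up to 24kHz, compared with 12kHz at a 24kHz rate, so the audio keeps subtle high-frequency details and nuances that lower-rate output discards, resulting in a richer, more realistic listening experience.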
    • Enhanced Clarity: With higher fidelity, 48kHz voices are less prone to distortion and artifacts, making them sound cleaner and more enjoyable.
    • Improved Timbre: 48kHz voices more accurately replicate the original speaker's timbre, making them sound more authentic and natural.
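
    The trade-off behind these benefits can be worked out directly. The sketch below computes the highest representable frequency (half the sampling rate) and the raw PCM data rate for both sampling rates; the 16-bit mono format is an assumption made for the arithmetic, not something the article specifies.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2


def pcm_bytes_per_second(sample_rate_hz, bits_per_sample=16, channels=1):
    """Uncompressed PCM data rate; 16-bit mono is assumed by default."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


for rate in (24_000, 48_000):
    print(
        f"{rate} Hz: content up to {nyquist_hz(rate):.0f} Hz, "
        f"{pcm_bytes_per_second(rate)} bytes/s of raw PCM"
    )
```

    Doubling the sampling rate doubles both the representable bandwidth and the size of uncompressed audio, which is worth budgeting for when storing or streaming 48kHz output.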

    Custom Neural Voice: Creating Unique Voices with HiFiNet2

    The Custom Neural Voice feature lets users create a unique synthetic voice that matches their brand or application. With HiFiNet2, these custom voices can also be produced at a 48kHz sampling rate, so organizations get a distinctive brand voice at high fidelity.

    Bandwidth Extension: Boosting Audio Quality with HiFiNet2

    Bandwidth extension technology upgrades existing voices to higher fidelity quickly and efficiently. This is especially beneficial for customers who built their voices with lower-fidelity data and want to improve their quality. HiFiNet2 supports this with a specialized model structure for bandwidth extension, so existing voices can reach high fidelity with minimal effort.

    • Faster Time-to-Market: Bandwidth extension technology significantly reduces the time required to upgrade existing voices, allowing for faster deployment of high-fidelity text-to-speech solutions.
    • Improved Audio Quality: Bandwidth extension can enhance existing audio recordings, boosting lower-fidelity audio to higher fidelity and enriching the listening experience.
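
    A toy experiment helps show why bandwidth extension needs a model rather than plain resampling. The sketch below is not HiFiNet2's method and all names in it are illustrative: it upsamples a 1kHz tone from 24kHz to 48kHz by linear interpolation, then measures what fraction of the signal's energy lies above 12kHz, the band that 24kHz audio cannot contain.

```python
import cmath
import math


def tone(freq_hz, sample_rate_hz, num_samples):
    """A sine tone sampled at the given rate."""
    return [
        math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
        for i in range(num_samples)
    ]


def upsample_2x_linear(samples):
    """Double the sampling rate by inserting midpoints (signal treated as periodic)."""
    out = []
    n = len(samples)
    for i in range(n):
        out.append(samples[i])
        out.append(0.5 * (samples[i] + samples[(i + 1) % n]))
    return out


def high_band_energy_fraction(samples, sample_rate_hz, cutoff_hz):
    """Fraction of spectral energy above cutoff_hz, using a direct DFT."""
    n = len(samples)
    total = 0.0
    high = 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            samples[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n)
        )
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate_hz / n > cutoff_hz:
            high += energy
    return high / total


narrowband = tone(1_000, 24_000, 240)          # exactly 10 cycles of a 1kHz tone
resampled = upsample_2x_linear(narrowband)     # now nominally 48kHz
fraction = high_band_energy_fraction(resampled, 48_000, 12_000)
print(f"energy above 12kHz after naive upsampling: {fraction:.6%}")
```

    The resampled signal sits in a 48kHz container, but almost none of its energy is above 12kHz. Filling that upper band with plausible content is the part that a bandwidth extension model has to predict.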

    Singing Voices: Expanding Text-to-Speech Capabilities with HiFiNet2

    Singing voices pose a significant challenge for traditional speech vocoders due to the need for stable pitch, accurate reconstruction of long notes, and exceptionally high fidelity. HiFiNet2 overcomes these challenges with its unified framework, enabling the generation of both high-quality speech and singing voices with a single model.

    • High-Quality Singing Voice: HiFiNet2 delivers a highly realistic and engaging singing voice, accurately replicating the nuances and complexities of human vocal performance.
    • Efficient Inference Speed: Despite the complexity of singing voice generation, HiFiNet2 maintains fast inference speed, allowing for seamless integration into various applications.

    Conclusion: Empowering Exceptional Text-to-Speech Experiences with HiFiNet2

    HiFiNet2 marks a significant step forward in text-to-speech technology, combining high-fidelity 48kHz audio, greater efficiency, and support for speech, singing, and bandwidth extension in one framework. This opens up new possibilities for users seeking to create natural and engaging voice experiences, whether for video dubbing, gaming, singing, or simply delivering high-quality, natural-sounding audio.
