Azure Neural Text-to-Speech (Neural TTS) has introduced a new high-fidelity vocoder, HiFiNet2, which enables over 400 voices to produce 48kHz audio. The result is synthetic speech with richer detail and cleaner sound, which is especially valuable for scenarios like video dubbing, gaming, and singing.
HiFiNet2 is a neural-network-based vocoder, the component that generates the final synthesized speech audio. It takes acoustic features (an intermediate representation of the audio) and transforms them into an audible waveform. HiFiNet2 is a major upgrade from its predecessor, HiFiNet1, which was limited to 24kHz.
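To make the vocoder's role concrete, here is a deliberately simplified sketch of the "features in, waveform out" interface. It is not HiFiNet2 and not a neural network: it assumes the only acoustic feature is a per-frame pitch value and renders a plain sine wave. The function name `toy_vocoder`, the 10 ms frame length (`hop_length=480` at 48kHz), and the amplitude are all illustrative assumptions.

```python
import math

def toy_vocoder(f0_per_frame, sample_rate=48000, hop_length=480):
    """Render a waveform from per-frame pitch values in Hz (0 means silence).

    Hypothetical stand-in for a real vocoder: each frame of features
    becomes `hop_length` output samples.
    """
    samples = []
    phase = 0.0  # carried across frames so the tone stays continuous
    for f0 in f0_per_frame:
        for _ in range(hop_length):
            if f0 > 0:
                phase += 2.0 * math.pi * f0 / sample_rate
                samples.append(0.5 * math.sin(phase))
            else:
                samples.append(0.0)
    return samples

# 100 frames of 10 ms each: a low note, a short pause, then a higher note.
frames = [220.0] * 50 + [0.0] * 10 + [440.0] * 40
waveform = toy_vocoder(frames)
print(len(waveform), "samples,", len(waveform) / 48000, "seconds")
```

A real neural vocoder consumes much richer features (for example, spectral frames) and learns the mapping from data, but the shape of the problem is the same: a short sequence of frames expands into a much longer sequence of audio samples.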
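The higher sampling rate of 48kHz provides several advantages in text-to-speech. By the Nyquist criterion, 48kHz audio can represent frequencies up to 24kHz, which covers the full nominal range of human hearing, whereas 24kHz audio is limited to 12kHz.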
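The effect of the sampling rate can be worked out with simple arithmetic. The sketch below computes the highest representable frequency (half the sampling rate) and the raw data rate for both rates. The 16-bit mono PCM encoding is an assumption chosen for illustration, and the helper names are hypothetical.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2

def pcm_bytes_per_second(sample_rate_hz, bits_per_sample=16, channels=1):
    """Raw (uncompressed) PCM data rate; 16-bit mono is an assumption."""
    return sample_rate_hz * bits_per_sample // 8 * channels

for rate in (24000, 48000):
    print(
        f"{rate} Hz: up to {nyquist_hz(rate):.0f} Hz of audio content, "
        f"{pcm_bytes_per_second(rate)} bytes/s uncompressed"
    )
```

Doubling the sampling rate doubles both the representable bandwidth and the uncompressed data rate, so the extra detail comes at the cost of larger audio output.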
The Custom Neural Voice feature allows users to create unique synthetic voices tailored to their brand or application. With HiFiNet2, this capability extends to high-fidelity voices with a sampling rate of 48kHz, so organizations can build a distinctive brand voice without giving up audio quality.
Bandwidth extension technology allows for the quick and efficient upgrade of existing voices to higher fidelity. This is especially beneficial for customers who have built their voices with lower-fidelity data and want to improve their quality. HiFiNet2 enables this process with its specialized model structure for bandwidth extension, making it possible to achieve high-fidelity voices with minimal effort.
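To see why bandwidth extension needs a dedicated model, it helps to look at the naive alternative. The sketch below doubles the sampling rate by linear interpolation. This is a baseline for contrast only, not the method HiFiNet2 uses, and the function name is hypothetical.

```python
def upsample_2x_linear(samples):
    """Double the sample count by inserting midpoints between neighbours."""
    if not samples:
        return []
    out = []
    for current, following in zip(samples, samples[1:]):
        out.append(current)
        out.append((current + following) / 2.0)
    out.append(samples[-1])
    out.append(samples[-1])  # hold the last value so the length is exactly 2x
    return out

low_rate = [0.0, 1.0, 0.0, -1.0]  # stand-in for a few 24kHz samples
high_rate = upsample_2x_linear(low_rate)
print(len(low_rate), "->", len(high_rate), "samples")
print(high_rate)
```

Every inserted value is derived from samples that already exist, so the output has more samples but no new information about detail above 12kHz. A bandwidth extension model has to predict that missing high-frequency content, which is why it is a modeling problem rather than a resampling step.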
Singing voices pose a significant challenge for traditional speech vocoders due to the need for stable pitch, accurate reconstruction of long notes, and exceptionally high fidelity. HiFiNet2 overcomes these challenges with its unified framework, enabling the generation of both high-quality speech and singing voices with a single model.
HiFiNet2 marks a significant step forward in text-to-speech technology, providing high-fidelity 48kHz audio, increased efficiency, and a single model that serves both speech and singing. This opens up new possibilities for users seeking to create natural and engaging voice experiences, whether for video dubbing, gaming, singing, or everyday speech output.