Pushing the boundaries of audio creation

Authors

Zalán Borsos, Matt Sharifi and Marco Tagliasacchi

An illustration showing speech patterns, iterative progress in dialogue generation, and a relaxed conversation between two voices.

Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Speech is central to human connection. It helps people around the world exchange information and ideas, express emotions and create mutual understanding. As our technology for generating natural, dynamic voices continues to improve, we're unlocking richer, more engaging digital experiences.

Over the past few years, we've been pushing the frontiers of audio generation, developing models that can produce high-quality, natural speech from a range of inputs such as text, pacing controls and particular voices. This technology powers single-speaker audio in many Google products and experiments (including Gemini Live, Project Astra, Journey Voices and YouTube Auto-Sync) and is helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Together with collaborators across Google, we recently helped develop two new features that can generate long-form, multi-speaker dialogue to make complex content more accessible:

  • NotebookLM Audio Overviews turns uploaded documents into engaging, lively dialogue. With one click, two AI hosts summarize user material, make connections between topics, and banter back and forth.
  • Illuminate creates formal, AI-generated discussions about research papers to make knowledge more accessible and digestible.

Here, we provide an overview of our latest speech generation research, which underpins all of these products and experimental tools.

Pioneering techniques for audio generation

We have been investing in audio generation research for years, exploring new ways to generate more natural dialogue in our products and experimental tools. In our previous research on SoundStorm, we first demonstrated the ability to generate 30-second segments of natural dialogue between multiple speakers.


This extended our earlier work on SoundStream and AudioLM, which allowed us to apply many text-based language modeling techniques to the problem of audio generation.

SoundStream is a neural audio codec that efficiently compresses and decompresses an audio input without compromising its quality. As part of the training process, SoundStream learns how to map audio to a sequence of acoustic tokens. These tokens capture all of the information needed to reconstruct the audio with high fidelity, including properties such as prosody and timbre.
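SoundStream's published design is based on residual vector quantization, where each quantizer stage encodes whatever the previous stages left unexplained. Here is a minimal NumPy sketch of that residual idea; the codebook sizes and dimensions are toy values for illustration, not SoundStream's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative residual vector quantizer: each stage quantizes the residual
# left by the previous stages, so early tokens carry coarse structure and
# later tokens refine the details.
NUM_STAGES, CODEBOOK_SIZE, DIM = 4, 256, 8   # toy values, not SoundStream's
codebooks = rng.normal(size=(NUM_STAGES, CODEBOOK_SIZE, DIM))

def rvq_encode(embedding: np.ndarray) -> list[int]:
    """Map one frame embedding to NUM_STAGES token ids."""
    tokens, residual = [], embedding.copy()
    for stage in range(NUM_STAGES):
        # Pick the codeword closest to the current residual.
        dists = np.linalg.norm(codebooks[stage] - residual, axis=1)
        idx = int(np.argmin(dists))
        tokens.append(idx)
        residual = residual - codebooks[stage][idx]
    return tokens

def rvq_decode(tokens: list[int]) -> np.ndarray:
    """Reconstruct the frame embedding by summing the chosen codewords."""
    return sum(codebooks[stage][idx] for stage, idx in enumerate(tokens))

frame = rng.normal(size=DIM)        # stand-in for one encoder output frame
tokens = rvq_encode(frame)
print(tokens, np.linalg.norm(frame - rvq_decode(tokens)))
```

Adding more stages shrinks the reconstruction error, which is why such codecs can trade bitrate against fidelity so smoothly.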

AudioLM treats audio generation as a language modeling task, producing the acoustic tokens of codecs such as SoundStream. As a result, the AudioLM framework makes no assumptions about the type or composition of the audio being generated, and can flexibly handle a variety of sounds without requiring architectural adjustments, making it a good candidate for modeling multi-speaker dialogues.
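In this framing, acoustic token ids play the same role that word pieces play in a text language model. A hedged PyTorch sketch of the next-token training objective follows; the tiny stand-in model and vocabulary size are placeholders, not AudioLM's actual architecture:

```python
import torch
import torch.nn as nn

VOCAB = 1024                               # illustrative codec vocabulary size
embed = nn.Embedding(VOCAB, 64)
seq_model = nn.LSTM(64, 64, batch_first=True)  # stand-in for the real model
head = nn.Linear(64, VOCAB)

# A batch of acoustic-token sequences (random ids here; codec output in practice).
tokens = torch.randint(0, VOCAB, (2, 50))

# Standard next-token prediction: shift the sequence by one position and
# apply cross-entropy, exactly as in text language modeling.
hidden, _ = seq_model(embed(tokens[:, :-1]))
logits = head(hidden)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1)
)
loss.backward()
print(float(loss))
```

Because the objective never inspects what the tokens represent, the same recipe applies to speech, music or sound effects without architectural changes.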

Example of a multi-speaker dialogue generated by NotebookLM Audio Overviews, based on a few potato-themed documents.

Building on this research, our latest speech generation technology can produce 2 minutes of dialogue with improved naturalness, speaker consistency and acoustic quality, when given a dialogue script and speaker turn markers. The model performs this task in under 3 seconds on a single Tensor Processing Unit (TPU) v5e chip, in one inference pass. This means it generates audio more than 40 times faster than real time.
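The speed-up follows directly from the quoted figures: 120 seconds of audio produced in under 3 seconds is 120 / 3 = 40 times real time. The post does not specify the input format, but a script with speaker turn markers might look something like the following purely hypothetical markup:

```text
[S1] Welcome back to the show. Today we're digging into, of all things, potatoes.
[S2] Potatoes! I didn't expect that to be this interesting, but here we are.
[S1] Right? So let's start with where they actually come from...
```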

Scaling our audio generation models

Scaling our single-speaker generation models to multi-speaker models then became a matter of data and model capacity. To help our latest speech generation model produce longer speech segments, we created an even more efficient speech codec that compresses audio into a sequence of tokens at as low as 600 bits per second, without compromising output quality.
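To put that bitrate in perspective: at 600 bits per second, a 2-minute dialogue amounts to 120 s × 600 bit/s = 72,000 bits in total. Combined with the figure of over 5,000 tokens per 2-minute dialogue mentioned below, that works out to roughly 72,000 / 5,000 ≈ 14 bits per token, assuming the two quoted numbers describe the same token stream.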


The tokens produced by our codec have a hierarchical structure and are grouped by time frame. Within a group, the first tokens capture phonetic and prosodic information, while the last tokens encode fine acoustic details.
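One way to picture this grouping is as a coarse-to-fine record per time frame; the following sketch uses illustrative token counts and ids, not the codec's real layout:

```python
from dataclasses import dataclass

@dataclass
class FrameTokens:
    """Tokens for one time frame, ordered coarse-to-fine."""
    coarse: list[int]   # phonetic / prosodic information comes first
    fine: list[int]     # fine acoustic detail comes last

# A dialogue is then a sequence of such groups, one per time frame.
dialogue: list[FrameTokens] = [
    FrameTokens(coarse=[17, 402], fine=[88, 5, 931]),   # frame 0
    FrameTokens(coarse=[17, 377], fine=[12, 640, 3]),   # frame 1
]
```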

Even with our new speech codec, producing a 2-minute dialogue requires generating over 5,000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently handle information hierarchies, matching the structure of our acoustic tokens.

With this technique, we can efficiently generate the acoustic tokens corresponding to the dialogue within a single autoregressive inference pass. Once generated, these tokens can be decoded back into an audio waveform using our speech codec.
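The post does not detail the architecture, but a common pattern for hierarchy-aware autoregressive generation (RQ-Transformer-style models, for example) is to factor the sequence into a large temporal model that steps over frames and a smaller depth model that steps over the tokens within each frame. A schematic sketch with stub models, not the actual system:

```python
import torch

torch.manual_seed(0)
NUM_FRAMES, TOKENS_PER_FRAME, VOCAB = 6, 4, 1024   # toy sizes

def temporal_context(frames: list[list[int]]) -> torch.Tensor:
    """Stub for the large temporal model: a context vector summarizing
    all previously generated frames."""
    return torch.randn(16)                  # placeholder for real features

def depth_logits(context: torch.Tensor, partial: list[int]) -> torch.Tensor:
    """Stub for the small depth model: predicts the next token within the
    current frame, coarse tokens first, fine tokens last."""
    return torch.randn(VOCAB)               # placeholder for real logits

frames: list[list[int]] = []
for _ in range(NUM_FRAMES):                 # outer loop: one step per frame
    context = temporal_context(frames)
    frame: list[int] = []
    for _ in range(TOKENS_PER_FRAME):       # inner loop: tokens within frame
        probs = torch.softmax(depth_logits(context, frame), dim=-1)
        frame.append(int(torch.multinomial(probs, 1)))
    frames.append(frame)

# `frames` now holds the hierarchical token stream; a codec decoder
# (e.g. a hypothetical codec.decode(frames)) would turn it into a waveform.
print(frames)
```

Factoring the sequence this way keeps the expensive model's context length proportional to the number of frames rather than the total token count, which is what makes 5,000-token sequences tractable in one pass.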

Animation showing how our speech generation model autoregressively produces a stream of audio tokens, which are decoded back into a waveform of a two-speaker dialogue.

To teach our model to produce realistic exchanges between multiple speakers, we pretrained it on hundreds of thousands of hours of speech data. We then fine-tuned it on a much smaller dialogue dataset with high acoustic quality and precise speaker annotations, consisting of unscripted conversations from a number of voice actors and realistic disfluencies: the "umm"s and "aah"s of real conversation. This step taught the model to reliably switch between speakers during a generated dialogue and to output only studio-quality audio with realistic pauses, tone and timing.

In line with our AI Principles and our commitment to developing and deploying AI technologies responsibly, we are embedding our SynthID technology to watermark non-transient, AI-generated audio content from these models, to help safeguard against potential misuse of this technology.


New speech experiences ahead

We're now focused on improving our model's speech intelligibility and acoustic quality, and on adding more fine-grained controls for features like prosody. At the same time, we're exploring how best to combine these advances with other modalities, such as video.

The potential applications of advanced speech generation are vast, especially when combined with our Gemini family of models. From enhancing learning experiences to making content more universally accessible, we look forward to continuing to push the boundaries of what's possible with voice-based technologies.

Acknowledgments

Authors of this work: Zalán Borsos, Matt Sharifi, Brian McWilliams, Yunpeng Li, Damien Vincent, Félix de Chaumont Quitry, Martin Sundermeyer, Eugene Kharitonov, Alex Tudor, Victor Ungureanu, Karolis Misiunas, Sertan Girgin, Jonas Rothfuss, Jake Walker and Marco Tagliasacchi.

We thank Leland Rechis, Ralph Leith, Paul Middleton, Poly Pata, Minh Truong, and RJ Skerry-Ryan for their important work on dialogue data.

We're very grateful to our teams in Labs, Illuminate, Cloud, Speech, and YouTube for their excellent work bringing these models into products.

We also thank Françoise Beaufays, Krishna Bharat, Tom Hume, Simon Tokumine and James Zhao for their support with the project.
