Insight 12: Generative Voiceover, the Next Multi-Billion Dollar Opportunity
This is the 12th issue of the AI2 Incubator's Insights newsletter, and a special one.
Previously, we discussed our reservations about whether LLM agents are ready for prime time and the challenges for startups building tools for generative AI. We also discussed the opportunities for specialized foundation models, where specialization can be based on a domain (e.g., healthcare) or a modality (e.g., audio). In this issue, we follow up with a deep dive into speech as a subdomain of audio, exploring the future of speech foundation models and generative AI for speech.
Generative AI for speech, a field traditionally referred to by names such as text-to-speech (TTS) and speech synthesis, has seen tremendous progress in recent years. TTS products are now capable of producing speech that is indistinguishable from humans in some settings. We have also seen impressive demonstrations of voice cloning, the task of training an AI to mimic a person's voice. A casual observer of the field may have the impression that we have "solved" generative AI for speech. In this article we make the contrary point: current speech synthesis technology is great at optimizing for naturalness (or minimizing synthetic-ness) but primitive in expressiveness. The grand challenge is to create AI that is capable of voiceover, approaching the infinitely rich, nuanced, and expressive qualities of human spoken communication. We call this grand challenge Generative Voiceover (GVO), articulate key factors that pave the road to GVO, and share our optimism about this journey.
Voiceover: Inspirational Examples
- Jodie Foster as Agent Starling in The Silence of the Lambs, who, according to Hannibal Lecter, was trying to hide her West Virginia accent.
- Ronald Reagan (the Great Communicator) to Mikhail Gorbachev: "Mr. Gorbachev, OPEN this gate. Mr. Gorbachev, TEAR DOWN this wall."
- David Attenborough as narrator for BBC nature documentaries.
- Morgan Freeman in Visa's "impossible deals" commercial.
- Scarlett Johansson as Samantha, introducing herself in the movie Her.
We hope it is clear from the above clips that human speech carries a level of expressiveness that current AI technology can't yet deliver. Current TTS products produce speech that sounds natural but is largely expressionless. They offer virtually no control over the prosodic characteristics of the output: pitch, intonation, rhythm, stress, and timing. The result is speech that sounds natural at first but becomes repetitive and eventually tiring. Voice cloning demos may seem cool, but the technology still has a way to go before it is ready for production, not to mention the risks of misuse.
AI speech has a long way to go!
GVO: Potential Applications
If we can make significant progress towards GVO, we could use AI to unlock a wide range of applications. To get a sense of the possibilities, let’s consider some of the common use cases of voiceover, according to ChatGPT (the examples in parentheses are added by us):
Voiceover (often abbreviated as VO) refers to a production technique in which a voice, typically of a narrator or actor, is recorded and played over a video, film, or any visual content. It is commonly used in various media formats, including television shows, movies, documentaries, commercials, video games, and instructional videos.
Voiceovers serve several purposes, such as:
- Narration: Providing information or storytelling to guide the audience through the visual content. Narrators often describe what is happening on the screen or provide additional context. (Example: David Attenborough.)
- Character voices: In animations or video games, voice actors may lend their voices to different characters, giving them a unique personality and bringing them to life. (Example: Scarlett Johansson's Samantha character.)
- Advertisement and marketing: Voiceovers are used in commercials to deliver promotional messages, explain product features, or create a specific emotional response in the audience. (Example: Morgan Freeman's Visa commercial.)
Voiceover recordings are typically done in professional recording studios, and the voice actor's performance can greatly influence the impact and effectiveness of the content. A well-done voiceover can enhance the overall viewing experience and add depth to the visuals, creating a more engaging and informative presentation.
This is a multi-billion dollar opportunity!
GVO: Product Requirements
Let's break down the traditional process for producing high-quality voiceovers and then outline how an AI system might mimic and innovate upon that process.
Traditional Voiceover Process:
- Script Creation: Every production starts with a script. This script contains not just the words to be spoken but also annotations on emphasis, pauses, emotions, and other nuances.
- Casting: Voice actors are selected based on their voice characteristics, versatility, and ability to convey different emotions.
- Director's Briefing: Before recording starts, the director provides a briefing. This is where the nuances, emotions, tone, and pace are discussed. Sometimes even the backstory of the content or character is explored to ensure the actor fully understands the intent.
- Recording: Actors record their lines. Multiple takes are common to capture slight variations.
- Feedback Loop: After recording, there is a feedback loop where the director provides input and the actor makes adjustments.
- Post-Production: Once recording is finalized, there's editing for clarity, adjustments to timing, and sometimes sound effects or background music are added.
AI Voiceover Process Blueprint:
- AI Training: A speech foundation model is trained on a vast amount of data to produce a wide range of vocal nuances, emotions, and accents. The training data consists of paired (text, speech) samples as well as text-only and speech-only data. We could potentially use a pre-trained large language model as a starting point (see, for example, AudioPaLM). A minimal training sketch follows this list.
- Script Input: Users input their script into the system. The AI can allow for annotations similar to traditional scripts, such as emphasis, whispering, pauses, etc.
- Voice Selection: Users can either select from preset voice profiles or adjust parameters to craft a unique voice. Parameters might include pitch, speed, accent, age, and more.
- Direction & Cues Interface: The system provides an interface for users to give "directions" to the AI, much like a director would. This could involve adjusting emotion intensity, emphasizing certain parts, or changing pacing. (A hypothetical API sketch appears at the end of this section.)
- AI Interpretation & Simulation: Upon receiving directions and the script, the AI simulates a voiceover session, producing multiple takes with slight variations for the user to choose from.
- Feedback Loop: Users can listen to the AI-generated voiceovers and provide feedback. The system iteratively refines the voiceover based on user input. Advanced systems could even have a "live direction mode" where users adjust parameters in real time while the AI "performs".
- Post-Production by AI: Once the user is satisfied with the voiceover, the AI can handle post-production. This involves cleaning up the audio, adjusting timing, and even adding sound effects or background music as directed by the user.
- Personalized Learning: Over time, the AI learns user preferences, making the direction process smoother and more intuitive with each subsequent session.
- Emotion & Context Analysis: Advanced models can be equipped to understand the broader context of the script. For instance, if the content relates to a sad scene in a movie, the AI might automatically imbue the voiceover with a somber tone, subject to user adjustments.
- Integration with Visuals: For applications like video games and movies, the AI system could integrate with the visual content, adjusting voiceovers based on visual cues, character actions, or scene dynamics.
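To make the AI Training step concrete, here is a minimal sketch of how one might mix paired and unpaired data under a single next-token objective, loosely in the spirit of AudioPaLM's shared text-plus-speech token vocabulary. Everything here (module sizes, vocabularies, loss weights) is a made-up illustration, not a description of any existing system.

```python
import torch
import torch.nn as nn

# Hypothetical skeleton: one shared backbone consumes both text tokens and
# discretized speech tokens (AudioPaLM-style). All sizes are made up.
TEXT_VOCAB, SPEECH_VOCAB, DIM = 32_000, 1_024, 512

class SpeechFoundationModel(nn.Module):
    def __init__(self):
        super().__init__()
        # One embedding table covering both the text and speech token ranges.
        self.embed = nn.Embedding(TEXT_VOCAB + SPEECH_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, TEXT_VOCAB + SPEECH_VOCAB)

    def forward(self, tokens):  # tokens: (batch, seq)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)

def next_token_loss(model, tokens):
    # Autoregressive objective: predict token t+1 from tokens up to t.
    logits = model(tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))

model = SpeechFoundationModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Dummy batches standing in for the three data streams mentioned above.
paired      = torch.randint(0, TEXT_VOCAB + SPEECH_VOCAB, (4, 64))
text_only   = torch.randint(0, TEXT_VOCAB, (4, 64))
speech_only = torch.randint(TEXT_VOCAB, TEXT_VOCAB + SPEECH_VOCAB, (4, 64))

for step in range(3):
    # Mix all three streams into one objective; the 0.5 weights are arbitrary.
    loss = (next_token_loss(model, paired)
            + 0.5 * next_token_loss(model, text_only)
            + 0.5 * next_token_loss(model, speech_only))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In a real system, the speech tokens would come from a neural codec or other quantized representation, and the backbone would be initialized from a pre-trained LLM rather than trained from scratch.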
While the AI blueprint builds upon the traditional process, it offers efficiencies, flexibility, and scalability that would be challenging in a traditional setting. As AI models become more sophisticated, the line between human and AI-generated voiceovers may blur, offering creators a wider range of tools to realize their visions.
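To make the user-facing steps concrete, here is a sketch of what a GVO product API might look like. It is purely hypothetical: the classes, parameters, and annotation tags below are our inventions for illustration, not an existing library.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VoiceProfile:          # hypothetical: knobs from the Voice Selection step
    pitch: float = 1.0       # relative pitch multiplier
    speed: float = 1.0       # speaking-rate multiplier
    accent: str = "general-american"
    age: int = 40

@dataclass
class Direction:             # hypothetical: the Direction & Cues interface
    emotion: str = "neutral"            # e.g. "somber", "resolute"
    intensity: float = 0.5              # 0.0 (flat) to 1.0 (theatrical)
    emphasis: List[str] = field(default_factory=list)  # words to stress

def generate_takes(script: str, voice: VoiceProfile,
                   direction: Direction, n_takes: int = 3) -> List[str]:
    """Stand-in for a call into a GVO model: returns several 'takes'
    with slight variations, like multiple takes in a studio session."""
    return [f"<take {i}: {direction.emotion}, intensity={direction.intensity}>"
            for i in range(n_takes)]

# Annotated script, analogous to a traditionally marked-up script.
script = "Mr. Gorbachev, [emphasis]tear down[/emphasis] this wall. [pause 800ms]"
takes = generate_takes(
    script,
    VoiceProfile(pitch=0.9, speed=0.95),
    Direction(emotion="resolute", intensity=0.8, emphasis=["tear", "down"]),
)
print(takes)  # the user auditions takes and iterates, closing the feedback loop
```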
GVO: Key Factors for Success
Can AI advance to support the features outlined above? To answer this question, let us look at some past lessons. Step-function progress in AI has historically been possible with a combination of three factors: compute, data, and algorithms. We anticipate that these factors will continue to play important roles in achieving GVO.
Compute
Progress in text-based generative AI such as (Chat)GPT and image-based generative AI such as DALL·E and Stable Diffusion was possible because of the advent of GPUs optimized for neural network training workloads, invariably working in tandem in large clusters (hundreds of GPUs for Stable Diffusion and thousands of GPUs for Llama, over a number of weeks). The result of such a large-scale training regime is neural networks with large numbers of parameters (1-3 billion in image models and 7-500 billion in text models). In contrast, today's TTS models are likely trained on much smaller GPU setups. It is thus reasonable to expect that as we scale up compute for GVO, there is a possibility of substantial performance improvement via much larger neural networks. (Current TTS models are estimated to have tens to low hundreds of millions of parameters.) The machines are on standby, ready for the next wave of GVO training runs; we just need to make the necessary advances in data and algorithms.
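A rough back-of-envelope calculation (ours, using the common 2-bytes-per-parameter figure for fp16 weights) shows the scale gap:

```python
# Back-of-envelope: fp16 weight memory for the model sizes mentioned above.
BYTES_PER_PARAM = 2  # fp16

for name, params in [("today's TTS (~100M params)", 100e6),
                     ("image models (~2B params)", 2e9),
                     ("large text models (~70B params)", 70e9)]:
    gb = params * BYTES_PER_PARAM / 1e9
    print(f"{name}: ~{gb:.1f} GB of weights")
# today's TTS (~100M params): ~0.2 GB of weights
# image models (~2B params): ~4.0 GB of weights
# large text models (~70B params): ~140.0 GB of weights
```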
Data
In text and image generative AI, we benefit from the availability of large quantities of data. Text models such as Llama train on trillions of tokens (a token is on average 3/4 of a word). Image models such as Stable Diffusion train on a few billion images (from the LAION dataset), many of which come with textual descriptions that play a crucial role in developing a text-to-image model.
When it comes to human speech, the situation is rather different. Current TTS systems typically train on relatively small datasets: we estimate no more than 50K hours of speech, only a small fraction of which has corresponding text (paired data). At a rate of roughly 10K spoken words per hour, that amounts to fewer than 500M spoken words, about 2,000 times less than what large language models train on. Production TTS systems likely train on much smaller paired datasets, trading quantity for quality. It is interesting to ask whether scaling up training data for speech could produce a step-function improvement. In other words, what would an emergent capability look like when we train speech models with much more data?
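The 2,000x figure follows from simple arithmetic, using the rough estimates above:

```python
# Speech side: hours of audio -> spoken words.
speech_hours = 50_000
words_per_hour = 10_000                 # rough speaking rate used above
spoken_words = speech_hours * words_per_hour    # 5e8 = 500M words

# Text side: LLM pre-training corpora, e.g. ~1.4T tokens reported for Llama 1.
llm_tokens = 1.4e12
words_per_token = 0.75                  # a token is ~3/4 of a word on average
llm_words = llm_tokens * words_per_token        # ~1.05e12 words

print(f"ratio: ~{llm_words / spoken_words:,.0f}x")  # ~2,100x, i.e. ~2,000x
```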
Algorithms
In the text and image modalities, the abundant availability of data, coupled with advances in algorithms (the autoregressive token formulation and the attention mechanism for text, thermodynamics-inspired diffusion algorithms for images), led to huge advances. In speech, while we could potentially increase the volume of data beyond the current ~50K-hour mark, we are still early in experimenting with and inventing GVO's equivalent of unsupervised learning techniques such as autoregression and diffusion (e.g., Google's AudioLM, Meta's Voicebox, ProDiff, Diff-TTS) or the hugely influential attention mechanism at the heart of transformers.
Intuitively, speech shares the sequential/autoregressive character of text and, via the spectrogram representation, the potential applicability of image diffusion techniques. In the past decade we have seen algorithmic ideas transcend modalities, and there is good reason to be optimistic that this will apply to the next wave of advances in speech as well. Make no mistake, human speech is infinitely rich: to achieve GVO, we need to harness and extend ideas and techniques from both the text and image modalities and invent speech-specific techniques.
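To illustrate the image-like view of speech, here is a minimal sketch (assuming librosa is installed and speech.wav is any short local recording) that turns a waveform into a log-mel spectrogram, the 2-D representation on which diffusion-style speech models such as ProDiff typically operate:

```python
import librosa

# Load a short speech clip (placeholder path) and compute its mel-spectrogram,
# the image-like representation that makes diffusion techniques applicable.
y, sr = librosa.load("speech.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80, hop_length=256)
log_mel = librosa.power_to_db(mel)   # shape: (80 mel bins, time frames)
print(log_mel.shape)                 # a 2-D "image" of the utterance
```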
Lastly, we highlight the important role of annotated prosodic data in achieving GVO. We are unlikely to achieve GVO without showing the model what intonation, stress, timing, breathing, etc. sound like. (We have no proof of this claim; we only point out that systems such as ChatGPT crucially train on high-quality labeled datasets via supervised fine-tuning and reinforcement learning from human feedback.) We expect any GVO-ambitious company would consider it strategic to continuously curate a high-quality annotated prosodic dataset and the associated harnesses (data pipelines, evaluation metrics, neural network architectural innovations, etc.) to maximize its chance of success.
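What might a record in such an annotated prosodic dataset look like? Below is one hypothetical schema (ours, for illustration only); a real effort would converge on its own label inventory and tooling:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical schema for one annotated utterance in a prosodic dataset.
@dataclass
class ProsodicAnnotation:
    audio_path: str
    transcript: str
    stressed_words: List[int] = field(default_factory=list)   # word indices
    pauses: List[Tuple[float, float]] = field(default_factory=list)  # (start_s, dur_s)
    pitch_contour: str = "flat"          # e.g. "rising", "falling"
    emotion: str = "neutral"             # coarse label, e.g. "somber"
    breaths: List[float] = field(default_factory=list)        # breath onsets (s)

example = ProsodicAnnotation(
    audio_path="clips/0001.wav",
    transcript="Mr. Gorbachev, tear down this wall.",
    stressed_words=[2, 3],               # "tear", "down"
    pauses=[(1.1, 0.4)],
    pitch_contour="falling",
    emotion="resolute",
)
```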
Conclusion
The current crop of TTS systems produces natural speech but is unable to color it with the emotion and nuanced expressiveness found in human-produced voiceover. Using photography as a metaphor, we are in the black-and-white age of AI-generated speech. The goal of GVO is to add color, unlocking a wide range of applications from narration to character voices. While we have powerful GPU clusters that can significantly scale up speech models, we still need to make significant advances in data and algorithms to train a Speech Foundation Model (SFM) that realizes GVO. We are optimistic about our ability to make significant progress on this road in the next several years.
Appendix: What is Prosody in Speech?
Prosody, in the context of speech, refers to the patterns of pitch, rhythm, intonation, stress, and timing that give spoken language its melodic and expressive qualities. It is the musicality of speech and the way in which various elements come together to convey meaning, emotion, and emphasis beyond the actual words being spoken.
Key components of prosody include:
- Pitch: The highness or lowness of a person's voice while speaking. Pitch variations can indicate questions (rising pitch at the end of a sentence), exclamations (sudden rise in pitch), or emphasis on certain words.
- Rhythm: The pattern of stressed and unstressed syllables in speech. It helps create a natural flow and cadence, making speech more engaging and easier to understand.
- Intonation: The variation in pitch and tone across a sentence or phrase. It can convey different emotions, attitudes, or intentions, such as excitement, surprise, sarcasm, or uncertainty.
- Stress: The emphasis placed on certain words or syllables within a sentence. By emphasizing particular words, speakers can alter the meaning or intention of their message.
- Timing: The pace and speed at which speech is delivered. Pauses between phrases or words can add meaning, allow for turn-taking in conversations, or give listeners time to process information.
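Several of these components can be estimated directly from audio. Here is a minimal sketch (assuming librosa is installed and speech.wav is a local recording) that extracts a pitch contour, frame energy as a crude proxy for stress, and pause timing:

```python
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=22050)

# Pitch: fundamental-frequency contour via the pYIN algorithm.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"))

# Stress proxy: frame-level energy (RMS); louder frames often carry stress.
rms = librosa.feature.rms(y=y)[0]

# Timing: low-energy stretches approximate pauses.
frame_s = 512 / sr                       # default hop length, in seconds
is_pause = rms < 0.1 * rms.max()
print(f"median pitch: {np.nanmedian(f0):.0f} Hz, "
      f"pause time: {is_pause.sum() * frame_s:.2f} s")
```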
Prosody plays a crucial role in effective communication as it helps convey nuances of meaning and emotions that might not be evident from the words alone. For instance, consider how the simple sentence "I didn't say you were stupid" can have different meanings depending on the speaker's prosody:
- "
I
didn't say you were stupid." (Someone else said it, not me.) - "I
didn't
say
you were stupid." (I implied it, but I didn't say it explicitly.) - "I didn't say
you
were stupid." (I said someone else was stupid, not you.)
In each case, the intended meaning changes due to the prosody used. Prosody is an essential aspect of natural and expressive speech, contributing to effective communication and better understanding between individuals.
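Today's TTS engines expose only a thin slice of this control. The closest standard tool is SSML, the W3C markup accepted by many TTS APIs; the snippet below builds SSML strings for the three readings (how convincingly an engine renders the emphasis varies widely):

```python
# Build SSML for the three readings; <emphasis> is standard W3C SSML.
SENTENCE = ["I", "didn't", "say", "you", "were", "stupid."]

def ssml_with_emphasis(words, stressed_index):
    parts = [f'<emphasis level="strong">{w}</emphasis>'
             if i == stressed_index else w
             for i, w in enumerate(words)]
    return "<speak>" + " ".join(parts) + "</speak>"

for idx in (0, 2, 3):  # stress "I", "say", "you" respectively
    print(ssml_with_emphasis(SENTENCE, idx))
```

Closing the gap between this word-level markup and full directorial control over prosody is exactly the challenge GVO sets out to solve.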