The Untold Story of Tacotron

In 2018, Tacotron-2 emerged as a catalyst for the resurgence of the Text-to-Speech (TTS) industry. It stunned the world when Google commercialized it as Google Duplex and fueled a new wave of TTS companies like PlayHT, Resemble, Sonantic, and WellSaid. For the first time, Tacotron-2 showed that it was possible to synthesize truly life-like speech!

The publication made unprecedented claims, reporting a Mean Opinion Score (MOS) for naturalness that was statistically comparable to that of ground truth audio. The authors published audio clips demonstrating the model's remarkable ability in pronunciation, intonation, and prosody. They also kicked off what would become an annoying trend: quizzes that challenged people to guess whether a clip was synthetic or not.

Yet, for all its influence, Tacotron-2 is one of the most misunderstood publications in TTS history. While the paper was groundbreaking, its results were never successfully replicated by independent researchers, who were unable to achieve a MOS anywhere close to the same level.

Lab                   Paper                               MOS Score
Baidu Research        Deep Voice 3 (Ping et al., 2018)    3.78 ± 0.34
Microsoft Research    FastSpeech (Ren et al., 2019)       3.86 ± 0.09
NVIDIA                Flowtron (Valle et al., 2020)       3.52 ± 0.17
Google                Tacotron 2 (Shen et al., 2018)      4.52 ± 0.066
                      Recorded Audio (reference)          4.6 ± 0.1

To appreciate what this data shows, it's important to understand how much ground separates these MOS values. A score of 3.5 represents a voice that is "good but unmistakably synthetic", on par with the popular parametric TTS systems of the 2000s. A 4.5, in contrast, is "near-human and pleasant for extended listening", marking a monumental leap in quality. This context is critical: the data shows that independent labs could not demonstrate performance any better than those older, unmistakably synthetic systems.

After many years of trying, the industry moved on to other models that achieved the milestone anew. For its part, Google continued to publish follow-ups to Tacotron-2 that reconfirmed the original claims. Google's subsequent commercialization of the technology leaves no doubt that they did, in fact, reach that milestone. However, it's evident that something was missing from the original paper.

In our experience, Tacotron works well on both lower- and high-quality datasets ... We definitely aren’t holding “tricks” back. We’ve always aimed for our work to be reproducible (and are excited to see so many implementations on GitHub).
— Tacotron-2 Authors

This is where my own story begins. In 2018, I was a recent graduate from the University of Washington. Though I had no prior research experience, I was convinced of the value Tacotron-2 had unlocked and became determined to replicate its results as part of founding WellSaid. My first step was to reach out to the paper’s authors to seek clarity.

I wasn’t sure if the authors would respond. Having interned at Google, I remembered they used GChat for internal communications and figured it was my best shot. To my surprise, it worked.

This kicked off a collaboration that spanned a couple of months. As I tried to translate paragraphs into functions, it quickly became clear that the paper omitted both critical and non-critical details. Working with the authors, I began to amass a list of implementation details missing from the original publication:

  • Learning Rate Decay: The paper states when decay starts, but not when it ends (at 100k steps). Without this, the model overfits. The authors noted this was a dataset-dependent hyperparameter.

  • Dropout Regularization: The paper applies dropout to the convolutional layers, but I learned the authors did not apply it to the Post-Net. Including it would over-regularize the model.

  • Mel Filterbank: The paper’s parameters (80 channels, 7.6 kHz max frequency) were a mistake. The intended configuration was 128 channels up to 12 kHz; otherwise, the model loses its high-end frequencies.

  • Attention Mechanism: The paper’s description of its Location Sensitive Attention was ambiguous. The crucial missing detail? The first LSTM layer must receive the previous attention context as input to predict the next one. Without this, the model doesn't work at all.

  • WaveNet Input: It was unclear how to feed Tacotron-2's output into the WaveNet vocoder. The solution was to upsample the output by another 4x, then repeat each value 75 times to achieve the required 300x scale-up (sketched just after this list).
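
To make that last point concrete, here is a minimal sketch of the scale-up as I understood it, in plain NumPy. The nearest-neighbor 4x step stands in for whatever upsampling layer the real system used, and the frame and channel counts are placeholders, not values from the paper.

```python
import numpy as np

def upsample_conditioning(mel: np.ndarray) -> np.ndarray:
    """Expand [frames, channels] to [frames * 300, channels] for WaveNet conditioning."""
    upsampled = np.repeat(mel, 4, axis=0)        # first, the 4x scale-up (nearest-neighbor stand-in)
    repeated = np.repeat(upsampled, 75, axis=0)  # then repeat each value 75 times
    return repeated                              # 4 * 75 = 300 audio samples per spectrogram frame

mel = np.random.randn(10, 128)                   # 10 frames, 128 mel channels (placeholder values)
print(upsample_conditioning(mel).shape)          # (3000, 128)
```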

While working with the authors, I also reviewed other open-source implementations of Tacotron-2 to gain a deeper understanding of what was missing.

I discovered important tips and tricks. For example, it's essential to trim the silence from your audio clips; otherwise the model destabilizes because it never learns when to stop.
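
The fix itself is nearly a one-liner. Here's a minimal sketch using librosa's silence trimming; the 60 dB threshold and 24 kHz sample rate are my assumptions for illustration, not values from the paper or the authors.

```python
import librosa

# Load a clip and strip the leading/trailing silence so the model can learn where speech ends.
audio, sample_rate = librosa.load("clip.wav", sr=24000)  # path and sample rate are placeholders
trimmed, _ = librosa.effects.trim(audio, top_db=60)      # anything 60 dB below peak counts as silence
print(len(audio), len(trimmed))
```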

I also discovered inherent differences between TensorFlow and librosa that would have caused spectrograms to be computed differently.
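
Two of those differences are easy to demonstrate. The sketch below compares the default framing and mel-filterbank behavior of the two libraries; the 1024/256 STFT parameters and the 24 kHz sample rate are assumptions for illustration, and the 128-channel / 12 kHz filterbank follows the corrected configuration from the list above.

```python
import librosa
import numpy as np
import tensorflow as tf

audio = np.random.randn(24000).astype(np.float32)  # one second of placeholder audio

# 1) Framing: librosa reflect-pads ("centers") the signal by default, while
#    tf.signal.stft does not, so the two produce different frame counts.
stft_librosa = librosa.stft(audio, n_fft=1024, hop_length=256, center=True)
stft_tf = tf.signal.stft(audio, frame_length=1024, frame_step=256, fft_length=1024)
print(stft_librosa.shape[1], stft_tf.shape[0])  # e.g. 94 vs. 90 frames

# 2) Mel filterbank: librosa defaults to the Slaney mel scale with area
#    normalization, while TensorFlow uses the HTK mel scale with no normalization.
mel_librosa = librosa.filters.mel(sr=24000, n_fft=1024, n_mels=128, fmax=12000)
mel_tf = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins=128, num_spectrogram_bins=513,
    sample_rate=24000, lower_edge_hertz=0.0, upper_edge_hertz=12000.0)
print(mel_librosa.shape, mel_tf.shape)  # (128, 513) vs. (513, 128): note the transpose
```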

Despite many other companies struggling to replicate Tacotron-2, on March 6, 2019, it sure sounded like our model had achieved near-human results. Please take a listen for yourself.

Tacotron (/täkōˌträn/): An end-to-end speech synthesis system by Google

As an aside, knowing that 80 mel channels was a mistake, I found it entertaining to watch this saga unfold over the years as paper after paper adopted the number without question:

  • FastSpeech (Ren et al., 2019): “…the output linear layer converts the 384-dim hidden into 80-dimensional mel-spectrogram.”

  • Deep Voice 3 (Ping et al., 2018): “We separately train a WaveNet vocoder treating mel-scale log-magnitude spectrograms … depth = 80.”

  • Transfer Learning TTS (Jia et al., 2018): “Target spectrogram features … passed through an 80-channel mel-scale filterbank.”

  • AlignTTS (Zeng et al., 2020): “For computational efficiency, we split 80-channel mel-spectrogram frames …”

  • Glow-TTS (Kim et al., 2020): “We adopt the Tacotron-2 configuration with an 80-channel mel-spectrogram target.”

  • Taco-LPCNet (Gong et al., 2021): “…the 80-channel mel-spectrogram outputs from Tacotron-2 are directly fed to LPCNet.”

  • ReFlow-TTS (Yang et al., 2023): “We extract the 80-bin mel-spectrogram with frame = 1024, hop = 256.”

  • SR-TTS (Feng et al., 2023): “…the decoder transfers the feature sequences into an 80-channel mel-spectrogram.”

  • Multi-Scale Spectrogram TTS (Amazon Labs, 2021): “We extracted 80 band mel-spectrograms with a frame shift of 12.5 ms.”