The Untold Story of Tacotron & The Replication Crisis

Tacotron (/täkōˌträn/): An end-to-end speech synthesis system by Google

In 2018, Tacotron-2 catalyzed the resurgence of the Text-to-Speech (TTS) industry. It stunned the world when Google commercialized it as Google Duplex, fueling a new wave of TTS companies like PlayHT, Resemble, Sonantic, Voicery, and WellSaid. Tacotron-2 showed that it was possible for the first time to synthesize truly life-like speech!

The publication made unprecedented claims, reporting a Mean Opinion Score (MOS) for naturalness that was statistically comparable to that of ground truth audio. This indicated that listeners graded real voice-over clips and synthetic clips as having the same level of naturalness—a stunning result. Furthermore, the authors published audio clips demonstrating the model's remarkable ability in pronunciation, intonation, and prosody.

The paper also popularized the "Real or Fake?" quiz format in generative audio research. To be fair, the authors earned it; the results were incredible. But for the next few years, onlookers constantly nagged me (much to my annoyance) to present my own research as a quiz, a trend I could never seem to shake.

Yet, for all its influence, Tacotron-2 remains one of the most misunderstood publications in TTS history. While groundbreaking, its results were never successfully replicated. Independent researchers failed to achieve MOS scores anywhere near the original claims.

I have included a few of these results below; notice the low MOS scores that papers from Baidu, Microsoft, and NVIDIA reported for Tacotron-2, a full point lower than Google's own figure in some cases.

| Lab | Paper | Reported MOS (naturalness) |
| --- | --- | --- |
| Baidu Research | Deep Voice 3 (Ping et al., 2018) | 3.78 ± 0.34 |
| Microsoft Research | FastSpeech (Ren et al., 2019) | 3.86 ± 0.09 |
| NVIDIA | Flowtron (Valle et al., 2020) | 3.52 ± 0.17 |
| Google | Tacotron 2 (Shen et al., 2018) | 4.52 ± 0.066 |
|  | Recorded Audio (reference) | 4.6 ± 0.1 |

To appreciate the significance of a single point, it helps to understand what the Mean Opinion Score (MOS) scale for naturalness actually spans. A score of 3.5 typically represents a voice that is "good but unmistakably synthetic," on par with the popular parametric TTS systems of the 2000s. A 4.5, in contrast, is "near-human and pleasant for extended listening," a monumental leap in quality.

Crucially, this means independent labs were reporting that this seminal model performed no better than the older TTS systems of the previous decade! Insane.

This failure to replicate dragged on, culminating in NVIDIA reporting one of the lowest scores Tacotron-2 had ever received in 2020, almost two full years after the original publication.

Eventually, the industry moved on to other models to achieve that milestone anew. Yet, Google continued to publish follow-ups to Tacotron-2 that reconfirmed its original claims. Google Duplex, the commercialization of Tacotron-2, leaves no doubt that Google did, in fact, reach that milestone.

But you have to wonder: why was it so difficult for other labs to replicate one of the biggest papers of the decade?

We definitely aren’t holding “tricks” back. We’ve always aimed for our work to be reproducible (and are excited to see so many implementations on GitHub) ... In our experience, Tacotron works well on both lower- and high-quality datasets.
— Tacotron-2 Authors

This is where my own story begins.

In 2018, I was a recent graduate from the University of Washington. PyTorch was not even two years old and still in beta (v0.3.1), and the industry was hotly debating whether TensorFlow 2.0 would crush whatever momentum it had. Meanwhile, I was working at AI2, where NLP researchers were in a panic as Transformer-based models overtook their work on ELMo.

Though I had no prior research experience—or commercial experience, for that matter—I was convinced of the value Tacotron-2 had unlocked. I was determined to replicate its results as part of founding WellSaid.

Naturally — or perhaps unnaturally — my first step was simply to reach out to the authors. Looking back, given the sheer number of failed reproduction attempts across the industry, I have to assume I was one of the few people who actually bothered to talk to them.

My outreach became something of a running joke at the AI2 incubator where I worked. The idea that I simply reached out (and, later, that I pointed out a bug in their seminal paper) struck my peers as hilarious. I laughed along, but I didn't see it as bold or novel; I just wanted to do my best to replicate the results.

I wasn't sure they would even respond. I had cold-emailed researchers before, with mixed success; many had little patience for my questions. But, having interned at Google, I remembered how heavily the internal culture relied on GChat. I figured it was my best shot.

To my surprise, they responded.

But even with their help, as I diligently translated their paragraphs into functions over the next few months, it became clear that the paper alone was not sufficient. I ran into countless challenges. To give you some perspective on what I was up against, here are just a few of the "hidden" hurdles:

  • Dropout Regularization: The paper applies dropout to its convolutional layers, but I learned the authors did not actually apply it to the Post-Net, despite what the text implies. Including it there over-regularized the model.

  • Attention Mechanism: The description of Location Sensitive Attention was ambiguous. The crucial missing detail? The first LSTM layer must receive the previous attention context as input to predict the next one. Without this, the model doesn't work at all.

  • Zoneout Layer: The paper regularizes its LSTM layers with zoneout; however, to my knowledge, there was no performant open-source implementation of it at the time, and I had to find an alternative (a minimal sketch of the technique follows this list).

  • Post-Processing: The paper did not discuss how to process Tacotron-2's output to make it compatible with the vocoder, WaveNet. The solution, as I discovered, was unintuitive yet simple: upsample the output by another 4x, then repeat each value 75 times to reach the required 300x scale-up (see the sketch after this list). I never would have guessed that.

  • Mel Filterbank Parameters: The paper’s signal pipeline parameters — 80 channels and a 7.6 kHz max frequency — were actually a mistake. The intended configuration was 128 channels up to 12 kHz; otherwise, the model loses its high-end frequencies.
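
Two of these are easier to show than to describe. The first is zoneout (Krueger et al., 2016): during training, each recurrent unit keeps its previous value with some probability instead of updating, and at inference you take the deterministic expectation of that mix. Below is a minimal PyTorch sketch of the technique, not the authors' code or the exact alternative I ended up using; the class name and the default probability are mine.

```python
import torch
import torch.nn as nn


class ZoneoutLSTMCell(nn.Module):
    """Minimal zoneout wrapper around an LSTMCell (illustrative only)."""

    def __init__(self, input_size: int, hidden_size: int, p: float = 0.1):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.p = p  # probability of keeping the previous state

    def forward(self, x, state):
        h_prev, c_prev = state
        h_new, c_new = self.cell(x, (h_prev, c_prev))
        if self.training:
            # Randomly keep the previous hidden/cell values.
            keep_h = torch.bernoulli(torch.full_like(h_new, self.p))
            keep_c = torch.bernoulli(torch.full_like(c_new, self.p))
        else:
            # At inference, use the expected value of the stochastic mix.
            keep_h = torch.full_like(h_new, self.p)
            keep_c = torch.full_like(c_new, self.p)
        h = keep_h * h_prev + (1.0 - keep_h) * h_new
        c = keep_c * c_prev + (1.0 - keep_c) * c_new
        return h, c
```

The second is the post-processing step. With Tacotron-2's 12.5 ms hop and 24 kHz audio, each decoder frame has to cover 300 samples before it can condition WaveNet. The sketch below stands in for that step under my own assumptions: the 4x stretch is shown as plain interpolation (in practice it could just as well be a learned upsampling layer), followed by holding each value for 75 samples, since 4 × 75 = 300.

```python
import torch
import torch.nn.functional as F


def frames_to_samples(mel: torch.Tensor) -> torch.Tensor:
    """Stretch decoder output (batch, mel_channels, frames) to sample rate."""
    # 4x stretch of the time axis.
    up4 = F.interpolate(mel, scale_factor=4.0, mode="linear", align_corners=False)
    # Hold each value for 75 samples: 4 * 75 = 300 samples per frame.
    return torch.repeat_interleave(up4, repeats=75, dim=-1)
```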

These clarifications were hard-fought. I did not have open access to the authors, and I did not want to disrupt them more than necessary. So I implemented things as-is, experimented, and only when I felt crazy enough, and had isolated a specific detail, did I reach out for help. This process took months.

As a matter of fact, I only thought to ask about the 80 channels because, in one of my many run-throughs, I noticed how peculiar that number was. I had been taught at the University of Washington that no self-respecting computer scientist would pick a number that wasn't a power of 2. Curiously, that "80-mel channel" typo was repeated in countless papers for at least the next half-decade. It was both entertaining and troubling to witness (a short sketch of the two filterbank configurations follows the table below):

| Paper | Quote |
| --- | --- |
| FastSpeech (Ren et al., 2019) | “…the output linear layer converts the 384-dim hidden into 80-dimensional mel-spectrogram.” |
| Deep Voice 3 (Ping et al., 2018) | “We separately train a WaveNet vocoder treating mel-scale log-magnitude spectrograms … depth = 80.” |
| Transfer Learning TTS (Jia et al., 2018) | “Target spectrogram features … passed through an 80-channel mel-scale filterbank.” |
| AlignTTS (Zeng et al., 2020) | “For computational efficiency, we split 80-channel mel-spectrogram frames …” |
| Glow-TTS (Kim et al., 2020) | “We adopt the Tacotron-2 configuration with an 80-channel mel-spectrogram target.” |
| Taco-LPCNet (Gong et al., 2021) | “…the 80-channel mel-spectrogram outputs from Tacotron-2 are directly fed to LPCNet.” |
| ReFlow-TTS (Yang et al., 2023) | “We extract the 80-bin mel-spectrogram with frame = 1024, hop = 256.” |
| SR-TTS (Feng et al., 2023) | “…the decoder transfers the feature sequences into an 80-channel mel-spectrogram.” |
| Multi-Scale Spectrogram TTS (Amazon Labs, 2021) | “We extracted 80 band mel-spectrograms with a frame shift of 12.5 ms.” |
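
For concreteness, here is roughly what the published and intended configurations look like side by side. This is only a sketch with librosa; the sample rate matches Tacotron-2's 24 kHz audio, but the FFT size is an illustrative choice of mine rather than a value from the paper or the correspondence.

```python
import librosa

# As published: 80 mel channels with a 7.6 kHz ceiling.
published = librosa.filters.mel(sr=24000, n_fft=2048, n_mels=80, fmax=7600)

# As intended, per my correspondence with the authors: 128 channels up to 12 kHz.
intended = librosa.filters.mel(sr=24000, n_fft=2048, n_mels=128, fmax=12000)

print(published.shape)  # (80, 1025)
print(intended.shape)   # (128, 1025)
```

Capping the filterbank at 7.6 kHz discards everything between there and the 12 kHz Nyquist limit of 24 kHz audio, which is exactly the missing high end described above.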

Even with their direct help, I needed more. I scoured the web for insights, finding GitHub threads with hundreds of comments from people wrestling with the same problems I was. For months, researchers had been arguing over experiments and trying to assemble the pieces. Yet, despite the massive amount of public discussion, a working replication remained elusive.

In particular, I learned that beyond the paper's key details, there are "tricks of the trade" you simply need to know. My mentor explained that these details are often considered "too uninteresting" to publish, yet they are crucial for results. Here is one such detail; without it, the model simply wouldn't work:

  • Silence Handling: While not mentioned in the paper, the model fails to train if there is excess silence at the start or end of a clip, or even long pauses within it. This is deceptively hard to debug because neither the model nor the data is "broken" in the traditional sense, yet the impact on the attention layer and the stop token is catastrophic. (A minimal trimming sketch follows.)
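
Here is a minimal sketch of the kind of leading- and trailing-silence trimming this calls for, using librosa. The 40 dB threshold is a hypothetical starting point rather than a tuned value, and long pauses inside a clip still need separate handling (librosa.effects.split is one option).

```python
import librosa


def load_trimmed(path: str, sample_rate: int = 24000, top_db: int = 40):
    """Load a clip and strip leading/trailing silence (illustrative sketch)."""
    audio, _ = librosa.load(path, sr=sample_rate)
    # Anything quieter than `top_db` below the clip's peak counts as silence.
    trimmed, _ = librosa.effects.trim(audio, top_db=top_db)
    return trimmed
```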

While the authors were well-intentioned—we even video chatted in 2018—it’s evident that papers aren't truly intended to be reproducible. If we think they are, we are kidding ourselves. Papers are summaries and highlights, often leaving out important but "uninteresting" details. Much like code documentation, they are often outdated compared to the actual implementation.

I actually had a fight about this very matter with my mentor a couple of years ago. I desperately wanted to share these “uninteresting” details with the research community as they were crucial to my results, but I was advised against it. I knew other researchers were making the same mistakes over and over, and I wanted to help get everyone on the right path—yet I wasn't allowed to speak plainly about it. This ultimately led me to crash out of research; I just couldn’t agree with that philosophy. It’s partly how I found myself here, starting a company instead of going after a PhD.

All told, this reality leaves the community in a bind, forcing researchers to spend years piecing together the missing links.

But fortunately, thanks to the opportunity to discuss this with the original authors, I was able to replicate Tacotron-2 and confirm the validity of the work in early 2019. Please take a listen for yourself—it sounded like our model had finally achieved near-human results. By the middle of 2019, we were reporting MOS numbers comparable to the original paper, and nowhere near the lows that others were reporting.

We often talk about the "replication crisis" in machine learning in the abstract. But for me, it was a wall I hit immediately upon entering the field. I watched the charade play out twice over, across multiple years: labs reporting wildly different metrics for the same architecture, and researchers copying the same mistaken parameters for five years straight.

This replication crisis, to me, isn’t a mystery; it’s a people-and-systems problem.

Researchers are attempting to summarize complex codebases and highlight relevant findings in only a few dense pages. Without the source code (and, to be fair, even with it), the community is left to reverse-engineer the results. Then, even in the best of circumstances, a single ambiguity in one paper can lead hundreds of researchers down the wrong path for half a decade. The amount of wasted effort would be hilarious if it weren't so sad.

It leaves me wondering if there is a better way to organize publications. What if research papers were living documents and facilitated reproduction?

We could benefit from a layer where the community can add comments, corrections, and implementation details—a "GitHub for Papers," if you will—where we can collaborate on the replication process. Because right now, once a paper is published, the conversation stops. And as my story proves, that is exactly when the real work begins.