Modern neural TTS

· 12 min read
Vadim Nicolai
Senior Software Engineer

Modern neural TTS uses deep learning models (like Tacotron or FastSpeech) to generate natural-sounding speech directly from text, replacing older concatenative or parametric approaches. It produces more expressive, human-like voices with lower latency and better prosody control.

The biggest surprise in modern TTS isn't WaveNet's quality. It's that FastSpeech's 270× speedup in mel-spectrogram generation (Ren et al., 2019) laid bare a hard truth: autoregressive decoders were the bottleneck all along. Once you remove them, the whole field accelerates. Then VALL-E (Wang et al., 2023) raises the stakes again by training on 60,000 hours of speech, only for a latent diffusion model trained on under 1,000 hours to claim better intelligibility (Sample-Efficient Diffusion, Lovelace et al., 2024). That tension (speed vs. data, parallel vs. flexible) is the real story. This post distills the architectures, the numbers, and the trade-offs the literature has nailed down.

The Great Acceleration: Why Autoregression Had to Go

FastSpeech pipeline: Text → Duration Predictor → Length Regulator → Transformer Decoder → Mel & Vocoder. Each stage processes the whole sequence at once, with no autoregressive loop.

Before 2019, neural TTS was dominated by autoregressive sequence-to-sequence models like Tacotron. They produced great prosody but were slow: generation was sequential, one frame at a time. Then FastSpeech: Fast, Robust and Controllable Text to Speech (Ren et al., 2019) dropped the autoregressive decoder entirely, replacing it with a feed-forward Transformer plus a length regulator and duration predictor. The result: a 270× speedup for mel-spectrogram generation and a 38× end-to-end speedup over the autoregressive Transformer baseline. Word skipping and repetition, the bane of autoregressive models, virtually disappeared. The length regulator is what makes this possible: it expands each phoneme's hidden state to cover its predicted number of frames, so the decoder never has to learn the alignment on its own.

The headline 270× applies to mel-spectrogram generation alone; with a neural vocoder in the loop, the end-to-end figure is 38×. That's still the difference between a chatbot that makes you wait and one that feels instant. The trade-off is that the duration predictor is distilled from the teacher model and inherits its alignment errors: if the teacher has weird pauses, FastSpeech will too.
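To make the length regulator concrete, here is a minimal NumPy sketch of the idea described above: repeat each phoneme's hidden state for its predicted number of frames. The function name, the array shapes, and the `alpha` speed factor are my own illustrative choices, not code from the paper.

```python
import numpy as np


def length_regulate(phoneme_states: np.ndarray,
                    durations: np.ndarray,
                    alpha: float = 1.0) -> np.ndarray:
    """Expand phoneme-level states to frame level.

    phoneme_states: (num_phonemes, hidden_dim) encoder outputs.
    durations:      (num_phonemes,) predicted frames per phoneme.
    alpha:          hypothetical speed factor; values above 1 lengthen speech.
    """
    # Scale, round to whole frames, and never allow negative repeat counts.
    scaled = np.maximum(np.rint(durations * alpha).astype(int), 0)
    # Each row is repeated scaled[i] times; no sequential loop is needed.
    return np.repeat(phoneme_states, scaled, axis=0)


states = np.arange(6, dtype=np.float32).reshape(3, 2)  # 3 phonemes, dim 2
frames = length_regulate(states, np.array([2, 1, 3]))
print(frames.shape)  # (6, 2)
```

Because the expansion is a single vectorised operation, the decoder receives a sequence that already has the target length, which is why every frame can be generated in parallel.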

Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis (Weiss et al., 2020, 106 cites) took a different angle: keep autoregression for block-level dependencies, but remove spectrograms and model the waveform directly with normalizing flows. It achieved quality "approaching a state-of-the-art cascade TTS system" with significantly faster generation. But "approaching" is not "matching", and the inter-block autoregression still limits throughput, which is why Wave-Tacotron reads as a research curiosity rather than a deployment candidate.

FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis (Bak et al., 2021, 19 cites) further refined the parallel paradigm by splitting pitch and spectral features into separate branches based on source-filter theory. This prevents audio-quality degradation under large pitch shifts, which is critical for voice-changing applications. The decomposition is principled, but it may also limit the model's ability to learn joint pitch-spectrum representations. For standard TTS, FastSpeech 2 remains the better default.

Diffusion's Quiet Revolution: Less Data, More Control

While FastSpeech was optimising for speed, diffusion models were quietly solving the data-efficiency problem. Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance (Kim et al., 2021, 136 cites) combined an unconditional diffusion model with a separately trained phoneme classifier, using norm-based scaling to reduce pronunciation errors. It achieved performance comparable to Grad-TTS without any transcript for the target speaker. The catch? You need a phoneme classifier trained on large-scale ASR data — an extra dependency that can be fragile.
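As a rough illustration of classifier guidance with norm-based scaling, the sketch below rescales the classifier gradient so its norm tracks the norm of the unconditional score before adding it. This is my reading of the idea, not the paper's code; `guided_score`, `scale`, and the toy vectors are hypothetical.

```python
import numpy as np


def guided_score(uncond_score: np.ndarray,
                 classifier_grad: np.ndarray,
                 scale: float = 1.0,
                 eps: float = 1e-8) -> np.ndarray:
    """Combine an unconditional score with a classifier gradient.

    The gradient is rescaled so that its norm equals `scale` times the norm
    of the unconditional score (an assumed form of norm-based scaling).
    """
    ratio = np.linalg.norm(uncond_score) / (np.linalg.norm(classifier_grad) + eps)
    return uncond_score + scale * ratio * classifier_grad


score = np.array([3.0, 4.0])   # toy unconditional score, norm 5
grad = np.array([0.0, 0.1])    # toy phoneme-classifier gradient, norm 0.1
guided = guided_score(score, grad, scale=1.0)
```

The point of the rescaling is that a weak or badly calibrated classifier gradient cannot be drowned out by, or overwhelm, the unconditional score; `scale` then becomes the single knob for how strongly the text steers generation.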

Fast Grad-TTS: Towards Efficient Diffusion-Based Speech Generation on CPU (Vovk et al., 2022, 9 cites) attacked the speed problem of diffusion. By applying progressive distillation, GAN-based diffusion modelling, and latent-space score modelling, the authors achieved up to a 4.5× speedup over vanilla Grad-TTS and a real-time factor of 0.15 on CPU, meaning roughly 0.15 seconds of compute per second of generated audio. That's fast enough to run a voice assistant on a laptop without a GPU.

The trade-off is quality. The paper doesn't quantify the cost in MOS terms, but distilled diffusion models generally trade some high-frequency fidelity for speed, which is acceptable for most interactive use cases but not for studio production.
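If you want to check a real-time factor on your own hardware, the measurement is simple: wall-clock synthesis time divided by the duration of the audio produced. The sketch below assumes a `synthesize(text)` callable that returns a sequence of samples; `fake_synthesize` is a hypothetical stand-in for a real model.

```python
import time
from typing import Callable, Sequence


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means audio is generated faster than it plays back."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize: Callable[[str], Sequence[float]],
                text: str,
                sample_rate: int) -> float:
    """Time one synthesis call and convert it to a real-time factor."""
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform) / sample_rate)


def fake_synthesize(text: str) -> list:
    # Hypothetical model: always returns one second of silence at 22.05 kHz.
    return [0.0] * 22050


rtf = measure_rtf(fake_synthesize, "hello world", sample_rate=22050)
```

In practice you would average over many utterances of varying length, since a single short call is dominated by warm-up cost.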

The real eye-opener is Sample-Efficient Diffusion for Text-To-Speech Synthesis (Lovelace et al., 2024, 5 cites). This latent diffusion model uses a U-Audio Transformer (U-AT) operating on compressed latents from a pre-trained audio autoencoder. The key number: it was trained on less than 1,000 hours of speech, under 2% of VALL-E's 60,000 hours, and still synthesises "more intelligible speech than VALL-E". That's a direct challenge to the "scale is all you need" dogma. The ablation confirms that the U-AT architecture efficiently scales to long sequences, and latent diffusion dramatically reduces data requirements. If you're building a TTS system for a low-resource language or a niche domain, this is the paper to build on.

The Zero-Shot Wars: Codec LM vs. Diffusion

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E, Wang et al., 2023, 1,191 cites) kicked off a paradigm shift: treat TTS as conditional language modelling over discrete codes from a neural audio codec. With 60,000 hours of training data and a 3-second enrolled prompt, VALL-E performs in-context learning: it can mimic the speaker's emotion and acoustic environment from a single short sample. The ablation shows that scaling data is the enabler; smaller models fail at zero-shot generalisation.

However, inference is autoregressive over discrete tokens, which can be slow. Moreover, the data requirement is prohibitive for most teams: VALL-E-style architectures trained on a fraction of that corpus produce garbled output, which is precisely why the diffusion alternative below matters.
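To see why autoregressive decoding over codec tokens is slow, consider this toy sampling loop. The "model" here is a hypothetical random-logits function, not VALL-E; what matters is the structure, where each new token needs one more model call that depends on everything generated so far.

```python
import numpy as np


def toy_next_token_logits(prefix, vocab_size, rng):
    # Hypothetical stand-in for a codec language model forward pass.
    return rng.normal(size=vocab_size)


def sample_codec_tokens(prompt_tokens, num_new, vocab_size=1024,
                        temperature=1.0, seed=0):
    """Extend a prompt of discrete codec tokens one token at a time."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(num_new):  # num_new sequential steps, no parallelism
        logits = toy_next_token_logits(tokens, vocab_size, rng) / temperature
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        tokens.append(int(rng.choice(vocab_size, p=probs)))
    return tokens


prompt = [17, 4, 256]  # stands in for the encoded speaker prompt
generated = sample_codec_tokens(prompt, num_new=5)
```

The number of forward passes grows linearly with the length of the output, which is exactly the cost FastSpeech removed for mel-spectrograms.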

DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation (Jia et al., 2025, 53 cites) tries to have it both ways. It uses a divide-and-conquer strategy: a language model processes aggregated patch embeddings, and a diffusion transformer generates the next patch. The result is state-of-the-art in zero-shot speech generation for robustness, speaker similarity, and naturalness. The ablation shows that temperature, defined as the noise introduction time in the reverse ODE, controls the diversity-determinism trade-off. The computational load is high, but patching reduces it. For a production system that needs both quality and controllability, DiTAR is promising.
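The effect of patching on sequence length can be sketched in a few lines. I use mean pooling as a simple stand-in for the aggregation step; the paper's actual aggregator is a learned component, and `aggregate_patches` and `patch_size` are hypothetical names.

```python
import numpy as np


def aggregate_patches(frames: np.ndarray, patch_size: int) -> np.ndarray:
    """Group consecutive frames into patches and pool each patch.

    frames: (num_frames, dim) continuous speech features.
    Returns (ceil(num_frames / patch_size), dim) patch embeddings.
    """
    num_frames, dim = frames.shape
    pad = (-num_frames) % patch_size
    if pad:
        # Repeat the last frame so the final patch is complete.
        frames = np.pad(frames, ((0, pad), (0, 0)), mode="edge")
    patches = frames.reshape(-1, patch_size, dim)
    # Mean pooling is an assumption made for illustration only.
    return patches.mean(axis=1)


features = np.ones((10, 4), dtype=np.float32)
patch_embeddings = aggregate_patches(features, patch_size=4)
print(patch_embeddings.shape)  # (3, 4)
```

With a patch size of 4, the language model runs over 3 positions instead of 10, which is where the reduction in autoregressive steps comes from.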

On the diffusion side, Grad-StyleSpeech: Any-Speaker Adaptive Text-to-Speech Synthesis with Diffusion Models (Kang et al., 2022, 64 cites) conditions a diffusion model on a few seconds of reference speech and significantly outperforms earlier speaker-adaptive TTS baselines (relative gain is clear in listening tests, though no absolute MOS numbers are provided). A Unified System for Voice Cloning and Voice Conversion through Diffusion Probabilistic Modeling (Sadekova et al., 2022, 14 cites) uses a single diffusion model with two encoders (text and speech) and a shared decoder. Adapting it to a new speaker requires only 15 seconds of untranscribed audio and 3 minutes of GPU time. That is an adaptation cost rather than a pre-training budget, so it isn't directly comparable to VALL-E's 60k hours, but it shows how little target-speaker data a diffusion system needs. For low-data voice-cloning use cases this unified system is the pragmatic choice; quality is competitive but not yet at VALL-E's naturalness.

The tension is unresolved: codec-LM models give emergent zero-shot generalisation but are data-hungry and slow; diffusion models are data-efficient and can run on CPU but haven't yet matched that breadth of generalisation. DiTAR suggests a hybrid future, at the cost of added complexity.

Prosody: The Last Frontier

Even the best neural TTS sounds robotic if prosody is flat. Several papers have tackled this, each taking a different philosophy about explicit vs. implicit control.

Robust and Fine-grained Prosody Control of End-to-end Speech Synthesis (Lee and Kim, 2018, 155 cites) proposed prosody embedding networks with temporal structures, frame-level or phoneme-level, plugged into an end-to-end TTS network. It enables frame- and phoneme-level control of pitch and amplitude and improves robustness against speaker perturbations during prosody transfer. The ablation shows that temporal normalisation is key to preventing speaker leakage. However, it requires training with target speech as supervision and lacks explicit prosody labels, making it hard to script.

Controllable neural text-to-speech synthesis using intuitive prosodic features (Raitio et al., 2020, 70 cites) takes the explicit route: condition a seq-to-seq model on five acoustic features (pitch, pitch range, phone duration, energy, and spectral tilt) to learn a latent prosody space. The MOS was 4.23 against a Tacotron baseline of 4.26, essentially no degradation while providing meaningful control. Ablation showed each feature contributes independently. This is the most practical approach for controlled TTS today: scriptable prosody settings per utterance, near-baseline quality, and an MOS gap of 0.03 that most listeners are unlikely to notice.
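One simple way to picture explicit prosody conditioning is to append the five feature values to every encoder output before decoding. The sketch below is illustrative only: the concatenation scheme, the [-1, 1] normalisation range, and all names are my assumptions rather than details taken from the paper.

```python
import numpy as np

PROSODY_FEATURES = ("pitch", "pitch_range", "phone_duration",
                    "energy", "spectral_tilt")


def condition_on_prosody(encoder_outputs: np.ndarray, prosody: dict) -> np.ndarray:
    """Append utterance-level prosody targets to each encoder position.

    encoder_outputs: (num_phonemes, hidden_dim).
    prosody: one value per feature name, assumed normalised to [-1, 1].
    """
    vec = np.array([prosody[name] for name in PROSODY_FEATURES],
                   dtype=encoder_outputs.dtype)
    if np.any(np.abs(vec) > 1.0):
        raise ValueError("prosody values are expected in [-1, 1]")
    # Broadcast the same 5-dim vector to every phoneme position.
    tiled = np.tile(vec, (encoder_outputs.shape[0], 1))
    return np.concatenate([encoder_outputs, tiled], axis=1)


encoder_out = np.zeros((4, 8), dtype=np.float32)
targets = {"pitch": 0.5, "pitch_range": 0.0, "phone_duration": -0.25,
           "energy": 0.25, "spectral_tilt": 0.0}
conditioned = condition_on_prosody(encoder_out, targets)
print(conditioned.shape)  # (4, 13)
```

At inference time the dictionary is the script: raise `pitch` or lower `phone_duration` and the decoder sees a different conditioning vector without any reference audio.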

Fine-Grained Prosody Modeling in Neural Speech Synthesis Using ToBI Representation (Zou et al., 2021, 16 cites) predicts ToBI (Tones and Break Indices) labels from text and uses them as additional input for syllable-level prosody control. It produces more natural speech than Tacotron and unsupervised baselines, and allows effective control of stress, intonation, and pause. The downside: you need a separate ToBI prediction module and annotated data. If you have a linguistics team, go for it; otherwise, the prosody features approach is easier.

Interestingly, Uncovering Latent Style Factors for Expressive Speech Synthesis (Wang et al., 2017, 53 cites) and Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis (Pan and He, 2021, 25 cites) show that unsupervised methods can also work, but they lack fine-grained control. For most engineers, explicit conditioning with a small MOS hit is the right trade-off.

Decision Framework: Choose Your Weapon

Based on the evidence, here's a grounded decision framework for selecting a neural TTS architecture:

| Requirement | Recommended Model | Key Evidence |
| --- | --- | --- |
| Real-time inference on consumer hardware | FastSpeech 2 + HiFi-GAN | FastSpeech achieves a 270× mel-generation speedup, 38× end to end (Ren et al., 2019); Fast Grad-TTS achieves RTF 0.15 on CPU (Vovk et al., 2022) |
| Zero-shot voice cloning with minimal data | Unified Diffusion System (15 s adaptation, 3 min GPU) | Unified System for Voice Cloning (Sadekova et al., 2022) |
| High-quality zero-shot with large budget | VALL-E (60k hr data) or DiTAR | VALL-E sets zero-shot standard (Wang et al., 2023); DiTAR reaches SOTA with patching (Jia et al., 2025) |
| Data-efficient training (under 1k hours) | Sample-Efficient Diffusion (U-AT) | Beats VALL-E in intelligibility with under 2% of the data (Lovelace et al., 2024) |
| Fine-grained prosody control | Intuitive prosodic features (pitch, duration, energy) | MOS 4.23 vs Tacotron baseline 4.26 (Raitio et al., 2020) |
| Production voice cloning | Grad-StyleSpeech or Unified System | Outperforms speaker-adaptive baselines (Kang et al., 2022; Sadekova et al., 2022) |

This framework is not exhaustive, but it saves you from chasing the wrong paper. Shoehorning VALL-E into a sub-1,000-hour application is the common anti-pattern — the data gap shows up immediately as garbled output. Pick a data-matched architecture first; everything else is downstream.

Practical Takeaways

  1. Remove autoregression if you care about latency. FastSpeech and its variants are the default for real-time applications. The speedups (270× for mel-spectrogram generation, 38× end to end) aren't just benchmark numbers; they are what makes TTS on commodity CPUs plausible.

  2. Diffusion is the future for small-data scenarios. The Sample-Efficient Diffusion paper reports more intelligible speech than VALL-E with less than 1,000 hours of training data and a U-AT. If you're building for a low-resource language, start there.

  3. Zero-shot generalisation requires scale or a hybrid approach. VALL-E's 60,000-hour training corpus is out of reach for most teams. DiTAR's patch-based hybrid might bridge the gap, but it's still complex. For now, the unified diffusion system with 15 seconds of adaptation is the most practical cloning method.

  4. Prosody control is largely solved, at a small quality cost. Explicit feature conditioning retains about 99% of the Tacotron baseline's MOS (4.23 vs. 4.26) while giving you direct control. Use it; your users will appreciate being able to adjust speaking rate and pitch.

  5. Watch for alignment errors in distilled models. FastSpeech's duration predictor inherits teacher model flaws. Always validate with a few hundred utterances before deploying.
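For the last takeaway, a cheap first pass over a validation set is to scan predicted durations for obvious outliers before listening to anything. This is a hedged sketch: `flag_suspicious_durations` and the 60-frame threshold are hypothetical and should be tuned to your frame rate and language.

```python
import numpy as np


def flag_suspicious_durations(durations_per_utterance, max_frames=60):
    """Return indices of utterances whose phoneme durations look wrong.

    An utterance is flagged if it is empty, if any phoneme gets zero or
    negative frames (likely skipped), or if any phoneme exceeds max_frames
    (likely an inherited pause or a stuck alignment).
    """
    flagged = []
    for index, durations in enumerate(durations_per_utterance):
        d = np.asarray(durations)
        if d.size == 0 or d.min() <= 0 or d.max() > max_frames:
            flagged.append(index)
    return flagged


predicted = [
    [3, 5, 4],      # plausible
    [2, 0, 7],      # a phoneme with zero frames
    [4, 120, 3],    # one phoneme held far too long
]
suspicious = flag_suspicious_durations(predicted)
print(suspicious)  # [1, 2]
```

Flagged utterances are the ones worth listening to first; the check does not replace a listening test, it only prioritises it.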

Where the Field Is Headed

The tensions I highlighted — scale vs. sample efficiency, speed vs. flexibility, explicit vs. implicit prosody — aren't going away. But I see convergence. DiTAR shows that combining autoregressive and diffusion paradigms can give the best of both worlds. Sample-Efficient Diffusion suggests that pre-trained autoencoders can drastically cut data needs. And the prosody literature is moving toward hybrid models that learn implicit style tokens but allow explicit override.

For the practitioner, the message is clear: don't commit to one paradigm. Build a modular pipeline where you can swap the acoustic model (FastSpeech for speed, diffusion for quality) and the vocoder independently. The field is moving too fast to bet on a single architecture.
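One way to keep the pipeline modular is to define narrow interfaces for the two stages and let the pipeline depend only on those. The sketch below uses `typing.Protocol`; the class names, method names, and dummy components are all hypothetical placeholders for real models.

```python
from typing import Protocol

import numpy as np


class AcousticModel(Protocol):
    def text_to_mel(self, text: str) -> np.ndarray: ...


class Vocoder(Protocol):
    def mel_to_wave(self, mel: np.ndarray) -> np.ndarray: ...


class TTSPipeline:
    """Composes any acoustic model with any vocoder."""

    def __init__(self, acoustic_model: AcousticModel, vocoder: Vocoder):
        self.acoustic_model = acoustic_model
        self.vocoder = vocoder

    def synthesize(self, text: str) -> np.ndarray:
        mel = self.acoustic_model.text_to_mel(text)
        return self.vocoder.mel_to_wave(mel)


class DummyAcousticModel:
    # Placeholder: 5 frames per character, 80 mel bins.
    def text_to_mel(self, text: str) -> np.ndarray:
        return np.zeros((len(text) * 5, 80), dtype=np.float32)


class DummyVocoder:
    hop_length = 256  # samples per mel frame (illustrative value)

    def mel_to_wave(self, mel: np.ndarray) -> np.ndarray:
        return np.zeros(mel.shape[0] * self.hop_length, dtype=np.float32)


pipeline = TTSPipeline(DummyAcousticModel(), DummyVocoder())
wave = pipeline.synthesize("hello")
print(wave.shape)  # (6400,)
```

Swapping a FastSpeech-style model for a diffusion model then means writing one new class that satisfies `AcousticModel`, with no changes to the vocoder or the calling code.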

Final thought: the best neural TTS in 2025 is not the one with the highest MOS on a standard dataset. It's the one that works with your data, your latency budget, and your control requirements. Measure twice, deploy once.