Qwen3-TTS for Swiss German: Dialect Adaptation via Talker Layer Fine-Tuning

Swiss German text-to-speech has a straightforward problem: there is essentially no commercially usable training data. Five million people speak Swiss German, yet no labeled TTS dataset exists that you can freely train on. Voice actors are understandably reluctant to record audio for models that might replace them.

This forces a targeted fine-tuning approach. Rather than training a model from scratch, we adapted Qwen3-TTS — a strong open-source TTS model from Alibaba’s Qwen team — to produce natural Swiss German speech with limited data.

Key Takeaways

We trained only the last 8 talker layers of Qwen3-TTS, leaving all earlier layers frozen.
We created a dedicated speaker per dialect to capture dialect-specific acoustic characteristics.
We initialized each dialect-specific speaker by copying the most similar existing speaker from the base model.
The model’s instruction-following capabilities (expressive speech styles) transferred cleanly to Swiss German.
Results shown here are for Züridüütsch — the Zürich dialect — since that’s what I speak.

The Architecture: What Are Talker Layers?

Qwen3-TTS separates text understanding from speech generation. The speech generation component — the talker — is a transformer stack that maps language representations to acoustic tokens. These tokens are then decoded to audio by a vocoder.

The talker layers closest to the output are the most acoustically specific: they encode the fine-grained phonetic texture of the target language. The early layers handle more general structure shared across languages.

Our approach: freeze all but the last 8 talker layers, and train those with the Swiss German data we have. This is memory-efficient, trains quickly, and preserves the model’s general speech quality while teaching it dialect-specific phonology.
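The freezing step can be sketched as follows. This is a minimal illustration, not the training code used here: it assumes parameters are named like `talker.layers.<index>.…` (a hypothetical pattern; the real checkpoint may use different names) and works on any iterable of (name, parameter) pairs, such as PyTorch's `model.named_parameters()`. The demo uses simple stand-in objects so it runs without a model.

```python
import re
from types import SimpleNamespace

# Hypothetical parameter-name pattern; the real Qwen3-TTS checkpoint may name
# its talker layers differently, so inspect the parameter names first.
TALKER_LAYER_RE = re.compile(r"^talker\.layers\.(\d+)\.")


def freeze_all_but_last_talker_layers(named_params, last_n=8):
    """Set requires_grad=True only for the last `last_n` talker layers.

    `named_params` is an iterable of (name, param) pairs, for example
    model.named_parameters() in PyTorch. Returns the trainable names.
    """
    named_params = list(named_params)
    layer_ids = [int(m.group(1)) for name, _ in named_params
                 if (m := TALKER_LAYER_RE.match(name))]
    if not layer_ids:
        raise ValueError("no talker layers matched; check the name pattern")
    # Index of the first layer that stays trainable.
    first_trainable = max(layer_ids) + 1 - last_n
    trainable = []
    for name, param in named_params:
        m = TALKER_LAYER_RE.match(name)
        # Everything outside the talker stack, and every earlier talker
        # layer, is frozen.
        param.requires_grad = bool(m) and int(m.group(1)) >= first_trainable
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-in for a model: one text-encoder weight plus 12 talker layers.
demo_params = [("text_encoder.embed.weight", SimpleNamespace(requires_grad=True))]
demo_params += [(f"talker.layers.{i}.mlp.weight", SimpleNamespace(requires_grad=True))
                for i in range(12)]
demo_trainable = freeze_all_but_last_talker_layers(demo_params, last_n=8)
```

With a real model, only the returned parameters would be handed to the optimizer, which is where the memory savings come from: no gradients or optimizer state are kept for the frozen layers.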

[Diagram: Text Encoder (frozen), Early Talker (frozen), Mid Talker (frozen), Last 8 Talker (trained)]

One Speaker Per Dialect

Rather than trying to cover all Swiss German dialects in a single model, we created a dedicated speaker embedding per dialect. Each speaker is optimized for a specific regional variant. The model generates speech conditioned on the selected speaker.
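Initializing a new dialect speaker from an existing one can be sketched like this. It is an illustration under assumptions: speakers are modeled as a plain dictionary of embedding vectors, and the names `base_speaker_a`, `base_speaker_b`, and `zueridueuetsch` are made up. How the "most similar" base speaker is chosen is not covered here, so the sketch takes the source speaker as an input.

```python
import copy


def add_dialect_speaker(speaker_table, source_speaker, new_speaker):
    """Add `new_speaker`, initialized as an independent copy of `source_speaker`.

    `speaker_table` maps speaker names to embedding vectors (plain lists here;
    in a real model these would typically be rows of an embedding matrix).
    """
    if source_speaker not in speaker_table:
        raise KeyError(f"unknown source speaker: {source_speaker}")
    if new_speaker in speaker_table:
        raise ValueError(f"speaker already exists: {new_speaker}")
    # Deep copy so that training the new speaker does not alter the original.
    speaker_table[new_speaker] = copy.deepcopy(speaker_table[source_speaker])
    return speaker_table[new_speaker]


# Hypothetical base speakers with toy 3-dimensional embeddings.
speakers = {
    "base_speaker_a": [0.12, -0.40, 0.88],
    "base_speaker_b": [0.05, 0.31, -0.27],
}
zh_vector = add_dialect_speaker(speakers, "base_speaker_a", "zueridueuetsch")
zh_vector[0] += 0.01  # stand-in for a fine-tuning update to the new speaker
```

Starting from a copy rather than a random vector means the new speaker already produces usable speech before any dialect training, and the original speaker stays untouched.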

The examples below use Züridüütsch (the Zürich dialect).

Results: Before and After

The base Qwen3-TTS model has no systematic exposure to Swiss German. When given Swiss German text, it applies Standard German phonetics: the output is intelligible, but it sounds clearly off to native speakers.

Base Model (no fine-tuning)

Standard German phonetics applied to Swiss German text. The rhythm, vowel qualities, and consonant clusters are incorrect.

Fine-Tuned (Züridüütsch speaker)

Natural Züridüütsch prosody and vowel rendering. The characteristic consonant clusters and melodic contour are correct.

❌ Base Qwen3-TTS — no fine-tuning

"Mier chöntet ja au in Mac Take Away gah am halbi achti und eus am nüni am Bürkliplatz traffe."

✅ Fine-Tuned — Züridüütsch speaker

"Mier chöntet ja au in Mac Take Away gah am halbi achti und eus am nüni am Bürkliplatz traffe."

Expressive Speech: Instruction-Following Survives Fine-Tuning

Qwen3-TTS accepts natural language instructions about speaking style. After fine-tuning, these capabilities carried over to Swiss German intact. I chose the test sentence above because it combines English words, Swiss named entities, and Swiss German syntax, which makes it a good test of whether the model can handle code-switching and dialectal features while following style instructions.

🎭 "Speak like you are suppressing a laughter"