Swiss German text-to-speech has a straightforward problem: there is essentially no commercially usable training data. Five million people speak Swiss German, yet no labeled TTS dataset exists that you can freely train on. Voice actors are understandably reluctant to record audio for models that might replace them.
This forces a targeted fine-tuning approach. Rather than training a model from scratch, we adapted Qwen3-TTS — a strong open-source TTS model from Alibaba’s Qwen team — to produce natural Swiss German speech with limited data.
Key Takeaways
- We trained only the last 8 talker layers of Qwen3-TTS, leaving all earlier layers frozen.
- We created a dedicated speaker per dialect to capture dialect-specific acoustic characteristics.
- As a starting point for the dedicated speakers, we copied the most similar existing speaker from the base model to the dialect-specific speaker.
- The model’s instruction-following capabilities (expressive speech styles) transferred to Swiss German largely intact.
- Results shown here are for Züridüütsch — the Zürich dialect — since that’s what I speak.
The Architecture: What Are Talker Layers?
Qwen3-TTS separates text understanding from speech generation. The speech generation component — the talker — is a transformer stack that maps language representations to acoustic tokens. These tokens are then decoded to audio by a vocoder.
The talker layers closest to the output are the most acoustically specific: they encode the fine-grained phonetic texture of the target language. The early layers handle more general structure shared across languages.
Our approach: freeze all but the last 8 talker layers, and train those with the Swiss German data we have. This is memory-efficient, trains quickly, and preserves the model’s general speech quality while teaching it dialect-specific phonology.
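The layer selection can be sketched as a small helper that decides, by parameter name, which weights stay trainable. This is a minimal illustration and not the actual training code. The `talker.layers.<index>.` naming scheme, the toy 12-layer stack, and the `text_encoder.embed.weight` name are all assumptions; the real checkpoint may name and count its modules differently.

```python
import re

# Hypothetical naming scheme: "talker.layers.<index>.<rest>".
# The real Qwen3-TTS checkpoint may use different module names.
LAYER_RE = re.compile(r"^talker\.layers\.(\d+)\.")


def select_trainable(param_names, num_trainable=8):
    """Return the names of parameters in the last `num_trainable` talker layers."""
    layer_of = {}
    for name in param_names:
        match = LAYER_RE.match(name)
        if match:
            layer_of[name] = int(match.group(1))
    if not layer_of:
        return set()
    # Index of the first layer that remains trainable.
    first_trainable = max(layer_of.values()) + 1 - num_trainable
    return {name for name, idx in layer_of.items() if idx >= first_trainable}


# Toy example: a 12-layer talker plus one parameter outside the talker.
names = [f"talker.layers.{i}.attn.weight" for i in range(12)]
names.append("text_encoder.embed.weight")  # made-up name, stays frozen
trainable = select_trainable(names, num_trainable=8)
```

With PyTorch, the selection could then be applied by looping over `model.named_parameters()` and setting `param.requires_grad = name in trainable`, so the optimizer only updates the selected layers. In this sketch everything outside the last 8 talker layers is treated as frozen.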
One Speaker Per Dialect
Rather than trying to cover all Swiss German dialects in a single model, we created a dedicated speaker embedding per dialect. Each speaker is optimized for a specific regional variant. The model generates speech conditioned on the selected speaker.
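Initializing a dialect speaker from the most similar existing speaker, as mentioned in the takeaways, might look roughly like the sketch below. The post does not say how similarity was measured, so cosine similarity against a reference embedding is an assumption here. The speaker names, the 3-dimensional vectors, and the `zueriduetsch` slot name are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def init_dialect_speaker(speakers, reference, new_name):
    """Copy the existing speaker closest to `reference` into a new slot."""
    best = max(speakers, key=lambda name: cosine(speakers[name], reference))
    # Copy the vector so that training the new speaker leaves the source untouched.
    speakers[new_name] = list(speakers[best])
    return best


# Toy embeddings; real speaker embeddings have many more dimensions.
speakers = {
    "speaker_a": [1.0, 0.0, 0.0],
    "speaker_b": [0.0, 1.0, 0.2],
}
reference = [0.1, 0.9, 0.3]  # hypothetical embedding derived from dialect audio
source = init_dialect_speaker(speakers, reference, "zueriduetsch")
```

Starting from a copy rather than a random vector means the new speaker begins in a region of the embedding space the model already handles well, which is plausibly helpful when data is limited.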
The examples below use Züridüütsch (the Zürich dialect).
Results: Before and After
The base Qwen3-TTS model has no systematic exposure to Swiss German. When given Swiss German text, it pronounces it with Standard German phonetics: the output is intelligible, but sounds clearly off to native speakers.
Before (base model): Standard German phonetics applied to Swiss German text. The rhythm, vowel qualities, and consonant clusters are incorrect.
After (fine-tuned model): Natural Züridüütsch prosody and vowel rendering. The characteristic consonant clusters and melodic contour are correct.
Expressive Speech: Instruction-Following Survives Fine-Tuning
Qwen3-TTS accepts natural language instructions about speaking style. After fine-tuning, these capabilities carried over to Swiss German largely intact. I chose the test sentence because it combines English words, Swiss named entities, and Swiss German syntax, which makes it a good test of the model’s ability to handle code-switching and dialectal features while following style instructions.
The largely clean transfer of instruction-following suggests that the last-8-layer strategy preserves the model’s general capabilities while adapting only the acoustically specific components.
What’s Next
- There is still a lot to do; the output is far from perfect. However, the results suggest that by freezing the early layers and training only the last ones, we can adapt the model to a new dialect with limited data while preserving its general speech quality and instruction-following capabilities.
- Some instruction-following is lost; see, for example, the “fear” example above.
Presented at the Basel Data Science & AI Meetup, April 2026. Work done at FHNW Institute for Data Science.