SwissText 2025 - Whisper Finetuning
Fine-tuning Whisper on sentence-level data looked good in validation but failed on long-form production audio. The real gains came from training for deployment reality: long context, memory constraints, and dialect variance.
Key Takeaways
- Sentence-only fine-tuning can regress on real long-form ASR.
- Synthetic stitched long-form audio improves robustness and timestamps.
- Memory optimizations made large-model training practical on a single A100.
At SwissText 2025, I presented our work on fine-tuning OpenAI’s Whisper (Large-v2) for Swiss German. The core challenge wasn’t just transcription but translation: mapping diverse Swiss-German dialects to Standard German text.
The Summary
1. The “Real-World” Gap
Most available datasets are sentence-level, but real-world ASR (meetings, phone calls, medical discussions) is long-form. We found that models fine-tuned only on sentence-level data actually perform worse on long-form audio, often getting stuck transcribing endless sequences of periods (“…….”) or producing unreliable timestamps.
2. Synthetic Long-Form Data
To fix this, we developed a pipeline to “stitch” sentence-level data into synthetic long-form recordings. This forced the model to learn the context and continuity required for real-world applications.
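The core of such a stitching step can be sketched in a few lines: concatenate sentence clips with short pauses and keep per-segment timestamps for training. This is a minimal illustration, not our actual pipeline; the function name, the fixed pause length, and the segment format are all assumptions.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Whisper expects 16 kHz audio


def stitch_clips(clips, transcripts, pause_s=0.5):
    """Concatenate sentence-level clips into one synthetic long-form
    waveform, inserting silence between sentences and recording the
    start/end timestamp of each sentence for timestamp training."""
    pause = np.zeros(int(pause_s * SAMPLE_RATE), dtype=np.float32)
    pieces, segments, t = [], [], 0.0
    for i, (clip, text) in enumerate(zip(clips, transcripts)):
        if i > 0:  # pause only between sentences, not at the edges
            pieces.append(pause)
            t += pause_s
        start = t
        pieces.append(clip.astype(np.float32))
        t += len(clip) / SAMPLE_RATE
        segments.append({"start": round(start, 2), "end": round(t, 2), "text": text})
    return np.concatenate(pieces), segments
```

In practice one would also vary the pause length and shuffle sentence order across epochs, so the model sees diverse long-form contexts rather than one fixed concatenation.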
3. VRAM Optimization
Training a model as large as Whisper Large-v2 on a single A100 (40GB) is a memory nightmare. We successfully scaled our batch size from 1 to 16 by implementing:
- Activation Checkpointing: Trading computation for memory.
- 8-Bit Optimizers: Reducing the VRAM footprint of the AdamW optimizer by ~9GB.
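The ~9GB figure follows from simple optimizer-state arithmetic: standard AdamW keeps two fp32 moment tensors (m and v) per parameter, while an 8-bit optimizer stores each in one byte. A back-of-the-envelope check (the parameter count for Whisper Large-v2 is approximate):

```python
def adamw_state_gib(n_params: float, bytes_per_state: int) -> float:
    """Memory for AdamW's two moment tensors (first and second moment)."""
    return n_params * 2 * bytes_per_state / 2**30


WHISPER_LARGE_PARAMS = 1.55e9  # approx. parameter count of Whisper Large-v2

fp32_gib = adamw_state_gib(WHISPER_LARGE_PARAMS, 4)  # standard AdamW (fp32 states)
int8_gib = adamw_state_gib(WHISPER_LARGE_PARAMS, 1)  # 8-bit optimizer states
print(f"fp32: {fp32_gib:.1f} GiB, 8-bit: {int8_gib:.1f} GiB, "
      f"saved: {fp32_gib - int8_gib:.1f} GiB")
```

That difference of roughly 9 GiB is exactly the headroom that, together with activation checkpointing, let us grow the batch size from 1 to 16 on a 40GB A100.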
4. Results: Closing the Dialect Gap
Dialect-level WER comparison before and after fine-tuning.
Our fine-tuned model significantly outperformed the base Whisper model across all Swiss dialects. Notably, it showed much lower variance between tricky dialects like Valais, Bern, and Basel, and handled named entities better (e.g., correctly transcribing “Brugg” instead of the Standard German “Brücke”).
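For readers who want to reproduce the comparison: WER, the metric behind the dialect plot, is just a word-level Levenshtein distance normalized by reference length. A minimal self-contained implementation (in practice a library like `jiwer` does the same with extra normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by the
    number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / len(ref)
```

On the “Brugg”/“Brücke” example, a single substituted word in a four-word reference costs 0.25 WER, which is why entity errors weigh heavily on short Swiss place-name utterances.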
Downloads & Resources
For the full technical deep-dive, you can download the original presentation or explore the code:
- 📄 Download Presentation PDF
- 🛠️ Code: whisper-prep & whisper-finetune