SwissText 2025 - Whisper Finetuning
Fine-tuning Whisper on sentence-level data looked good in validation but failed on long-form production audio. The real gains came from training for deployment reality: long context, memory constraints, and dialect variance.
Key Takeaways
- Sentence-only fine-tuning can regress on real long-form ASR.
- Synthetic stitched long-form audio improves robustness and timestamps.
- Memory optimizations made large-model training practical on a single A100.
At SwissText 2025, I presented our work on fine-tuning OpenAI’s Whisper (Large-v2) for Swiss German. The core challenge wasn’t just transcription but translation: mapping diverse Swiss-German dialects to Standard German text.
The Summary
1. The “Real-World” Gap
Most available datasets are sentence-level, but real-world ASR (meetings, phone calls, medical discussions) is long-form. We found that models fine-tuned only on sentence-level data actually perform worse on long-form audio, often getting stuck transcribing endless sequences of periods (“…….”) or producing unreliable timestamps.
2. Synthetic Long-Form Data
To fix this, we developed a pipeline to “stitch” sentence-level data into synthetic long-form recordings. This forced the model to learn the context and continuity required for real-world applications.
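The core of such a stitching step can be sketched in a few lines: concatenate sentence clips with short pauses and keep per-segment timestamps for training. This is a minimal illustration, not our actual pipeline; the function name, the fixed pause length, and the segment format are all assumptions.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Whisper expects 16 kHz audio


def stitch_clips(clips, transcripts, pause_s=0.5):
    """Concatenate sentence-level clips into one synthetic long-form
    waveform, inserting silence between sentences and recording the
    start/end timestamp of each sentence for timestamp training."""
    pause = np.zeros(int(pause_s * SAMPLE_RATE), dtype=np.float32)
    pieces, segments, t = [], [], 0.0
    for i, (clip, text) in enumerate(zip(clips, transcripts)):
        if i > 0:  # pause only between sentences, not at the edges
            pieces.append(pause)
            t += pause_s
        start = t
        pieces.append(clip.astype(np.float32))
        t += len(clip) / SAMPLE_RATE
        segments.append({"start": round(start, 2), "end": round(t, 2), "text": text})
    return np.concatenate(pieces), segments
```

In practice one would also vary the pause length and shuffle sentence order across epochs, so the model sees diverse long-form contexts rather than one fixed concatenation.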
3. VRAM Optimization
Training a model as large as Whisper Large-v2 on a single A100 (40GB) is a memory nightmare. We successfully scaled our batch size from 1 to 16 by implementing:
- Activation Checkpointing: Trading computation for memory.
- 8-Bit Optimizers: Reducing the VRAM footprint of the AdamW optimizer by ~9GB.
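The ~9GB figure follows from simple optimizer-state arithmetic: standard AdamW keeps two fp32 moment tensors (m and v) per parameter, while an 8-bit optimizer stores each in one byte. A back-of-the-envelope check (the parameter count for Whisper Large-v2 is approximate):

```python
def adamw_state_gib(n_params: float, bytes_per_state: int) -> float:
    """Memory for AdamW's two moment tensors (first and second moment)."""
    return n_params * 2 * bytes_per_state / 2**30


WHISPER_LARGE_PARAMS = 1.55e9  # approx. parameter count of Whisper Large-v2

fp32_gib = adamw_state_gib(WHISPER_LARGE_PARAMS, 4)  # standard AdamW (fp32 states)
int8_gib = adamw_state_gib(WHISPER_LARGE_PARAMS, 1)  # 8-bit optimizer states
print(f"fp32: {fp32_gib:.1f} GiB, 8-bit: {int8_gib:.1f} GiB, "
      f"saved: {fp32_gib - int8_gib:.1f} GiB")
```

That difference of roughly 9 GiB is exactly the headroom that, together with activation checkpointing, let us grow the batch size from 1 to 16 on a 40GB A100.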
4. Results: Closing the Dialect Gap
Dialect-level WER comparison before and after fine-tuning.
Our fine-tuned model significantly outperformed the base Whisper model across all Swiss dialects. Notably, it showed much lower variance between tricky dialects like Valais, Bern, and Basel, and handled named entities better (e.g., correctly transcribing “Brugg” instead of the Standard German “Brücke”).
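For readers who want to reproduce the comparison: WER, the metric behind the dialect plot, is just a word-level Levenshtein distance normalized by reference length. A minimal self-contained implementation (in practice a library like `jiwer` does the same with extra normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by the
    number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / len(ref)
```

On the “Brugg”/“Brücke” example, a single substituted word in a four-word reference costs 0.25 WER, which is why entity errors weigh heavily on short Swiss place-name utterances.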
Downloads & Resources
For the full technical deep-dive, you can download the original presentation or explore the code:
- 📄 Download Presentation PDF
- 🛠️ Code: whisper-prep & whisper-finetune