Phoebe's Voice: Building a Text-to-Speech Engine That Actually Sounds Turkish
Reading Time: 2 minutes
Ece Ünal
Senior AI Software Engineer
Ali Kemal Coşkun
AI Software Engineer
Modern text-to-speech systems are largely shaped by a fundamental question: how do you make synthesized speech sound natural?
Most TTS systems operate on characters or subword units. This works reasonably well for many languages, but it falls short for Turkish. Turkish spelling is largely phonetic, yet naturalness depends heavily on how sounds are timed relative to one another: vowel length, syllable pacing, and stress patterns all carry linguistic weight. When these are flattened, speech can be correct at the word level and still sound unnatural. This is the problem that Commencis engineers Ece Ünal and Ali Kemal Coşkun set out to solve when building Phoebe, Commencis’s Voice AI platform.
Their approach treats phonetic timing not as an emergent byproduct of acoustic generation, but as a core component of the system itself. The result is a hybrid architecture that combines a diffusion-based backbone with a dedicated phoneme-duration layer, yielding speech that is not only intelligible but also aligned with the natural rhythm of Turkish.
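To make the idea of an explicit duration layer concrete, here is a minimal sketch of duration-based length regulation, a common technique in which each phoneme is repeated for a predicted number of acoustic frames before the acoustic model runs. This teaser does not describe Phoebe's implementation, so everything here is an assumption for illustration: the function name `expand_by_duration`, the phoneme symbols, and the frame counts are all hypothetical.

```python
from typing import List, Sequence


def expand_by_duration(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme for its (hypothetical) predicted number of frames."""
    if len(phonemes) != len(durations):
        raise ValueError("exactly one duration per phoneme is required")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")

    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        # The timing decision is made here, explicitly, rather than being
        # left to emerge from the acoustic generator.
        frames.extend([phoneme] * n_frames)
    return frames


# Illustrative only: "hala" (aunt) and "hâlâ" (still) share the same phoneme
# sequence in this simplified view and differ in vowel length.
# The frame counts below are made up for the example.
phonemes = ["h", "a", "l", "a"]
short_vowels = expand_by_duration(phonemes, [3, 4, 3, 4])
long_vowels = expand_by_duration(phonemes, [3, 8, 3, 8])

print(len(short_vowels), len(long_vowels))
```

In a sketch like this, the two words produce frame sequences of different lengths from identical phoneme inputs, which is the kind of distinction a character-level model can flatten when timing is not modeled directly.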
Read the full article to see how the system was designed, what trade-offs were made, and how Phoebe performs against widely used production-grade TTS systems.

