Phoebe's Voice: Building a Text-to-Speech Engine That Actually Sounds Turkish

27/04/2026

Reading Time: 2 minutes

Ece Ünal

Senior AI Software Engineer

Ali Kemal Coşkun

AI Software Engineer

Modern text-to-speech systems are largely shaped by a fundamental question: how do you make synthesized speech sound natural?


Most TTS systems operate on characters or subword units. While this works reasonably well for many languages, it imposes significant limitations on Turkish. Turkish spelling is largely phonetic, but naturalness depends heavily on how sounds are timed relative to each other. Vowel length, syllable pacing, and stress patterns all carry linguistic weight. When these are flattened, speech can be correct at the word level and still sound unnatural. This is the problem that Commencis engineers Ece Ünal and Ali Kemal Coşkun set out to solve when building Phoebe, Commencis’s Voice AI platform.

Their approach treats phonetic timing not as an emergent byproduct of acoustic generation, but as a core component of the system itself. The result is a hybrid architecture that combines a diffusion-based backbone with a dedicated phoneme-duration layer, yielding speech that is not only intelligible but also aligned with the natural rhythm of Turkish.
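This summary does not show Phoebe's implementation, so the following is only a minimal sketch of what an explicit duration stage can look like in general: each phoneme carries its own duration, and the sequence is expanded to one symbol per acoustic frame before the acoustic model sees it. The names `TimedPhoneme` and `expand_to_frames`, the 10 ms frame size, and all duration values are assumptions made for illustration, not details from the article.

```python
from dataclasses import dataclass


@dataclass
class TimedPhoneme:
    symbol: str         # phoneme label (illustrative notation)
    duration_ms: float  # duration predicted by a hypothetical duration layer


def expand_to_frames(phonemes, frame_ms=10.0):
    """Repeat each phoneme symbol once per acoustic frame it should occupy."""
    frames = []
    for p in phonemes:
        # Every phoneme gets at least one frame so it is never dropped.
        n_frames = max(1, round(p.duration_ms / frame_ms))
        frames.extend([p.symbol] * n_frames)
    return frames


# Invented durations: the same syllable with a short and a long vowel.
short_vowel = [TimedPhoneme("k", 60), TimedPhoneme("a", 80), TimedPhoneme("r", 50)]
long_vowel = [TimedPhoneme("k", 60), TimedPhoneme("a:", 160), TimedPhoneme("r", 50)]

print(len(expand_to_frames(short_vowel)))
print(len(expand_to_frames(long_vowel)))
```

In this sketch the difference in vowel length survives as a difference in frame count, which is the kind of timing information a character-level pipeline may flatten.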

Read the full article to see how the system was designed, what trade-offs were made, and how Phoebe performs against widely used production-grade TTS systems.

Read on Medium
