Speech synthesis programs convert written input into spoken output by generating synthetic speech. This process is often referred to as text-to-speech (TTS) conversion.
There are several ways to perform speech synthesis:
- Record the voice of a person saying the required phrases.
- Use algorithms that split speech into smaller pieces, often an inventory of 35-50 phonemes (the smallest linguistic units). Quality suffers, though, because recombining the pieces into a fluent speech pattern is complex.
- Use diphones, the most developed method. Phrases are split not at the transitions between phonemes but at the center of each phoneme, which leaves the transitions intact. This results in 400 separate usable elements and a better-quality product.
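To illustrate the difference between phoneme and diphone units, here is a minimal Python sketch. It assumes the input has already been converted to a phoneme sequence; the function name `to_diphones`, the phoneme labels, and the `_` silence marker are hypothetical choices for illustration, not part of any particular product.

```python
def to_diphones(phonemes, silence="_"):
    """Pair each phoneme with its neighbor, so that every unit runs from the
    middle of one phoneme to the middle of the next and therefore contains
    the transition between them."""
    # Pad with silence so the first and last phonemes also get a transition.
    padded = [silence, *phonemes, silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# Hypothetical phoneme labels for the word "cat".
units = to_diphones(["k", "ae", "t"])
print(units)
```

A sequence of three phonemes yields four diphone units here, because the silence at each end counts as a neighbor.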
The methods above are known as concatenative processes. Concatenative TTS assembles its output from wave files of recorded human speech. These systems can be large and require a lot of drive space to run, but they offer a more natural-sounding output.
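The joining step in a concatenative system can be sketched roughly as follows. This is an illustration under stated assumptions rather than how any specific product works: each recorded unit is represented as a plain list of audio samples, and the linear crossfade used to smooth each boundary is an assumed technique chosen for simplicity. The name `crossfade_concat` is hypothetical.

```python
def crossfade_concat(units, overlap):
    """Join lists of audio samples, blending `overlap` samples at each boundary."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        # Never blend more samples than either side actually has.
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit rises across the overlap
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        # Append the part of the incoming unit that was not blended.
        out.extend(unit[n:])
    return out


# Two tiny made-up "recordings": one loud, one silent.
joined = crossfade_concat([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]], overlap=1)
print(joined)
```

Because the overlapping samples are shared between neighbors, the result is shorter than the sum of the unit lengths by `overlap` samples per boundary.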
Another method, synthesized TTS, creates speech by generating sounds through a digitized speech format. The output sounds more like a computer than a human, but such a system can run using just a few megabytes of space.
Products, whether concatenative or synthesized, are usually measured by their intelligibility, naturalness, and text preprocessing capabilities (the ability to convert acronyms into normal speech).
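As a rough illustration of text preprocessing, the Python sketch below rewrites acronyms before the text reaches the synthesizer. The `EXPANSIONS` table and the function name `expand_acronyms` are hypothetical, and the rule of spelling out unknown acronyms letter by letter is an assumption made for this example; real products are likely to use much larger dictionaries and more careful rules.

```python
import re

# Hypothetical lookup table of acronyms with a known spoken form.
EXPANSIONS = {"TTS": "text to speech"}


def expand_acronyms(text):
    """Replace runs of two or more capital letters with a speakable form."""
    def speakable(match):
        word = match.group(0)
        if word in EXPANSIONS:
            return EXPANSIONS[word]
        # Unknown acronym: separate the letters so each is read out individually.
        return " ".join(word)

    return re.sub(r"\b[A-Z]{2,}\b", speakable, text)


print(expand_acronyms("TTS on a PC"))
```

The lookup table takes priority over letter-by-letter spelling, so an acronym that is normally read as a phrase can be given its full wording.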