
If you’re using Metavoicer, you’ll know how easily it can read text in over 70 languages, using a wide range of voices and accents. But how does a simple app on your smartphone manage to accomplish such a complex task? How, in short, does text-to-speech (TTS) technology work?
There are three stages to effective TTS. Firstly, the software needs to “understand” the text. Secondly, it needs to convert the words into sounds. Thirdly, it needs to produce those sounds as audio.
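As a rough mental model (and nothing like Metavoicer’s real code), you can picture those three stages as functions chained together:

```python
# A toy skeleton of the three-stage pipeline; every body is a placeholder,
# and each stage is unpacked in the rest of this article.

def normalize(text: str) -> list[str]:
    """Stage 1: "understand" the text and resolve any ambiguity."""
    return text.split()                 # placeholder

def to_phonemes(words: list[str]) -> list[str]:
    """Stage 2: convert the words into units of sound."""
    return [w.lower() for w in words]   # placeholder

def synthesize(phonemes: list[str]) -> bytes:
    """Stage 3: turn those units of sound into actual audio."""
    return b""                          # placeholder

audio = synthesize(to_phonemes(normalize("In 1492, Columbus sailed the ocean blue.")))
```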
You may be wondering why the “understanding” phase is necessary. It would be possible just to record a sound for each word and play that back, but the result would be ugly, robotic, and difficult to listen to. It would also be open to misunderstanding.
For example, if the text includes the word “bow”, is that a polite bending of the waist, or a weapon for shooting arrows? The pronunciation is different according to the meaning. These “heteronyms”, as they’re called, are common in English. Is “lead” a clue in a detective story, or a heavy metal? Is “second” a short period of time, or is someone being sent to work in a different team?
It’s not just words. Numbers are spoken differently according to how they’re used. 1492 might be the year Columbus sailed the ocean blue, in which case it’s “fourteen ninety-two.” It might be the number of paper clips in a box, and then it’s “one thousand four hundred and ninety-two.” Or it might be a PIN code, in which case it’s spoken “one four nine two.”
Computers decide on these pronunciations in the same way humans do: by looking at the context. If someone did something “in” a number, for example, that number is almost certainly a year. And, as with humans, they make the best guess they can based on the information given. This process is called “text normalization.”
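To make that concrete, here is a minimal, rule-based sketch of how context might steer the reading of a number like 1492. It is only an illustration: the three rules below are invented for the example, and real systems (including Metavoicer) replace hand-written rules like these with statistical models, two of which are described next.

```python
# Toy text normalization: decide HOW a number should be read from its context.
import re

def number_style(text: str, number: str) -> str:
    """Return a reading style for `number` as it appears in `text`."""
    before = text[: text.index(number)].lower()
    if re.search(r"\b(pin|code)\b", before):
        return "digits"      # "one four nine two"
    if re.search(r"\bin\s*$", before) and len(number) == 4:
        return "year"        # "fourteen ninety-two"
    return "cardinal"        # "one thousand four hundred and ninety-two"

print(number_style("Columbus sailed in 1492", "1492"))         # year
print(number_style("The box holds 1492 paper clips", "1492"))  # cardinal
print(number_style("My PIN is 1492", "1492"))                  # digits
```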
There are two main approaches by which computers make these guesses. The first is based on “Hidden Markov Models.” Markov was a Russian mathematician who worked on chains of probability: if it’s raining today, is it more likely to be raining tomorrow? Then how do the chances of rain tomorrow affect the weather the day after? And the day after that? The model is “hidden” because there is a right answer that we can’t observe directly: one pronunciation of “lead” really is correct, we just don’t know which. And every time the Hidden Markov Model makes a decision about the meaning of a word, it affects the probability of the meanings around it.
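Here is a heavily simplified sketch of that idea. One common trick is to treat the grammatical role of each word as the hidden state; once the model has decided that “lead” is being used as a verb, the “leed” pronunciation follows. Every probability below is invented purely for illustration, and real models are trained on huge amounts of text.

```python
# A toy Hidden Markov Model: the hidden states are grammatical roles, and
# the chosen role for "lead" tells us which pronunciation to use.
# Every probability below is invented purely for illustration.

states = ["DET", "NOUN", "VERB"]

start_p = {"DET": 0.6, "NOUN": 0.2, "VERB": 0.2}

trans_p = {                       # P(next role | current role)
    "DET":  {"DET": 0.01, "NOUN": 0.89, "VERB": 0.10},
    "NOUN": {"DET": 0.30, "NOUN": 0.20, "VERB": 0.50},
    "VERB": {"DET": 0.50, "NOUN": 0.30, "VERB": 0.20},
}

emit_p = {                        # P(word | role)
    "DET":  {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.4, "lead": 0.3, "way": 0.3},
    "VERB": {"will": 0.5, "lead": 0.5},
}

def viterbi(words):
    """Return the most likely sequence of hidden roles for the words."""
    V = [{s: (start_p[s] * emit_p[s].get(words[0], 1e-6), [s]) for s in states}]
    for w in words[1:]:
        row = {}
        for s in states:
            row[s] = max(
                (V[-1][p][0] * trans_p[p][s] * emit_p[s].get(w, 1e-6),
                 V[-1][p][1] + [s])
                for p in states
            )
        V.append(row)
    return max(V[-1].values())[1]

print(viterbi(["the", "dog", "will", "lead", "the", "way"]))
# ['DET', 'NOUN', 'VERB', 'VERB', 'DET', 'NOUN'] -> "lead" is a verb here
```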
The other approach is a “neural network.” This is a computer program loosely modeled on the way the human brain works. Think of ants exploring a garden… each ant makes a trail and is followed by other ants. The more useful a trail is, the more it’s used, and the more ants follow it. In the same way, the brain makes connections between things. Useful connections get reinforced, while useless ones die away from neglect.
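As a toy illustration of “useful connections get reinforced” (and nothing like Metavoicer’s actual model), here is a single artificial neuron learning to tell the two readings of “lead” apart from a couple of made-up context features. Connections (weights) that help reduce the error get strengthened, like well-trodden ant trails; unhelpful ones fade.

```python
# A single artificial "neuron" (logistic regression) trained by gradient
# descent on a made-up task: is "lead" the metal (0) or the verb (1)?
# Features: [follows "will"/"to", follows "the", follows "made of"].
import math, random

data = [                 # tiny, invented training set
    ([1, 0, 0], 1),      # "will lead"     -> verb
    ([1, 0, 0], 1),      # "to lead"       -> verb
    ([0, 1, 0], 0),      # "the lead pipe" -> metal
    ([0, 0, 1], 0),      # "made of lead"  -> metal
]

random.seed(0)
w = [random.uniform(-0.1, 0.1) for _ in range(3)]   # connection strengths
b = 0.0
lr = 0.5                                             # learning rate

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))                    # squash to a probability

for epoch in range(200):
    for x, y in data:
        err = predict(x) - y                         # how wrong were we?
        for i in range(3):
            w[i] -= lr * err * x[i]                  # helpful connections reinforced
        b -= lr * err

print(round(predict([1, 0, 0]), 2))   # close to 1.0 -> verb
print(round(predict([0, 1, 0]), 2))   # close to 0.0 -> metal
```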
Text normalization is the first stage of TTS. The software goes through the text eliminating any ambiguity and produces a new sequence of symbols. Those symbols then need to be turned into the building blocks of sound: “phonemes”.
Again, one sound per word is possible, but would not sound natural. Real speech flows, so one approach to reproducing this is to look at “diphones”, or groups of two sounds. For example, “Metavoicer” would be broken down into seven diphones: Me, et, ta, av, voi, oic, and cer. This method makes for less jerky, robotic speech because it focuses on the way sounds change into each other, rather than treating them as separate “chunks”.
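The diphone idea itself is easy to sketch: once a word has been mapped to a list of phonemes, the synthesizer slides a two-sound window along it. The phoneme spelling of “Metavoicer” below is only a rough, illustrative guess.

```python
# Sliding a two-sound window over a word to get its diphones.
# The phoneme list for "Metavoicer" is only an illustrative guess.

def diphones(phonemes):
    """Return overlapping pairs of adjacent sounds."""
    return [phonemes[i] + phonemes[i + 1] for i in range(len(phonemes) - 1)]

print(diphones(["m", "e", "t", "a", "v", "oi", "c", "er"]))
# ['me', 'et', 'ta', 'av', 'voi', 'oic', 'cer']
```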
Once the software knows what sounds it needs to make, it can then produce the actual audio. There are three ways of tackling this, technically known as “concatenative”, “formant” and “articulatory” approaches.
“Concatenation” means joining together, so a concatenative program uses snippets of real human speech and stitches them together. This produces the most natural-sounding speech but has limitations. It can only reproduce the voices it has been given, which in practical terms means one voice (male or female) and one language. It’s often used where the voice has a limited number of things to say, such as railway station announcements.
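In spirit, a concatenative system really is that simple: look up a stored clip for each unit and join the clips end to end (real systems also smooth the joins). In the sketch below, short sine-wave tones stand in for recordings of a station announcer, and the output file name is just for the example.

```python
# A bare-bones concatenative sketch: join stored audio snippets end to end.
# Short sine-wave tones stand in for recorded human speech here.
import math, struct, wave

SR = 16000  # samples per second

def tone(freq, seconds):
    """Stand-in for a recorded snippet: a short sine tone."""
    return [0.3 * math.sin(2 * math.pi * freq * i / SR) for i in range(int(SR * seconds))]

# Imagine these are recordings of an announcer saying each phrase.
snippets = {
    "the train to": tone(220, 0.4),
    "London": tone(330, 0.4),
    "is delayed": tone(440, 0.4),
}

# Concatenation: append the snippets in the order the text needs them.
announcement = []
for phrase in ["the train to", "London", "is delayed"]:
    announcement += snippets[phrase]

with wave.open("announcement.wav", "wb") as f:   # hypothetical output file
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(SR)
    f.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in announcement))
```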
Metavoicer uses “formant” technology because of its greater flexibility. A “formant” is a resonant peak in the voice, and the pattern of formants is what tells us which vowel we’re hearing. Consider, for example, a deep male voice and a high female voice. When they pronounce an “a” sound, the pitch of the sound wave will be very different, but we hear it as the same vowel. That’s because the sounds share the same formants.
Formant technology shapes the sound electronically, just like a synthesizer shapes sound to mimic a trumpet or piano. That makes the sound easy to transform. When you go into the “Tune” menu on Metavoicer and adjust the speed and tone of your audio, you are directly changing the synthesized sound, raising or lowering the pitch, compressing or expanding time.
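Here is a crude sketch of the formant idea, far simpler than a real synthesizer: start with a buzzing source at the voice’s pitch, then let a couple of resonant filters carve out the formant peaks. Changing the pitch of the source makes the voice deeper or higher, while keeping the formant frequencies the same keeps it sounding like the same vowel. The formant values below are rough, textbook-style figures for an “ah”-like vowel, used only for illustration.

```python
# A toy formant synthesizer: an impulse train (the "buzz" of the vocal
# cords) passed through two resonators that shape the vowel's formants.
# Formant values are rough, illustrative figures for an "ah"-like vowel.
import math, struct, wave

SR = 16000

def impulse_train(pitch_hz, seconds):
    """The raw voice source: a click every 1/pitch seconds."""
    period = int(SR / pitch_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(int(SR * seconds))]

def resonator(signal, freq, bandwidth):
    """A simple two-pole resonant filter centred on one formant."""
    r = math.exp(-math.pi * bandwidth / SR)
    theta = 2 * math.pi * freq / SR
    a1, a2 = 2 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = (1 - r) * x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def vowel(pitch_hz, seconds=0.6):
    source = impulse_train(pitch_hz, seconds)
    shaped = resonator(resonator(source, 700, 110), 1100, 120)  # F1, F2
    peak = max(abs(s) for s in shaped) or 1.0
    return [0.5 * s / peak for s in shaped]

# Lower pitch sounds deeper, higher pitch sounds higher: same vowel either way.
samples = vowel(110) + vowel(220)
with wave.open("vowel.wav", "wb") as f:   # hypothetical output file
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(SR)
    f.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))
```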
The third approach to producing sound is the most difficult and is still an emerging technology. It’s called “articulatory” because it copies the way our bodies produce, or “articulate”, speech. Our voices are a highly complex set of sounds, shaped by our vocal cords and by the spaces in our head and throat as modified by teeth, tongue, and lips. Imitating each of those aspects should produce the most natural, human-sounding speech, but the complexity involved means that goal is still some way off.
Of course, you don’t need to understand what’s happening “under the hood” to use Metavoicer. Its user-friendly interface means anyone can jump straight in and start converting text to speech. But recognizing the complex processes that underlie such a seemingly simple thing as talking can help us to appreciate the wonder that is our own human voice.