Text-to-Speech Synthesis

In this chapter we introduce the three main synthesis techniques which dominated the field up until the late 1980s, collectively known as first-generation techniques. Even though these techniques are used less today, it is still useful to discuss them because, apart from simple historical interest, they give us an understanding of why today's systems are configured the way they are. As an example, we need to know why today's dominant technique of unit selection is used rather than the more basic approach, which would be to generate speech waveforms from scratch. Furthermore, modern techniques have been made possible only by vast increases in processing power and memory, so in fact, for applications that require small footprints and low processing cost, the techniques explained here remain quite competitive.
First-generation techniques usually require a fairly detailed, low-level description of what is to be spoken. For purposes of explanation, we will take this to be a phonetic representation for the verbal component, together with a duration for each phone and an F0 contour for the whole sentence. The phones will have been generated by a combination of lexical lookup, G2T rules and post-lexical processing (see Chapter 8), while the timing and F0 contour will have been generated by a classical prosody algorithm of the type described in Chapter 9. It is often convenient to place this information in a new structure called a synthesis specification. Hence the specification is the input to the synthesiser and...
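To make the idea of a synthesis specification concrete, the following is a minimal sketch in Python of one way such a structure might be laid out. It is an illustration only: the names `PhoneSpec`, `SynthesisSpecification`, `phone_boundaries` and `f0_at` are hypothetical, the F0 contour is assumed to be a list of (time, frequency) points joined by straight lines, and the phone labels and numerical values are invented for the example rather than taken from any real system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PhoneSpec:
    """One phone of the verbal component with its duration (hypothetical layout)."""
    phone: str        # phone label, e.g. "ax"
    duration: float   # duration in seconds


@dataclass
class SynthesisSpecification:
    """Input to the synthesiser: a phone sequence plus a sentence-level F0 contour."""
    phones: List[PhoneSpec]
    # Assumed representation: (time in seconds, F0 in Hz) points with
    # strictly increasing times, linearly interpolated between points.
    f0_contour: List[Tuple[float, float]]

    def total_duration(self) -> float:
        """Length of the utterance in seconds."""
        return sum(p.duration for p in self.phones)

    def phone_boundaries(self) -> List[Tuple[str, float, float]]:
        """Return (phone, start time, end time) for each phone in order."""
        boundaries = []
        start = 0.0
        for p in self.phones:
            end = start + p.duration
            boundaries.append((p.phone, start, end))
            start = end
        return boundaries

    def f0_at(self, t: float) -> float:
        """F0 in Hz at time t, by linear interpolation between contour points."""
        pts = self.f0_contour
        if not pts:
            raise ValueError("empty F0 contour")
        if t <= pts[0][0]:
            return pts[0][1]
        for (t0, f0), (t1, f1) in zip(pts, pts[1:]):
            if t <= t1:
                return f0 + (f1 - f0) * (t - t0) / (t1 - t0)
        # Past the last point: hold the final value.
        return pts[-1][1]


# Invented example values, for illustration only.
spec = SynthesisSpecification(
    phones=[PhoneSpec("h", 0.06), PhoneSpec("ax", 0.05),
            PhoneSpec("l", 0.07), PhoneSpec("ow", 0.12)],
    f0_contour=[(0.0, 120.0), (0.3, 100.0)],
)
for phone, start, end in spec.phone_boundaries():
    print(f"{phone}: {start:.2f}-{end:.2f} s, F0 at start {spec.f0_at(start):.1f} Hz")
```

Keeping the durations with the phones and the F0 contour at the sentence level mirrors the description above, in which timing is given per phone while the contour spans the whole sentence. Other layouts, such as storing an F0 target per phone, would serve equally well for the purposes of explanation.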