Text-to-Speech Synthesis

We now turn to the problem of how to convert the discrete, linguistic, word-based representation generated by the text-analysis system into a continuous acoustic waveform. One of the primary difficulties in this task stems from the fact that the two representations are so different in nature. The linguistic description is discrete, compact and minimal, and is the same for every speaker of a given accent. By contrast, the acoustic waveform is continuous, is massively redundant, and varies considerably even between utterances with the same pronunciation from the same speaker. To manage the complexity of this transformation, we break the problem down into a number of components. The first of these components, pronunciation, is the subject of this chapter. While specifics vary, this can be thought of as a system that takes the word-based linguistic representation and generates a phonemic or phonetic description of what is to be spoken by the subsequent waveform-synthesis component. In generating this representation, we make use of a lexicon, to find the pronunciations of words we know and can store, and a grapheme-to-phoneme [1] (G2P) algorithm, to guess the pronunciations of words we don't know or can't store. After doing this we may find that simply concatenating the pronunciations of the individual words is not enough; words interact in a number of ways, and so a certain amount of post-lexical processing is required. Finally, there is considerable choice in terms of how exactly we should specify the pronunciations for...
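
The division of labour described above (lexicon lookup, a G2P fallback for unknown words, then post-lexical processing) can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: the toy lexicon, the phone symbols, the one-letter-to-one-phone fallback table, and the single post-lexical rule (the English article "the" taking a full vowel before a vowel-initial word). A real pronunciation component would use a large lexicon and a far richer G2P model.

```python
from typing import Dict, List

# Hypothetical toy lexicon: word -> list of phone symbols (illustrative only).
LEXICON: Dict[str, List[str]] = {
    "the": ["dh", "ax"],
    "cat": ["k", "ae", "t"],
    "sat": ["s", "ae", "t"],
}

# Hypothetical fallback table mapping each letter to a single phone.
# This stands in for a G2P algorithm; it is deliberately naive.
LETTER_TO_PHONE: Dict[str, str] = {
    "a": "ae", "b": "b", "c": "k", "d": "d", "e": "eh", "f": "f",
    "g": "g", "h": "hh", "i": "ih", "j": "jh", "k": "k", "l": "l",
    "m": "m", "n": "n", "o": "aa", "p": "p", "q": "k", "r": "r",
    "s": "s", "t": "t", "u": "ah", "v": "v", "w": "w", "x": "k",
    "y": "y", "z": "z",
}

# Phones treated as vowels by the illustrative post-lexical rule.
VOWELS = {"ae", "eh", "ih", "aa", "ah", "ax", "iy"}


def naive_g2p(word: str) -> List[str]:
    """Guess a pronunciation letter by letter, skipping unknown characters."""
    return [LETTER_TO_PHONE[ch] for ch in word.lower() if ch in LETTER_TO_PHONE]


def pronounce_word(word: str) -> List[str]:
    """Use the lexicon when the word is stored; otherwise fall back to G2P."""
    key = word.lower()
    if key in LEXICON:
        return list(LEXICON[key])  # copy so callers cannot mutate the lexicon
    return naive_g2p(key)


def post_lexical(prons: List[List[str]]) -> List[List[str]]:
    """Apply one illustrative cross-word rule: 'the' before a vowel."""
    out = [list(p) for p in prons]
    for i in range(len(out) - 1):
        following = out[i + 1]
        if out[i] == ["dh", "ax"] and following and following[0] in VOWELS:
            out[i] = ["dh", "iy"]
    return out


def pronounce_utterance(words: List[str]) -> List[List[str]]:
    """Word-level pronunciation followed by post-lexical processing."""
    return post_lexical([pronounce_word(w) for w in words])


if __name__ == "__main__":
    print(pronounce_utterance(["the", "cat", "sat"]))
    print(pronounce_utterance(["the", "ant"]))
```

In this sketch, "ant" is not in the toy lexicon, so its pronunciation comes from the fallback, and the post-lexical step then changes the preceding "the" because the next word begins with a vowel. Keeping the three stages as separate functions mirrors the decomposition in the text and lets each be replaced independently.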