Text-to-Speech Synthesis

We saw in Chapter 13 that, despite the approximations in all the vocal-tract models concerned, the limiting factor in generating high-quality speech lies not so much in converting the parameters into speech as in knowing which parameters to use for a given synthesis specification. Determining these by hand-written rules can produce fairly intelligible speech, but the inherent complexities of speech seem to place an upper limit on the quality that can be achieved in this way. The various second-generation synthesis techniques explained in Chapter 14 address this problem by simply measuring the parameter values from real speech waveforms. Although this is successful to a certain extent, it is not a perfect solution. As we will see in Chapter 16, we can never collect enough data to cover all the effects we wish to synthesize, and the coverage we do have in the database is often very uneven. Furthermore, the concatenative approach always limits us to recreating what we have recorded; in a sense, all we are doing is reordering the original data.
An alternative is to use statistical, machine-learning techniques to infer the specification-to-parameter mapping from data. While both this and the concatenative approach can be described as data-driven, in the concatenative approach we are effectively memorising the data, whereas in the statistical approach we are attempting to learn the general properties of the data. Two advantages arise from statistical models: firstly, we require orders of magnitude less memory to store the parameters of the...