Text-to-Speech Synthesis

We saw in Chapter 13 that, while vocal-tract methods can often generate intelligible speech, they seem fundamentally limited in their ability to generate natural-sounding speech. In the case of formant synthesis, the main limitation lies not so much in generating the speech from the parametric representation as in generating these parameters from the input specification created by the text-analysis process. The mapping between the specification and the parameters is highly complex, and seems beyond what we can express in explicit human-derived rules, no matter how expert the rule designer. We face the same problems with articulatory synthesis, and in addition have to deal with the facts that acquiring data is fundamentally difficult and that improving naturalness often necessitates a considerable increase in the complexity of the synthesiser.
A partial solution to the complexities of specification-to-parameter mapping is found in the classical LP technique, whereby we bypassed the issue of generating the vocal-tract parameters explicitly and instead measured them from data. The source parameters, however, were still specified by an explicit model, which was identified as the main source of the unnaturalness.
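To make the division of labour in the classical LP technique concrete, the following is a minimal sketch rather than a definitive implementation. It assumes the autocorrelation method with the Levinson-Durbin recursion for measuring the vocal-tract (all-pole filter) coefficients from a frame of data, and a bare impulse train as the explicit source model. The function names, the filter order and the pitch period are illustrative assumptions, not taken from the text.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Measure LP coefficients from data (autocorrelation method).

    Returns a = [1, a1, ..., ap] for A(z) = 1 + sum_k a_k z^-k,
    together with the residual (prediction error) energy.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation values r[0..order] of the frame.
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    # Levinson-Durbin recursion: grow the predictor one order at a time.
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def impulse_train(length, period):
    """Explicit voiced-source model: one unit impulse every `period` samples."""
    e = np.zeros(length)
    e[::period] = 1.0
    return e

def all_pole_filter(excitation, a):
    """Pass the source through the all-pole filter 1/A(z)."""
    p = len(a) - 1
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, min(p, n) + 1):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y

if __name__ == "__main__":
    # Stand-in for a recorded speech frame: noise shaped by a known resonance.
    rng = np.random.default_rng(0)
    frame = all_pole_filter(rng.standard_normal(2000), np.array([1.0, -1.3, 0.8]))
    a, err = lpc_autocorrelation(frame, order=2)     # filter measured from data
    resynth = all_pole_filter(impulse_train(400, period=80), a)  # explicit source
    print("estimated coefficients:", a)
```

In this sketch only the filter coefficients come from data; the excitation is still the simple impulse model, which is the part of the classical technique identified above as the main source of unnaturalness.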
In this chapter we introduce a set of techniques that attempt to get around these limitations. In a way, these can be viewed as extensions of the classical LP technique in that they use a data-driven approach: the increase in quality, however, largely arises from the abandonment of the over-simplistic impulse/noise source model. These techniques are often collectively called second-generation synthesis systems, in...