Text-to-Speech Synthesis

The acoustic-space formulation (ASF) defines the target function in a way that differs so much from the alternatives that it is often regarded as a separate approach to unit selection altogether. It offers a different solution to the problem that the specification items lack an acoustic description. The formulation uses a partial-synthesis function to synthesize an acoustic representation from the linguistic features. Once this acoustic representation has been obtained, a search finds the units that are acoustically similar to it. In general, the partial-synthesis function does not go all the way and generate an actual waveform, since doing so would amount to solving the synthesis problem itself. Rather, it produces an approximate representation, and synthesis is performed using real units that lie close to it.
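The two stages described above (partial synthesis, then a search for acoustically similar units) can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulation used by any particular system: the unit database, the feature tuples, the lookup table standing in for the partial-synthesis function, and the names `partial_synthesis` and `select_unit` are all hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
import math

# Hypothetical unit database: each unit carries a measured acoustic vector
# (for example, a few cepstral coefficients). The values are invented.
UNITS = [
    {"id": "u1", "acoustic": [1.0, 0.2, -0.3]},
    {"id": "u2", "acoustic": [0.4, 0.9, 0.1]},
    {"id": "u3", "acoustic": [-0.5, 0.3, 0.8]},
]

# Hypothetical stand-in for the partial-synthesis function: a table from a
# linguistic feature combination to an approximate acoustic target.
# A real system would estimate this mapping from data.
TARGETS = {
    ("a", "stressed"): [0.9, 0.3, -0.2],
    ("a", "unstressed"): [0.5, 0.8, 0.0],
}


def partial_synthesis(features):
    """Return an approximate acoustic vector for the features (no waveform)."""
    return TARGETS[features]


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def select_unit(features, units=UNITS):
    """Pick the real unit whose acoustic vector is closest to the target."""
    target = partial_synthesis(features)
    return min(units, key=lambda u: euclidean(u["acoustic"], target))


if __name__ == "__main__":
    print(select_unit(("a", "stressed"))["id"])
    print(select_unit(("a", "unstressed"))["id"])
```

The point of the sketch is that the approximate target is never played back. It serves only as a query point, and the output speech comes from the real unit retrieved nearest to it.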
We can also describe the ASF using the idea of perceptual spaces. The key idea of the ASF is that an existing, predefined acoustic space is taken to be the perceptual space. Thus, rather than defining an abstract perceptual space, we use one that we can measure directly from speech data. The cepstral space is often used, but in principle any space derivable from the waveform is possible. In the acoustic space, the distance between points is Euclidean, so the only remaining issue is how to place feature combinations within this space. In the case of feature combinations for which we have plenty of units, we can define a distribution for this...