Text-to-Speech Synthesis

The purpose of the join function is to tell us how well two units will join together when concatenated. In most approaches this function returns a cost, so we usually talk about join costs. Other formulations are possible, however, including the join classifier, which returns true or false, and the join probability, which returns the probability that two units will be found in sequence.
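The three formulations differ mainly in what they return, which a small sketch can make concrete. The code below is illustrative only and is not a method given in the text: the Euclidean distance over edge feature vectors, the threshold value, and the count-based probability estimate are all assumptions, and every function and parameter name is hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def join_cost(left_edge: Sequence[float], right_edge: Sequence[float]) -> float:
    """Join cost: lower means a better join.

    Assumption: each unit edge is summarised by a feature vector, and the
    cost is the Euclidean distance between the two vectors.
    """
    if len(left_edge) != len(right_edge):
        raise ValueError("edge feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(left_edge, right_edge)))


def join_classifier(left_edge: Sequence[float],
                    right_edge: Sequence[float],
                    threshold: float = 1.0) -> bool:
    """Join classifier: True if the join is judged acceptable.

    Assumption: the decision is a simple threshold on the join cost;
    the threshold value is arbitrary.
    """
    return join_cost(left_edge, right_edge) <= threshold


def join_probability(left_label: str,
                     right_label: str,
                     pair_counts: Dict[Tuple[str, str], int],
                     left_counts: Dict[str, int]) -> float:
    """Join probability: how likely right_label is to follow left_label.

    Assumption: estimated as a relative frequency from hypothetical counts
    of observed sequences in a corpus.
    """
    total = left_counts.get(left_label, 0)
    if total == 0:
        return 0.0
    return pair_counts.get((left_label, right_label), 0) / total
```

A caller would pick one formulation and use it consistently: a cost is minimised during search, a classifier prunes candidate joins outright, and a probability can be combined with other probabilistic scores.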
Before considering the details of this, it is worth making some basic observations about concatenation in general. We discussed the issue of micro-concatenation in Section 14.7, which explained various simple techniques for joining waveforms without clicks and so on. We also use these techniques in unit selection, but now, because the variance at unit edges is considerably greater, we cannot use the concatenation method of second-generation systems, which solved this problem by using only very neutral, stable units. Instead, because the units have considerable variability, we now also have to consider the issue of macro-concatenation, often simply called the join problem.
Knowing whether two units will join together well is a complex matter. It is frequently the case, however, that we do in fact find a perfect join (that is, one that is completely inaudible). We stress this because it is often surprising to a newcomer to TTS that such manipulations can be performed so successfully. Of course, many times when we join arbitrary sections of speech the results are bad; the...