Text-to-Speech Synthesis

At the time of writing, unit selection is judged the highest-quality synthesis technique. Its intelligibility is comparable to that of other techniques, sometimes better and sometimes worse. Its naturalness is generally considered much better, and this is why the technique wins overall. Unit selection is not perfect, however, and a frequent criticism is that its quality can be inconsistent. This is to a degree inherent in the technique: occasionally the system outputs completely unmodified, originally contiguous sections of speech, which of course have the quality of pure recorded waveforms. On the other hand, sometimes there simply aren't any units that match the specification well or join well. In these cases the synthesis will clearly sound worse than the stretches of contiguous speech.
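
The source of this inconsistency can be illustrated with a toy search. The following is a minimal sketch, not the algorithm of any particular system: the `Unit` class, the single `pitch` feature, the cost functions and the tiny database are all assumptions made for illustration. It assumes that units which were adjacent in the recording join at zero cost, so a specification that happens to match a contiguous stretch of the database costs nothing, while one that does not must pay for a join.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    index: int    # position of the unit in the recorded database
    label: str    # unit type, e.g. a phone name
    pitch: float  # a single stand-in acoustic feature (hypothetical)


def target_cost(unit, wanted_pitch):
    # How far the candidate is from the specification.
    return abs(unit.pitch - wanted_pitch)


def join_cost(left, right):
    # Assumption: units that were adjacent in the recording join perfectly.
    if right.index == left.index + 1:
        return 0.0
    return 1.0 + abs(left.pitch - right.pitch)


def select_units(database, spec):
    """Viterbi-style search. spec is a list of (label, wanted_pitch) pairs.

    Returns (total_cost, chosen_units) for the cheapest sequence.
    """
    best = None  # maps each candidate unit to (cost so far, path so far)
    for label, wanted in spec:
        candidates = [u for u in database if u.label == label]
        if not candidates:
            raise ValueError(f"no unit of type {label!r} in the database")
        new_best = {}
        for unit in candidates:
            tc = target_cost(unit, wanted)
            if best is None:
                new_best[unit] = (tc, [unit])
            else:
                # Cheapest way to reach this unit from any previous candidate.
                prev_cost, prev_path = min(
                    ((cost + join_cost(prev, unit), path)
                     for prev, (cost, path) in best.items()),
                    key=lambda item: item[0],
                )
                new_best[unit] = (prev_cost + tc, prev_path + [unit])
        best = new_best
    return min(best.values(), key=lambda item: item[0])


# A tiny, made-up database: indices give the original recording order.
DATABASE = [
    Unit(0, "h", 100.0), Unit(1, "e", 110.0), Unit(2, "l", 120.0),
    Unit(3, "o", 105.0), Unit(4, "l", 180.0), Unit(5, "e", 90.0),
    Unit(6, "h", 150.0),
]

# Matches a contiguous stretch of the recording.
contiguous = select_units(DATABASE, [("h", 100.0), ("e", 110.0), ("l", 120.0)])
# No "o" is followed by "h" in the recording, so a join is unavoidable.
spliced = select_units(DATABASE, [("o", 105.0), ("h", 100.0), ("e", 110.0)])

print(contiguous[0], [u.index for u in contiguous[1]])
print(spliced[0], [u.index for u in spliced[1]])
```

Under these assumptions the first request is served entirely by originally contiguous units, while the second is forced across a non-contiguous join even though every unit matches its target exactly. Real systems use far richer costs, but the same effect accounts for the uneven quality described above.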
It is vital to note that the quality of the final speech in unit selection depends heavily on the quality of the database used, much more so than with other techniques. This makes it quite difficult to assess individual algorithms unless they use the same data. For example, it is very difficult to conclude that linguistic join costs are really better than acoustic join costs just because a system that uses the former sounds better than one that uses the latter. Such conclusions can be drawn only when the systems are trained and tested on the same data. This is not to say, however, that...