Text-to-Speech Synthesis

This chapter covers a number of final topics, which have been left until last because they cut across many of the issues raised in the previous chapters.
Data-driven techniques have come to dominate nearly every aspect of text-to-speech in recent years. In addition to being affected by the algorithms themselves, the overall performance of a system is increasingly dominated by the quality of the databases that are used for training. In this section, we therefore examine the issues in database design, collection, labelling and use.
All algorithms are to some extent data-driven; even hand-written rules use some data, either explicitly or in a mental representation wherein the developer can imagine examples and how they should be dealt with. The difference between hand-written rules and data-driven techniques lies not in whether data are used, but in how they are used. Most data-driven techniques have an automatic training algorithm, so that they can be trained on the data without the need for human intervention.
Unit selection is arguably the most data-driven technique because little or no processing is performed on the data; rather, the recordings are simply analysed, cut up and recombined in different sequences. As with other database techniques, the issue of coverage is vital, but in addition we have further issues concerning the actual recordings.
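To make the notion of coverage concrete, the following is a minimal sketch of how one might measure it for a candidate database, assuming that the units of interest are diphones (adjacent phone pairs) and that each utterance is available as a list of phone labels. The function names, the toy phone inventory and the assumption that every phone pair is a legal diphone are illustrative only; a real language rules out some pairs, and a real system may use different unit types.

```python
from itertools import product


def diphones(phones):
    """Return the adjacent phone pairs (diphones) in one utterance."""
    return list(zip(phones, phones[1:]))


def diphone_coverage(utterances, inventory):
    """Return (coverage, missing) for a set of phone-labelled utterances.

    coverage is the fraction of possible diphone types seen at least once;
    missing is the set of diphone types that never occur.
    Assumption: every ordered pair of phones in the inventory is possible.
    """
    possible = set(product(inventory, repeat=2))
    seen = set()
    for phones in utterances:
        seen.update(diphones(phones))
    seen &= possible  # ignore pairs that use phones outside the inventory
    return len(seen) / len(possible), possible - seen


# Toy data (hypothetical): three phones, two short utterances.
inventory = ["a", "b", "k"]
utterances = [["k", "a", "b"], ["b", "a", "k", "a"]]
coverage, missing = diphone_coverage(utterances, inventory)
print(f"coverage = {coverage:.2f}, missing types = {len(missing)}")
```

A measure of this kind only counts unit types. It says nothing about how many examples of each type exist, or about the prosodic and phonetic contexts in which they occur, both of which also matter for unit selection.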
There is no firm agreement on how big a unit-selection system needs to be, but it is clear that, all other things being equal, the larger the better. As...