Text-to-Speech Synthesis

The first feature we will examine is the base type, that is, the type of unit we will use in the synthesiser. The base type chosen in second-generation systems was often the diphone, since diphones often produced good joins. In unit selection, the greater variability in the units means that we cannot always rely on diphones joining well, so the reasons for using diphones are somewhat less convincing. Indeed, a survey of the literature shows that almost every possible kind of base type has been used. In the following list we describe each type by its most common name [2], cite some systems that use this base type, and indicate how many distinct units of each type there are, assuming that we have N unique phones and M unique syllables in our pronunciation system.
frames Individual frames of speech, which can be combined in any order [204].
states Parts of phones, often determined by the alignment of HMM states [138, 140].
half-phones These are units that are half the size of a phone. Thus, they are either units that extend from the phone boundary to a mid point (which can be defined in a number of ways), or units that extend from this mid point to the end of the phone. There are 2N different half-phone types [315].
diphones These units extend from the mid point of one...