Text-to-Speech Synthesis

Recall that in the Hunt and Black algorithm (Equations (16.2) and (16.3)) the total cost of one sequence of units from the database U = ? u 1 , u 2 , , u N ? with the specification S = ? s 1 , s 2 , , s N ? is given by
| (16.6) | |
This calculates the cost for just one possible sequence of units. Our goal is of course to find the single sequence of units in the database which gives the lowest cost:
| (16.7) | |
and the issue of unit-selection searching concerns how to find this one best sequence of units from all the possible sequences.
To give some idea of the scale of this, consider a typical case where we have, say, 20 words in the input sentence, which gives rise to 100 diphone items in the specification. If we have 100 units for each diphone, this means that there are 100 100 = 10 200 unique possible sequences of diphones. If we calculated Equation (16.2) for every sequence this would still take an astonishing length of time regardless of computer speed. [4] Luckily, we can employ the well-known dynamic-programming algorithm, a speciality of which is the Viterbi algorithm described in Section 15.1.5 This algorithm operates in time M 2 N, where M is the number of units and N is the length of the specification sequence. This algorithm takes linear time...