Scientific Computing on Itanium-Based Systems

The elementary transcendental functions are basic building blocks in many branches of science and engineering. The previous chapter discussed implementation methods for some of these functions. Those implementations are scalar: they are designed for short latency when a function is evaluated at a single input argument. In some situations, however, certain elementary transcendental functions must be evaluated a large number of times. This need may arise in data-parallel situations, or simply from rearranging a computation flow so that all the necessary function calculations are gathered in one place. On a parallel architecture such as the Itanium architecture, which also has special support for software pipelining, one naturally asks whether special algorithms or implementations can deliver a vectorized calculation of these functions at a throughput much greater than that of the scalar versions, which are optimized for latency.
Software pipelining elementary transcendental function calculations is significantly more challenging than software pipelining the linear algebra kernels discussed earlier in this chapter. There are two main reasons. First, the complete calculation for a normal argument is rather long and irregular (compare, for example, the repetition of fma in the DGEMM case). Second, exceptional arguments must be accommodated.
The Vector Math Library (VML) provided in MKL [63] implements vectorized versions of many of the transcendental functions. The design methodology adopted there can be summarized as follows. For the computation of a function at normal arguments:
Choose an algorithm that...