Programming Itanium-based Systems: Developing High Performance Applications for Intel's New Architecture

To create or verify optimal assembly language code on the Itanium processor, it is important to understand the role of instruction latencies. The latency of an instruction is the time that must elapse between the issue of the instruction and the point at which its result can be used by another instruction. For most simple integer math operations, like add r32=r33,r34, the latency is a single cycle, so the results of many operations can be used in the very next set of parallel instructions. This is not generally true, though, for floating-point operations or loads from memory, and there is also an interesting exception in the case of integer compare operations.
As previously described, up to six instructions can dispatch in parallel on the Itanium processor, but if any source operand of any of those six instructions has not completed its latency waiting period, all six instructions are held up until the wait has completed. It is therefore very important, in well-planned code, to organize instructions so that they won't have to wait for source registers to be ready.
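The stall behavior described above can be illustrated with a small model. The following is a minimal sketch, not a description of the actual hardware: the function name `schedule`, the tuple layout, and the register choices are all hypothetical, and the model ignores the six-instruction limit, execution-unit assignment, and every other pipeline detail. It assumes only what the text states: a group of parallel instructions issues as a unit and waits until all of its source registers are ready. The latencies used are the 1 cycle quoted for a simple integer add and the 5 cycles listed for fma in Table 11.1.

```python
# Simplified stall model (an illustration, not the real Itanium pipeline):
# a group issues as a unit and waits for its slowest source operand.

LATENCY = {"add": 1, "fma": 5}  # cycles until a result can be used


def schedule(groups, latency):
    """Return the cycle at which each instruction group issues.

    groups  : list of groups; each group is a list of (op, dest, sources)
    latency : dict mapping an op name to its latency in cycles
    """
    ready = {}         # register -> first cycle at which it can be read
    issue_cycles = []
    cycle = 0
    for group in groups:
        # The whole group is held until every source register is ready.
        for _op, _dest, sources in group:
            for src in sources:
                cycle = max(cycle, ready.get(src, 0))
        issue_cycles.append(cycle)
        # Record when each result written by this group becomes usable.
        for op, dest, _sources in group:
            ready[dest] = cycle + latency[op]
        cycle += 1     # the next group issues no earlier than the next cycle
    return issue_cycles


# The integer add shares a group with an fma that needs f8, so it is held up.
dependent = [
    [("fma", "f8", ["f2", "f3", "f4"])],
    [("fma", "f9", ["f8", "f3", "f4"]), ("add", "r32", ["r33", "r34"])],
    [("add", "r35", ["r32", "r33"])],
]

# The same four instructions, with the integer work moved ahead of the wait.
reordered = [
    [("fma", "f8", ["f2", "f3", "f4"]), ("add", "r32", ["r33", "r34"])],
    [("add", "r35", ["r32", "r33"])],
    [("fma", "f9", ["f8", "f3", "f4"])],
]

print(schedule(dependent, LATENCY))  # [0, 5, 6]
print(schedule(reordered, LATENCY))  # [0, 1, 5]
```

In this model the first ordering issues its last group at cycle 6, while the reordered version issues its last group at cycle 5, because the independent integer adds fill cycles that would otherwise be spent waiting for the fma result.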
When the result of an operation is ready to be used on the very next cycle, it is said to exhibit 1-cycle latency. In similar terms, Table 11.1 shows the latencies of some of the more important assembly language instructions.
| Floating Point | Cycles |
|---|---|
| multiply-and-add (fma) | 5 |
| convert integer... |