Programming Itanium-based Systems: Developing High Performance Applications for Intel's New Architecture

Up to this point in this chapter, it has been largely taken as a matter of faith that most memory loads can be satisfied by the L1 (Level 1) data cache in two cycles. This is by no means the rule in practice. It is important to keep in mind not only the latencies of the various levels of cache, but also their respective sizes, as shown in Table 11.3.
| Cache level | Size | Integer latency (cycles) | FP latency (cycles) |
|---|---|---|---|
| L1 instruction | 16 KB | | |
| L1 data | 16 KB | 2 | |
| L2 | 96 KB | 6 | 9 |
| L3 | 4 MB | 22 | 24 |
| main memory | any | 176+ | 178+ |
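One way to read the table is to ask which level is the first that could hold a given working set, and what integer load latency that implies. The following is a minimal sketch of that lookup in C, using only the sizes and integer latencies listed above. The names `cache_level`, `LEVELS`, and `smallest_level_holding` are hypothetical, and the sketch assumes that the whole capacity of each level is available to the data in question, which real programs rarely achieve.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the table: capacity and integer load latency in cycles.
 * These names are illustrative only; they do not come from the text. */
struct cache_level {
    const char *name;
    size_t      size_bytes;   /* 0 means "any size" (main memory) */
    int         int_latency;  /* cycles; for main memory this is a lower bound (176+) */
};

/* Ordered from fastest and smallest to slowest and largest. */
static const struct cache_level LEVELS[] = {
    { "L1 data",     16u * 1024u,          2 },
    { "L2",          96u * 1024u,          6 },
    { "L3",          4u * 1024u * 1024u,  22 },
    { "main memory", 0,                  176 },
};

/* Return the first level whose capacity is at least working_set_bytes.
 * Simplifying assumption: the working set has the level to itself. */
const struct cache_level *smallest_level_holding(size_t working_set_bytes)
{
    size_t n = sizeof LEVELS / sizeof LEVELS[0];
    assert(n >= 1);

    /* Check every bounded level; the last entry is the unbounded fallback. */
    for (size_t i = 0; i + 1 < n; i++) {
        if (working_set_bytes <= LEVELS[i].size_bytes)
            return &LEVELS[i];
    }
    return &LEVELS[n - 1];
}
```

Under these assumptions, a 64 KB working set cannot stay in the L1 data cache, so its loads would be expected to cost at least the six-cycle L2 latency rather than two cycles.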
At any given time, only 16 KB of the highest-speed data cache memory is available. It is organized in cache lines of 32 bytes each. It is generally useful to organize data so that if you reference one value in a cache line, you are likely to reference other data in the same cache line in the near future. To the extent that data references are spread out somewhat randomly through memory, the cache can be defeated.
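The effect of data layout on cache lines can be made concrete by counting how many distinct 32-byte lines a sequence of accesses touches. The sketch below is an illustration only: the function name `lines_touched` is hypothetical, and it assumes a line-aligned array, an element size that divides the line size evenly, and accesses at increasing offsets.

```c
#include <assert.h>
#include <stddef.h>

/* L1 data cache line size given in the text. */
#define CACHE_LINE_BYTES 32u

/* Count the distinct cache lines touched when reading `count` elements of
 * `elem_bytes` bytes each, starting at offset 0 of a line-aligned array and
 * advancing `stride_elems` elements between accesses.
 * Assumption: elem_bytes divides the line size, so no element straddles
 * two lines. */
size_t lines_touched(size_t count, size_t elem_bytes, size_t stride_elems)
{
    assert(elem_bytes > 0 && CACHE_LINE_BYTES % elem_bytes == 0);
    assert(stride_elems > 0);

    size_t lines = 0;
    size_t last_line = (size_t)-1;  /* sentinel: no line seen yet */

    for (size_t i = 0; i < count; i++) {
        size_t byte_offset = i * stride_elems * elem_bytes;
        size_t line = byte_offset / CACHE_LINE_BYTES;

        /* Offsets only increase, so a changed index means a new line. */
        if (line != last_line) {
            lines++;
            last_line = line;
        }
    }
    return lines;
}
```

With 8-byte elements, reading 1024 consecutive values touches 256 lines (four values per line), whereas reading 1024 values spaced four elements apart touches 1024 lines, one per value. The second pattern brings in four times as many lines for the same amount of useful data.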
Speculative loads usually allow the compiler to request data far enough in advance of need to mask even the two-cycle latency of the L1 cache. Similarly, in frequently executed loops using register rotation, even slow nine-cycle floating-point loads (the fastest available) can be masked so that they don't slow down computation.
When the compiler senses (as...