The Software Optimization Cookbook: High-Performance Recipes for IA-32 Platforms, Second Edition

Loops are the most common source of hotspots purely because of their repetitive nature; do anything enough times and it becomes a hotspot. A loop in itself is not necessarily a bottleneck and can actually improve performance in a few ways. First, loops reduce the number of stored instructions. Programs with fewer instructions require less memory, so less time is spent waiting for instructions to be fetched from main memory. Second, the Intel Pentium 4 processor caches decoded instructions, so when the same instruction is executed a second time, the decode time is saved. The Intel Pentium M processor implements a comparable optimization that prevents re-fetching or re-decoding instructions in small loops. On the downside, loops also add some overhead. For example, the following code adds four integers together.
sum = 0;
for (i=0; i<4; i++) {
    sum = sum + array[i];
}
These same four integers can be added together in a single assignment statement without using a loop.
sum = array[0] + array[1] + array[2] + array[3];
The loop version executes four additions, four increments, and several conditional branches, including one that is typically mispredicted when the loop exits. In contrast, the single assignment statement executes only three additions. Hence, for this example, the loop version typically runs slower than the single assignment statement. But if the array had 10,000 elements instead of four, the loop version would outperform the fully expanded assignment statement.
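To make the comparison concrete, the two forms can be wrapped in functions and checked against each other. This is a minimal sketch rather than code from the book: the function names sum_loop and sum_four_expanded and the count parameter are assumptions introduced for illustration, and the per-element costs in the comments describe the source as written, since an optimizing compiler may transform either version.

```c
#include <stddef.h>

/* Loop version (hypothetical helper name): as written, each element costs
 * one addition, one increment of i, and one compare-and-branch. The code
 * size stays the same whether count is 4 or 10,000. */
int sum_loop(const int *array, size_t count)
{
    int sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum = sum + array[i];
    }
    return sum;
}

/* Fully expanded version (hypothetical helper name) for exactly four
 * elements: three additions and no loop branches. Covering 10,000
 * elements this way would need 10,000 terms in the expression. */
int sum_four_expanded(const int *array)
{
    return array[0] + array[1] + array[2] + array[3];
}
```

Both functions return the same result for a four-element array. The difference lies in how much loop overhead is executed and how many instructions must be stored and fetched, which is the trade-off described above.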
So knowing when a loop should...