Embedded Media Processing

To attain maximum performance, an embedded processor should have independent bus structures to fetch data and instructions concurrently. This feature is a fundamental part of what's known as a Harvard architecture. Nomenclature aside, it doesn't take a Harvard grad to see that, without independent bus structures, every instruction or data fetch would be in the critical path of execution. Moreover, instructions would need to be fetched in small increments (most likely one at a time), because each data access the processor makes would need to utilize the same bus. In the end, performance would be horrible.
With separate buses for instructions and data, on the other hand, the processor can continue to fetch instructions while data is accessed simultaneously, saving valuable cycles. In addition, with separate buses a processor can pipeline its operations. This leads to increased performance (higher attainable core-clock speeds) because the processor can initiate future operations before it has finished its currently executing instructions.
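The cycle savings described above can be sketched with a toy model. It rests on simplifying assumptions that are not from the text: each bus completes one transfer per cycle, memory responds in a single cycle, and instruction and data traffic overlap perfectly when the buses are separate. The function names and the workload numbers are hypothetical, chosen only for illustration.

```python
def unified_bus_cycles(instructions: int, data_accesses: int) -> int:
    """Shared bus: every instruction fetch and every data access
    needs its own bus cycle, so the two kinds of traffic add up."""
    return instructions + data_accesses


def separate_bus_cycles(instructions: int, data_accesses: int) -> int:
    """Separate instruction and data buses: the two streams overlap,
    so the busier bus alone determines the cycle count (idealized)."""
    return max(instructions, data_accesses)


if __name__ == "__main__":
    n_instr, n_data = 1000, 600  # hypothetical workload, not from the text
    shared = unified_bus_cycles(n_instr, n_data)
    split = separate_bus_cycles(n_instr, n_data)
    print(f"shared bus:     {shared} cycles")
    print(f"separate buses: {split} cycles")
    print(f"speedup:        {shared / split:.2f}x")
```

A real processor falls somewhere between the two figures, since cache misses, bus arbitration, and pipeline stalls all erode the ideal overlap.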
So it's easy to understand why today's high-performance devices have more than one bus each for data and instructions. In Blackfin processors, for instance, each core can fetch up to 64 bits of instructions and two 32-bit data words in a single core-clock (CCLK) cycle. Alternatively, it can fetch 64 bits of instructions and one data word in the same cycle as it writes a data word.
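To put those per-cycle figures in perspective, the short calculation below converts them into a peak core-side transfer rate. The bus widths come from the text. The 600 MHz CCLK value is an assumption used purely for illustration, and the function names are hypothetical. The result is a theoretical ceiling, not a sustained rate.

```python
INSTR_BITS_PER_CYCLE = 64   # from the text: up to 64 bits of instructions per CCLK
DATA_WORD_BITS = 32         # from the text: 32-bit data words
DATA_WORDS_PER_CYCLE = 2    # from the text: two fetches, or one fetch plus one write


def peak_bytes_per_cycle() -> int:
    """Instruction bits plus data bits moved in one CCLK cycle, in bytes."""
    total_bits = INSTR_BITS_PER_CYCLE + DATA_WORDS_PER_CYCLE * DATA_WORD_BITS
    return total_bits // 8


def peak_bytes_per_second(cclk_hz: int) -> int:
    """Theoretical ceiling: per-cycle transfer size times the core clock."""
    return peak_bytes_per_cycle() * cclk_hz


if __name__ == "__main__":
    cclk_hz = 600_000_000  # assumed core clock, for illustration only
    print(f"{peak_bytes_per_cycle()} bytes per CCLK cycle")
    print(f"{peak_bytes_per_second(cclk_hz) / 1e9:.1f} GB/s peak at {cclk_hz / 1e6:.0f} MHz")
```

The same total applies to the fetch-plus-write case, because one 32-bit read and one 32-bit write move the same number of data bits per cycle as two reads.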
There are many excellent references on the Harvard architecture, such as Reference 6 in the Appendix. However, because it is a straightforward...