|
Architectural requirements for a DSP processor
The best way to understand these requirements is to examine typical DSP
algorithms and identify how their computational requirements have
influenced the architectures of DSP processors. Let us consider one of the
most common processing tasks: the finite impulse response (FIR) filter.
For each tap of the filter, a data sample is multiplied
by a filter coefficient, and the result is added to a running sum over all of
the taps. Hence the main component of the FIR filter is a dot product: multiply
and add. These operations are not unique to the FIR filter algorithm; in fact,
multiplication is one of the most common operations performed in signal
processing. Convolution, IIR filtering and the Fourier transform also involve
heavy use of the multiply-accumulate operation. Originally, microprocessors
implemented multiplication as a series of shift and add operations, each of
which consumed one or more clock cycles. The first requirement of a DSP
processor is therefore hardware that can multiply in a single cycle. Most DSP
algorithms require a multiply-and-accumulate (MAC) unit.
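
The two ideas above can be sketched in C. This is an illustration only, not the code of any particular processor: the function names fir_output and shift_add_multiply, the 16-bit sample width and the 32-bit accumulator are all assumptions made for the example. The first function computes one FIR output as a dot product, one multiply-accumulate per tap. The second shows shift-and-add multiplication and counts its loop iterations, to suggest why a processor without a hardware multiplier spends many cycles on a single product.

```c
#include <stddef.h>
#include <stdint.h>

/* One output sample of an ntaps-tap FIR filter (hypothetical helper).
 * h[k] is the k-th coefficient and x[k] is the input sample delayed by
 * k, so the result is the dot product of the two arrays. */
int32_t fir_output(const int16_t *h, const int16_t *x, size_t ntaps)
{
    int32_t acc = 0;                    /* running sum over all taps */
    for (size_t k = 0; k < ntaps; k++) {
        acc += (int32_t)h[k] * x[k];    /* one MAC: multiply, then add */
    }
    return acc;
}

/* Shift-and-add multiplication of two unsigned 16-bit values
 * (hypothetical helper). *steps receives the number of loop
 * iterations, each of which would cost at least one cycle on a
 * processor that has no hardware multiplier. */
uint32_t shift_add_multiply(uint16_t a, uint16_t b, unsigned *steps)
{
    uint32_t product = 0;
    uint32_t addend = a;                /* a shifted left by the bit index */
    *steps = 0;
    while (b != 0) {
        if (b & 1u) {
            product += addend;          /* add when the current bit is set */
        }
        addend <<= 1;                   /* move to the next bit weight */
        b >>= 1;
        (*steps)++;
    }
    return product;
}
```

With a single-cycle MAC unit, each iteration of the loop in fir_output maps onto one hardware operation, whereas every product computed by shift_add_multiply needs one loop iteration per significant bit of b.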
Compared with other types of computing tasks, DSP
applications typically have very high computational requirements, since they
often must execute DSP algorithms in real time on lengthy signal segments.
Parallel operation of several independent execution units is therefore a
must; for example, in addition to the MAC unit, an ALU and a shifter are also
required.
Executing a MAC in every clock cycle requires more than
just a single-cycle MAC unit. It also requires the ability to fetch the MAC
instruction, a data sample, and a filter coefficient from memory in a
single cycle. Hence good DSP performance requires high memory bandwidth,
higher than that of general-purpose microprocessors, which had a single
bus connection to memory and could make only one access per cycle. The
most common approach was to use two or more separate banks of memory, each
of which was accessed by its own bus and could be written or read in a
single cycle. This means programs are stored in one memory and data in
another. With this arrangement, the processor could fetch an instruction and
a data operand in parallel in every cycle. Since many DSP algorithms consume
two data operands per instruction, a further optimization commonly used is to
include a small bank of RAM near the processor core that is used as an
instruction cache. When a small group of instructions is executed
repeatedly, the cache is loaded with those instructions, freeing the
instruction bus to be used for data fetches instead of instruction fetches,
thus enabling the processor to execute a MAC in a single cycle.
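
The memory bandwidth argument can be checked with a very rough counting model, sketched below in C. The model is an assumption made for illustration and does not describe any specific device: it supposes that each memory bus completes one access per cycle, that the accesses needed by one MAC can be spread evenly over the available buses, and that an instruction held in the cache needs no bus access. The names mem_config and cycles_per_mac are hypothetical.

```c
/* Hypothetical description of a processor's memory system. */
typedef struct {
    unsigned buses;         /* independent memory buses; must be >= 1 */
    int      instr_cached;  /* nonzero if the MAC instruction is served
                               from the on-core instruction cache */
} mem_config;

/* Cycles needed to supply one MAC with its instruction and operands,
 * under the simplified model described above. */
unsigned cycles_per_mac(mem_config cfg)
{
    unsigned accesses = 2;              /* data sample + coefficient */
    if (!cfg.instr_cached) {
        accesses += 1;                  /* instruction fetch uses a bus */
    }
    /* Each bus handles one access per cycle: round up the quotient. */
    return (accesses + cfg.buses - 1) / cfg.buses;
}
```

Under these assumptions a single bus gives three cycles per MAC, two buses without a cache give two, and two buses with the loop held in the instruction cache give one, which matches the progression described in the text.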
|