Blackfin Processor Core Architecture 2/3

The Blackfin Core

The Blackfin core consists of several units. There is an arithmetic unit, which lets us perform SIMD operations, where a single instruction operates on multiple data streams. The Blackfin is also a load/store architecture, which means that the data we work with must first be read from memory; we then calculate the results and store them back into memory. There is also an addressing unit that supports dual data fetch, so in a single cycle we can fetch two pieces of information, typically data and filter coefficients. A sequencer deals with program flow, and there are several register files, for instance for data and for addressing.

The Arithmetic Unit
The first stop is the arithmetic unit, which performs the arithmetic operations in the Blackfin. It has dual 40-bit ALUs, which perform 16-, 32- or 40-bit arithmetic and logical operations. We also have a pair of 16 x 16-bit multipliers, so we can do up to two 16-bit multiplies at the same time, and when these are combined with the 40-bit accumulators we can do dual MACs (multiply-accumulates) in the same cycle. There is also a barrel shifter, which is used to perform shifts, rotates and other bit operations.

Data Registers
The data registers are broken up into sections. What we have here is the register file, which consists of eight 32-bit registers, R0 to R7. They can be used to hold 32-bit values. Of course, when we are doing DSP operations a lot of our data is in 16-bit form, so they can hold pairs of 16-bit values as well. R0.H and R0.L, for instance, are the upper and lower halves of the R0 register, which together can hold two 16-bit values. We also have a pair of 40-bit accumulators, A0 and A1, which are typically used with multiply-accumulate operations to provide extended-precision storage for the intermediate products.

16-Bit ALU Operations
The graphic shows the algebraic syntax of the assembly language. The syntax is quite intuitive and makes it easy to see what an instruction is doing. The first example is a single 16-bit operation, R6.H = R3.H + R2.L; it takes the upper half of R3, adds it to the lower half of R2 and places the result in the upper half of R6. The next example is a dual 16-bit operation, R6 = R2 +|- R3. R2 and R3 are written as full 32-bit registers, but the +|- operator between them says that this is a dual 16-bit operation. The lower half of R3 is subtracted from the lower half of R2, with the result going to the lower half of R6. In addition, the upper half of R3 is added to the upper half of R2 and the result is placed in the upper half of R6. This is effectively a single-cycle operation. The last example is a quad 16-bit operation, which simply combines two dual 16-bit operations in the same instruction. The comma between the two indicates that all four operations are done in the same cycle. This is used when both the sum and the difference of the pairs of 16-bit values in our two operands are required. Notice that R0 and R1 are used on both sides of the comma in this example.
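
To make the half-register notation concrete, here is a minimal Python sketch of the first two examples. It is only a behavioral model, not Blackfin code: the helper names are made up for illustration, and it assumes plain 16-bit wrap-around, whereas the real instructions also offer saturation options that are not modeled here.

```python
MASK16 = 0xFFFF

def hi(reg):
    """Upper 16-bit half of a 32-bit register value (like Rn.H)."""
    return (reg >> 16) & MASK16

def lo(reg):
    """Lower 16-bit half of a 32-bit register value (like Rn.L)."""
    return reg & MASK16

def pack(high, low):
    """Build a 32-bit register value from two halves, wrapping each to 16 bits."""
    return ((high & MASK16) << 16) | (low & MASK16)

def single_add(r6, r3, r2):
    """Model of R6.H = R3.H + R2.L; the lower half of R6 is left unchanged."""
    return pack(hi(r3) + lo(r2), lo(r6))

def dual_add_sub(r2, r3):
    """Model of R6 = R2 +|- R3: add the upper halves, subtract the lower halves."""
    return pack(hi(r2) + hi(r3), lo(r2) - lo(r3))
```

For example, dual_add_sub(0x00050009, 0x00030004) adds 5 + 3 in the upper half and computes 9 - 4 in the lower half.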

32-Bit ALU Operations
Of course the Blackfin can also do 32-bit arithmetic operations, as illustrated by R6 = R2 + R3. This is similar to the previous examples; the difference is that there is only a single operator, which tells the assembler that we are doing a single 32-bit operation. In other words, R2 and R3 contain 32-bit values, not pairs of 16-bit values. We can also do a dual 32-bit operation, where we take both the sum and the difference of, in this case, R1 and R2. These are also effectively single-cycle operations.

Dual MAC Operations
We mentioned the multipliers earlier. In this example a pair of MAC operations goes on at the same time: A1 -= R2.H * R3.H, A0 += R2.L * R3.L. These are two separate multiply-accumulate operations. R2 and R3 are the 32-bit registers containing the input data, and we can mix and match which half registers we use as input operands. Again this is effectively a single-cycle operation, and it can be combined with up to a dual data fetch.
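
The dual MAC above can be sketched in Python as follows. This is an illustrative model under stated assumptions: the operands are treated as signed 16-bit integers and the products are accumulated without any scaling, and the accumulators simply wrap at 40 bits. The fractional and saturating modes of the real hardware are not modeled, and the function names are hypothetical.

```python
def s16(value):
    """Interpret the low 16 bits of value as a signed 16-bit integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def wrap40(value):
    """Wrap an integer into the signed 40-bit range of an accumulator."""
    value &= (1 << 40) - 1
    return value - (1 << 40) if value & (1 << 39) else value

def dual_mac(a1, a0, r2, r3):
    """Model of A1 -= R2.H * R3.H, A0 += R2.L * R3.L as one step."""
    a1 = wrap40(a1 - s16(r2 >> 16) * s16(r3 >> 16))  # upper halves feed A1
    a0 = wrap40(a0 + s16(r2) * s16(r3))              # lower halves feed A0
    return a1, a0
```

Both accumulators are updated from the same pair of 32-bit source registers, which is why one pair of register reads is enough to keep both multipliers busy.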

Barrel Shifter
The Blackfin processor also has a barrel shifter, which can shift or rotate a half register, a 32-bit register or a 40-bit register by any number of bits, all in a single cycle. It is also used to perform individual bit operations on a 32-bit register in the register file; for instance we can individually set, clear or toggle a bit without affecting any of the other bits in the register. We can also test whether a particular bit has been cleared. There are also instructions for field extract and field deposit. With these you specify the starting position of the field and how many bits long it is, and you can pull that field out of a register and return it in the lower bits of another register. This is used for bit stream manipulation.
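
The bit and field operations described above can be sketched in Python as a behavioral model. The function names are made up for illustration and do not correspond to Blackfin mnemonics; only the masking logic is shown.

```python
MASK32 = 0xFFFFFFFF

def bit_set(reg, n):
    """Set bit n, leaving all other bits untouched."""
    return (reg | (1 << n)) & MASK32

def bit_clear(reg, n):
    """Clear bit n, leaving all other bits untouched."""
    return reg & ~(1 << n) & MASK32

def bit_toggle(reg, n):
    """Invert bit n, leaving all other bits untouched."""
    return (reg ^ (1 << n)) & MASK32

def bit_is_clear(reg, n):
    """Test whether bit n is zero."""
    return ((reg >> n) & 1) == 0

def field_extract(reg, position, length):
    """Pull out `length` bits starting at `position`, returned in the low bits."""
    return (reg >> position) & ((1 << length) - 1)

def field_deposit(background, field, position, length):
    """Write the low `length` bits of `field` into `background` at `position`."""
    mask = ((1 << length) - 1) << position
    return (background & ~mask & MASK32) | ((field << position) & mask)
```

field_extract and field_deposit are inverses of each other for the selected field, which is what makes them convenient for packing and unpacking bit streams.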

8-Bit Video ALUs
The Blackfin also has four 8-bit video ALUs, whose purpose is to let you perform up to four 8-bit operations in a single cycle, because a lot of video operations deal with 8-bit values. As some examples, we can do a quad 8-bit add or subtract, where we have four bytes for each of the operands and just add or subtract them. We can do a quad 8-bit average, where we average pairs of pixels or groups of four pixels. There is also an SAA instruction: Subtract, Absolute, Accumulate. It takes two sets of four pixels, takes the absolute value of the difference between them and accumulates the results in the accumulators. In fact four 16-bit accumulators are used. This is used for motion estimation algorithms.
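
As a rough illustration of the Subtract, Absolute, Accumulate step, the following Python sketch models one such operation on two groups of four pixels. The function name and the list-based representation of the four 16-bit accumulators are assumptions made for the example.

```python
def saa(accumulators, pixels_a, pixels_b):
    """Add |a - b| for each of four pixel pairs to four 16-bit accumulators."""
    if not (len(accumulators) == len(pixels_a) == len(pixels_b) == 4):
        raise ValueError("expected four accumulators and four pixels per operand")
    return [
        (acc + abs(a - b)) & 0xFFFF  # each running sum is kept to 16 bits
        for acc, a, b in zip(accumulators, pixels_a, pixels_b)
    ]
```

A motion estimation loop would call this once per group of four pixels while walking over a block, then add the four partial sums together to get the total difference between the two blocks.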

8-Bit ALU Operations
All of these quad 8-bit ALU instructions still effectively take one cycle to execute, so they are quite efficient. What we are showing below is the typical setup. We need four-byte operands, so for each operand we start with a field of two 32-bit registers, which provides 8 bytes, and we select any four contiguous bytes from that field. One selection forms the first operand and the other forms the second, and both are fed to the four 8-bit ALUs.

Additional Arithmetic Instructions
The Blackfin also supports a number of additional instructions that help speed up the inner loops of many algorithms. One is bitwise XOR: if you are creating linear feedback shift registers, as you might for CRC calculations or for generating pseudo-random number sequences, there are instructions to speed up this process. Bitwise XOR and bit stream multiplexing are also used to speed up operations such as convolutional encoders. There are also add-on-sign and compare-select instructions, which can be used to speed up the Viterbi decoders in your applications. There is also support for compliance with the IEEE 1180 standard for two-dimensional 8 x 8 discrete cosine transforms: with the add/subtract with prescale up or down, we can add or subtract 32-bit values and then either multiply the result by 16 or divide it by 16 before returning a 16-bit value. There is also a vector search, where you go through a vector of data a pair of 16-bit numbers at a time, searching for the greatest or the least value in that vector.
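
To show the kind of inner loop the bitwise XOR instructions are meant to accelerate, here is a minimal Python sketch of a 16-bit linear feedback shift register. The tap positions (16, 14, 13, 11) are a commonly used textbook choice and are an assumption for this example, not something taken from the presentation; the Blackfin instructions themselves are not modeled.

```python
def lfsr_step(state):
    """Advance a 16-bit Fibonacci LFSR by one step (taps 16, 14, 13, 11)."""
    # XOR the tapped bits together to form the new input bit.
    feedback = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    # Shift right by one and feed the new bit in at the top.
    return (state >> 1) | (feedback << 15)

def lfsr_sequence(seed, count):
    """Return `count` successive states starting from a non-zero seed."""
    if seed == 0:
        raise ValueError("an all-zero state never changes")
    states = []
    state = seed & 0xFFFF
    for _ in range(count):
        state = lfsr_step(state)
        states.append(state)
    return states
```

Each step is a handful of shifts and XORs per output bit, which is why dedicated instructions pay off when the same update runs for every bit of a data stream.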

The Addressing Unit
The addressing unit is responsible for generating addresses for data fetches. There are two DAGs, or Data Address Generators, which are arithmetic units that generate independent 32-bit addresses. 32 bits lets us reach anywhere within the Blackfin's memory space, internal or external. The other thing to note is that the unit can generate up to two addresses, and so perform up to two fetches, at the same time.

Address Registers
The address registers are contained within the addressing unit. There are six general-purpose pointer registers, P0 through P5. We initialize them with a 32-bit value to point anywhere in the Blackfin's memory space, and they can be used for 8-, 16- or 32-bit data fetches. Blackfins also have four sets of registers that are used for DSP-style data fetches, which include 16- and 32-bit fetches. DSP-style means the kinds of accesses we typically use in DSP operations, such as dual data fetches, circular buffer addressing and bit reversal; these are done with this particular group of registers.

Addressing
Blackfins also have dedicated stack and frame pointers, as shown here. The addressing unit supports the following operations. It can do address-only accesses, where we specify an index register or pointer register that points to the address in memory we are going to fetch from. It can also do a post-modify operation, where the register you specify is modified automatically after the data fetch is done, in order to prepare it for the next data fetch. Circular buffering, for instance, is supported using this method. It can also provide an address with an offset, where a pointer register might point at the base address of a table and we want to fetch at some known offset from that base address. When we do that, the modify values do not apply and no pointer update is performed.

Circular Buffer Example
Circular buffers are used a lot in DSP applications, where data comes in as a continuous stream but the filter only needs a small portion of it, so a circular buffer is typically the size of the filter we are using. We have a pointer that points to the data we are fetching from the circular buffer. As we step through the data, we need to do a boundary check whenever we update the pointer, to make sure we are not stepping outside the bounds of the circular buffer. If we do step outside, we need to wrap the pointer so that it is back within the bounds of the buffer. On the Blackfin this is done in hardware, without any additional overhead. Thanks to the base address registers, we can also place these buffers anywhere in memory without restriction. In this particular example we have a buffer that contains eleven 32-bit elements, and our goal is to step through every fourth element while staying inside the bounds of the circular buffer. First we initialize the base and index registers. The base register holds the address of the buffer in memory, and the index register holds the address we are going to start fetching from. They are both initialized to zero in this example.
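
The wrap-around behavior can be sketched in Python as a software model of what the hardware does on each post-modify. The base and index values of zero come from the example above; the length of 44 bytes and the modify value of 16 bytes are derived here from "eleven 32-bit elements" and "every fourth element", and the function names are made up for illustration.

```python
ELEMENT_BYTES = 4                # 32-bit elements
BASE = 0                         # base register: start address of the buffer
LENGTH = 11 * ELEMENT_BYTES      # buffer length in bytes (44)
MODIFY = 4 * ELEMENT_BYTES       # step of four elements in bytes (16)

def post_modify(index, modify=MODIFY, base=BASE, length=LENGTH):
    """Return the index value after one circular post-modify."""
    index += modify
    if index >= base + length:   # stepped past the end: wrap back in
        index -= length
    elif index < base:           # stepped before the start (negative modify)
        index += length
    return index

def fetch_addresses(count, index=BASE):
    """List the addresses used by `count` successive post-modified fetches."""
    addresses = []
    for _ in range(count):
        addresses.append(index)  # the fetch uses the current index
        index = post_modify(index)
    return addresses
```

With these values the first fetches use addresses 0, 16 and 32; the next update would give 48, which lies outside the 44-byte buffer, so the index wraps to 4. Because 4 and 11 share no common factor, eleven fetches touch every element once before the index returns to the base.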

The Sequencer
The last block in the core is the sequencer. The sequencer's function is to generate addresses for fetching instructions. It uses a variety of registers to select the next address to fetch from. The sequencer is also responsible for alignment: it aligns instructions as they are fetched. It always fetches 64 bits at a time from memory, but op-codes come in different sizes, so it realigns the 16-, 32- or 64-bit op-codes before sending them to the rest of the execution pipeline.

Program Flow
The sequencer is also responsible for handling events. Events can be interrupts or exceptions, and we will talk about those shortly. Any conditional instruction execution is also handled by the sequencer. The sequencer accommodates several types of program flow. Linear flow is the most common, where we just execute code in a straight line. Loops are handled, where you have a sequence of instructions that you want to execute over and over again some counted number of times. We can have a permanent redirection of program flow with a jump instruction, where we branch to some address in memory and continue executing from that point on. There are also subroutine operations, where we call a subroutine at some address, execute it, and return to the address following the call instruction.

Sequencer Registers
The sequencer has a number of registers that are used to control program flow on the Blackfin. There is an arithmetic status register; any time we do an arithmetic operation we may affect bits in it, for instance AZ for zero and AN for negative. The sequencer uses these status bits for the conditional execution of instructions. Blackfins also have a number of return address registers; RETS, RETI and RETX are shown, for subroutines and various events. These hold the 32-bit return address for any interruption of flow, either a call or some sort of interrupt. Blackfins also have two sets of hardware loop management registers, LC, LT and LB: the loop counter, the address of the top of the loop and the address of the bottom of the loop. They are used to manage zero-overhead looping. Zero-overhead looping means that once these registers are set up, no CPU cycles are spent decrementing the loop counter, checking whether it has reached zero, or branching back to the top of the loop. This is all handled without any overhead at all, simply by initializing these registers. Having two sets of these registers allows two nested levels of zero-overhead looping.

Instruction Pipeline

The Blackfin also has an instruction pipeline, which is responsible for the high clock speeds that are possible. One thing to notice is that the pipeline is fully interlocked. This means that in the event of a data hazard, for example when one instruction sets up a register that the next one is supposed to use and the value is not ready because of where the first instruction is in the pipeline, the sequencer automatically inserts stall cycles and holds off the second instruction until the first one is complete. In this case, cycle N+9 is the point at which the pipeline is filled. Once the pipeline is filled, every clock tick results in another instruction exiting the pipeline, that is, leaving the writeback stage. Effectively another instruction is executed on every clock tick, and this is where we get our execution speed of one instruction per core clock cycle.
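
The effect of pipeline fill on throughput can be put into a small Python sketch. It assumes a pipeline depth of ten stages, inferred from the pipeline being full at cycle N+9, and treats stalls as a simple extra cycle count; the function name is hypothetical and this is an idealized estimate, not a cycle-accurate model.

```python
def cycles_to_complete(instruction_count, depth=10, stall_cycles=0):
    """Estimate cycles until the last of `instruction_count` instructions retires."""
    if instruction_count <= 0:
        return 0
    # The first instruction needs `depth` cycles to reach the end of the pipe;
    # after that one instruction leaves per cycle, plus any interlock stalls.
    return depth + (instruction_count - 1) + stall_cycles
```

For long instruction sequences the fill cost becomes negligible, so the average approaches one instruction per cycle as described above.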

Blackfin Event Handling

Blackfin also handles events. Events can be categorized as interrupts, which are typically hardware generated (a DMA transfer has completed, a timer has reached zero); you can also generate software interrupts. Other events are exceptions. These can be the result of an error condition, such as an overflow or a CPLB miss, or of software requesting service; there is a way for software to generate exceptions intentionally. The handling of events is split between the core event controller and the system interrupt controller. The core event controller runs in the core clock domain and the system interrupt controller runs in the SCLK domain. The core event controller has 16 levels associated with it and deals with requests on a priority basis. The nine lowest-priority levels of the core event controller are available for peripheral interrupt requests to be mapped to. Every level has a 32-bit interrupt vector register associated with it. This can be initialized with the starting address of the ISR, so that when that particular level is triggered, that is the address we start fetching code from.

System and Core Interrupts
Blackfin also has the SIC, or System Interrupt Controller, portion of the interrupt handler. The mapping of a particular peripheral interrupt request to a specific core interrupt level happens here. This lets us change the priority of a particular peripheral's interrupt request. What we see here is an example using the BF533 interrupt handling mechanism. The left-hand side is the system interrupt controller side, and here we have the core interrupt controller side. Again there are 16 levels at the core, and what we are showing is that they are prioritized. Level 0, or IVG0, has the highest priority in the system and level 15 has the lowest. If multiple interrupts come in at the same time, the one with the highest priority is serviced first.
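
The mapping and prioritization described above can be sketched as a small Python model. The peripheral names and level assignments are hypothetical and are not the BF533 defaults; the only facts taken from the text are that peripherals map onto the nine lowest-priority levels (7 through 15 of the 16) and that a lower level number means a higher priority.

```python
PERIPHERAL_LEVELS = range(7, 16)  # the nine lowest-priority core levels

def assign_level(mapping, peripheral, level):
    """Return a new mapping with `peripheral` routed to core level `level`."""
    if level not in PERIPHERAL_LEVELS:
        raise ValueError("peripherals can only be mapped to levels 7..15")
    updated = dict(mapping)
    updated[peripheral] = level
    return updated

def next_to_service(mapping, pending):
    """Pick the pending peripheral whose level has the highest priority."""
    candidates = [(mapping[name], name) for name in pending]
    if not candidates:
        return None
    return min(candidates)[1]  # lowest level number wins
```

For example, with a hypothetical mapping {"dma_done": 8, "timer": 10, "uart_rx": 12}, a simultaneous timer and uart_rx request is resolved in favor of the timer; remapping uart_rx to level 7 reverses that outcome without touching the peripheral itself.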

Variable Instruction Lengths
The Blackfin also has variable instruction lengths. There are three main op-code lengths that the Blackfin deals with. 16-bit instructions include most of the control-type instructions as well as data fetches; these are all 16 bits long in order to improve code density. Next are 32-bit instructions. These include most of the control-type instructions that have immediate values, for instance loading a register with an immediate value. Most arithmetic instructions are also 32 bits long. There is also a multi-issue 64-bit instruction, which lets us combine a 32-bit op-code with a pair of 16-bit instructions, typically an arithmetic operation with one or two data fetches. In the example there is a dual multiply-accumulate operation in parallel with a dual 32-bit data fetch. The parallel pipe operator (||) is used to tell the assembler to issue these operations together. As a single 64-bit op-code, these three instructions go through the pipeline together, so they all get executed at the same time.

Instruction Packing
When code is compiled and linked, the linker places each instruction into the next available memory space, so instructions are not padded out, for instance, to the largest instruction size handled by the Blackfin. This means there is no wasted memory space. As a programmer you do not have to worry about this: the sequencer has an alignment feature, so as it reads 64 bits of instructions from memory it performs the alignment and pulls out the individual 16-, 32- and 64-bit op-codes. If an op-code happens to straddle an 8-byte address boundary, the sequencer also realigns it before passing it on to the rest of the pipeline.

16-Bit FIR Filter Example
Blackfin is optimized for operations such as FIR filtering and FFTs. Here is an example of a 16-bit FIR filter. In fact there is one input data stream, and two filters are applied to the same data stream. R0 and R1 are used as the two input registers. R0 contains two pieces of 16-bit input data and R1 contains 16-bit filter coefficients. The data is in the delay line and the filter coefficients are fetched from the buffer. The first filter coefficient is used with the first two pieces of data, and then another piece of data is fetched while the next filter coefficient is used with two other pieces of data. The code example at the bottom shows two multiply-accumulate instructions in parallel with either a single or a dual data fetch. In addition to doing those two data fetches, the pointers are updated to point to the next pieces of data.
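
One way to read the data flow above is that each coefficient is multiplied with two neighboring input samples at once, so that two accumulators build up two filter results in the same pass. The Python sketch below models that reading only; it is an interpretation rather than a transcription of the slide's assembly, it uses plain integers instead of 16-bit fractional data, and the function name is made up.

```python
def fir_two_outputs(samples, coefficients, n):
    """Compute y[n] and y[n + 1] together, sharing each coefficient fetch."""
    if n < len(coefficients) - 1 or n + 1 >= len(samples):
        raise ValueError("n must leave a full filter window inside samples")
    a0 = 0  # accumulator for y[n]
    a1 = 0  # accumulator for y[n + 1]
    for k, coefficient in enumerate(coefficients):
        # One coefficient feeds both multiply-accumulates in this step.
        a0 += coefficient * samples[n - k]
        a1 += coefficient * samples[n + 1 - k]
    return a0, a1
```

Sharing the coefficient between the two accumulations halves the number of coefficient fetches per output, which is the point of pairing the dual MAC with the dual data fetch.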

Additional Resources
Thank you for taking the time to view this presentation on ADI Blackfin processor. If you would like to learn more or go on to purchase some of these devices, you can either click the part list, or call our sales hotline. For more technical information you can either visit the ADI site – link shown – or if you would prefer to speak to someone live, please call our hotline number shown, or even use our ‘live chat’ online facility.

Read also: Blackfin Processor Core Architecture 1/3
