Discussion 15

Control Unit

The control unit is a finite state machine that takes as its inputs the IR, the status register (which is partly filled by the status output from the ALU), and the current major state of the cycle. Its rules are encoded in random logic, a Programmable Logic Array (PLA), or Read-Only Memory (ROM), and its outputs are sent across the processor to each point requiring coordination or direction from the control unit.

For example, the outputs needed for the portion of the instruction/data path shown in Discussion 13 are Jump/Branch/NextPC, IR Latch, Read Control, Load Control, ALU Function Select, Load/Reg-Reg, and Reg R/W.

The ALU Function Select takes the instruction op code and translates it into a given function of the ALU (either one line per ALU function or a compact binary code for the function). Jump/Branch/NextPC depends on the instruction type, and in a RISC architecture these may be directly coded in the op code. Read Control is asserted at the start of an instruction cycle. IR Latch occurs at the end of the fetch state. Load Control happens at the end of the data fetch state of a load instruction. Load/Reg-Reg again depends on the op code. Reg R/W is asserted at the start of the data fetch stage and at the write-back stage of an operation; it thus depends on both the major state and the instruction.

A CISC architecture typically uses a more complex control unit. As we’ve noted before, the IR is often multiple words, and the control unit has to look at different parts of the IR at different stages of execution. In fact, the entire IR may not be available at once, requiring interlocks with fetch logic to ensure the contents of the IR are valid.

There are many more control signals coming out of a CISC control unit, partly to control the more complex addressing logic, but also to directly connect to the many special purpose registers. In a RISC architecture, the registers are accessed uniformly in a block so a simple decoder in the register file can select the particular register. In a CISC architecture, there are restrictions on the particular registers that can be used by a given instruction and these are enforced by the control unit.

 

To begin the design of a control unit, we start by listing every control signal in the instruction/data path of the processor. This becomes a list of the control unit's outputs. As input, it has the instruction register, any status information (such as branch flags, interrupts, etc.) from the processor, and a “major state” which simply keeps track of where we are in the execution of an instruction. We always begin an instruction with state 0, which corresponds to Fetch. During that state, the control unit outputs the necessary signals to route the contents of the PC to the memory address port, to select and clock the memory until it responds with data from that location, and then to cause this data to be latched into the IR.

 

In a CISC architecture, the Fetch may only retrieve the first part of an instruction, and (depending on bits in the IR that are then decoded by the control unit) more words may need to be fetched. In a RISC architecture, a single Fetch retrieves a complete instruction, so we may proceed to the next major state, which is usually to begin fetching data from the registers, while we decode the instruction.

 

In a RISC architecture, “decoding an instruction” mainly means that the instruction type field determines what the control unit will do for the remainder of the instruction. If you think of the CU as a finite state machine, the bits in the type field select the next state following the decode.

 

In terms of a program’s logic, this is like selecting a branch in a Switch statement – each branch of the Switch contains the series of steps to be performed for one type of instruction. For example, after decoding a Jump instruction, the control unit outputs the signals required to combine the address portion of the instruction with the upper bits of the PC and load the result back into the PC. The CU then returns to the Fetch step. Thus, a Jump has three major states (Fetch, Decode, Complete). For a memory Load instruction, the CU first sends one of the selected register values (the address) to the address port of the memory (via a multiplexer) and signals the memory to fetch this location. When the memory returns the value, then the CU sends signals to the necessary multiplexer(s) and the register file so that the memory data goes over the Dest bus and is stored in the designated register. Thus, a Load has four major states (Fetch, Decode, Memory, Write Back).
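The state sequences just described can be sketched as a switch on the decoded instruction type. This is a minimal illustration, not any real control unit; the state names simply follow the discussion above.

```python
def state_sequence(instr_type):
    """Return the major states a RISC CU steps through for one instruction."""
    common = ["Fetch", "Decode"]          # shared by every instruction type
    rest = {
        "Jump": ["Complete"],             # combine address bits, load the PC
        "Load": ["Memory", "WriteBack"],  # read memory, then write register
    }[instr_type]
    return common + rest

assert state_sequence("Jump") == ["Fetch", "Decode", "Complete"]            # 3 states
assert state_sequence("Load") == ["Fetch", "Decode", "Memory", "WriteBack"] # 4 states
```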

 

So, for each type of instruction, and for each major state in each type of instruction, we look at the list of control signals and decide what value each signal must have. In some cases, the value doesn’t matter (e.g., if memory isn’t selected, it doesn’t matter whether it is set to read or write, because it simply won’t do anything in either case). You can think of this as a large 2-dimensional table indexed by instruction type and major state. Within each cell of the table is a list of the control signals and their values.
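The 2-dimensional table can be sketched directly as a mapping from (instruction type, major state) to signal values. The signal names here are hypothetical, and `None` marks a "don't care" entry as described above.

```python
CONTROL_TABLE = {
    # (instruction type, major state) -> control signal values; None = don't care
    ("Jump", "Complete"):  {"PCLoad": 1, "MemSelect": 0, "MemRW": None, "RegWrite": 0},
    ("Load", "Memory"):    {"PCLoad": 0, "MemSelect": 1, "MemRW": "R",  "RegWrite": 0},
    ("Load", "WriteBack"): {"PCLoad": 0, "MemSelect": 0, "MemRW": None, "RegWrite": 1},
}

def signal(instr_type, state, name):
    """Look up one control signal's value for a given table cell."""
    return CONTROL_TABLE[(instr_type, state)][name]

assert signal("Load", "Memory", "MemRW") == "R"
assert signal("Jump", "Complete", "MemRW") is None   # memory not selected: don't care
```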

 

One last bit of control output that we’ve neglected is the control of the major state itself. This is usually a register, as shown above, that is input to the CU; it also receives its next value from the CU on each clock. In the above example, the Jump proceeds from State 0 (Fetch) to State 1 (Decode) to State 2 (Complete) and then goes back to State 0, while a Load adds a State 3. In some designs, the state register also encodes the instruction type. It is then really referring to the different states of the finite state machine (FSM) rather than the major steps of the instructions. So, for example, the FSM states for a Load might be the sequence 0, 1, 12, 13, where the latter two distinguish Memory and Write Back from the Complete stage of the Jump. In other designs, we might see Jump going through states 0, 1, 2, and Load going through 0, 1, 2, 3, with the type field used to distinguish the different behavior of the latter states. These are just somewhat different ways of naming the same things. The important point is that the CU has the inputs it needs to know what it is supposed to be doing on the present clock and what it will do next. In the CU design process, this translates to ensuring that one of the control signals on the list is the “next state” signal, and that we always specify it in every cell of our table.

 

Control for the Multicycle Datapath

The text shows how this datapath can be controlled by a finite state machine with just ten states.

Unlike the way this is drawn in the text, it is quite obvious here that each level of the controller's finite state machine corresponds to a clock cycle (or major state). This view clearly shows the commonality of the first two major states, prior to the decoding of the instruction. The third stage is a typical fanning out of the finite state machine to deal with the different cases of the instruction types.  Thus, we can see that the machine is arranged in a table where the rows are major states, and the columns are major instruction types. Because the new PC values have been computed, the Jump and Branch types can be finished early. The memory read takes the longest (5 major states).

Each of these states produces a set of control signals that cause the multiplexers to select the appropriate inputs, and the various registers to latch at the proper time. This is determined by looking at each device requiring control and determining for each state whether it is necessary to issue a signal on that particular control line.

Note that in the text, the circuitry controlled by the status from the ALU is shown external to the control for simplicity. But in a more general controller that must handle multiple conditions, this logic would not be separated. Because there are ten distinct states, and the book chooses to encode the instruction type in the major state (really the FSM state number), the state register for this machine must have four bits.

One common approach to implementing a controller is to have the major state implemented by a counter register, while within each of these states the different columns are distinguished by a separate state register (which in a RISC architecture could just be the instruction type code). The major state counter simply increments on each clock cycle, and when a column of states finishes early, its last state generates a reset signal to the major state counter.
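A minimal sketch of that counter arrangement, with an invented class name: the counter increments every clock, and the final state of a column asserts a reset that returns it to Fetch (state 0).

```python
class MajorStateCounter:
    """Increments each clock; a column's final state resets it to Fetch (0)."""
    def __init__(self):
        self.state = 0

    def clock(self, last_state_reached):
        # last_state_reached is the reset signal generated by the final
        # state of whichever instruction-type column is active.
        self.state = 0 if last_state_reached else self.state + 1

ctr = MajorStateCounter()
ctr.clock(False)            # Fetch -> Decode
ctr.clock(False)            # Decode -> state 2
ctr.clock(True)             # a Jump finishes early: reset back to Fetch
assert ctr.state == 0
```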

Another issue that is not addressed here is what happens when there is a memory delay. We are still assuming that memory returns in a single cycle. The simplest case is to stall the finite state machine until the fetch is complete. This is done by having a memory wait signal that is input to the finite state machine. Each memory fetch state has a next state arc that loops back to itself whenever memory wait is asserted.
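The stall mechanism amounts to a self-loop on the next-state function: a memory-fetch state returns to itself while the wait signal is asserted. A sketch, using hypothetical state names:

```python
def next_state(state, mem_wait):
    """A memory-fetch state loops back to itself while mem_wait is asserted."""
    if state == "Memory" and mem_wait:
        return "Memory"        # stall: stay in the fetch state
    transitions = {"Fetch": "Decode", "Decode": "Memory",
                   "Memory": "WriteBack", "WriteBack": "Fetch"}
    return transitions[state]

assert next_state("Memory", mem_wait=True) == "Memory"     # stalled
assert next_state("Memory", mem_wait=False) == "WriteBack" # memory responded
```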

Another Simple Datapath Example: the PDP-8

The DEC PDP-8 is a frequently cited example of an almost trivial datapath. We'll quickly take a look at it and note some differences in the implementation approach. The PDP-8 is a 12-bit word machine with a single register (called the Accumulator). It thus falls into the class of single address computers. The majority of the instructions are specified by the upper 3 bits of a 12-bit word. There are thus 8 major instruction types:

0xxx   Logical AND location xxx with Accumulator

1xxx   Add location xxx to Accumulator

2xxx   Increment location xxx and skip next instruction if the result is 0

3xxx   Store Accumulator in location xxx and clear Accumulator

4xxx   Subroutine jump to xxx + 1 and store return address in xxx

5xxx   Unconditional jump to xxx

6xxx   I/O value in Accumulator with device according to xxx

7xxx   Subinstruction code specified by xxx

In the DEC scheme, the high order bit is number 0 and the low order bit is 11. Operations 0 through 5 use an addressing scheme in which bit 3 determines whether the address is direct or indirect, and bit 4 determines whether it refers to an address in the current 128-word "page" or the page that starts at address 0.
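Given the field layout above (3-bit op code, defer bit, page bit, 7-bit page offset, with DEC bit 0 as the high-order bit), a decoder can be sketched directly. The field names are invented for illustration.

```python
def decode(word):
    """Split a 12-bit PDP-8 word (DEC numbering: bit 0 = high order)."""
    assert 0 <= word < 4096
    return {
        "opcode":   (word >> 9) & 0o7,   # bits 0-2: instruction type
        "indirect": (word >> 8) & 1,     # bit 3: direct/indirect
        "page":     (word >> 7) & 1,     # bit 4: page 0 / current page
        "offset":   word & 0o177,        # bits 5-11: 7-bit page offset
    }

d = decode(0o1077)   # "add location 77 directly to the accumulator"
assert d == {"opcode": 1, "indirect": 0, "page": 0, "offset": 0o77}
```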

The processor organization is as follows; note that, for the sake of further simplification, the Current Page mode logic has been omitted:

The processor has three major states: Fetch, Defer (Indirect address fetch), and Execute. Here is an example of an instruction execution:

1077 -- add location 77 directly to the accumulator

Major State 1 (Fetch)

Minor State 1 MAR <- PC

Minor State 2 MBR <- Mem(MAR), PC <- PC + 1

Minor State 3 Latch MBR into IR

Minor State 4 Decode: Instruction = Add, Mode = Direct, Page = 0; assert No defer and Max St

Major State 3 (Execute)

Minor State 1 MAR <- 00000 + IR5..IR11

Minor State 2 MBR <- Mem(MAR), ALU Function = Add

Minor State 3 Acc <- ALU result, Max St

Now let's look at some of these control lines to see what logic expressions drive them:

No defer = (Major State = 1) AND (IR3 = 0)

Latch MBR = (Major State = 1) AND (Minor State = 3)

Increment PC = (Major State = 1) AND (Minor State = 2)

IR address/MBR = (IR0..IR2 = 4 or 5) AND (IR3 = 1)

Load PC = (Major State = 3) AND (Minor State = 3) AND (IR0..IR2 = 4 or 5)

Memory Data/ALU = (Major State = 3) AND (IR0..IR2 = 3)

MBR Latch = ((Major State = 1) AND (Minor State = 2)) OR ((Major State = 2) AND (Minor State = 2)) OR ((Major State = 3) AND (Minor State = 2) AND (IR0..IR2 < 4))

And so on. Essentially for each of the control signals we identify all of the conditions that could cause it to be asserted and add them to the expression for the given signal.
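A few of the expressions above transcribe directly into Boolean functions of the major state, minor state, and IR fields. This is a sketch of that transcription (the design itself is the simplified PDP-8 of this discussion, not the real machine):

```python
def latch_mbr_into_ir(major, minor):
    """'Latch MBR': end of the fetch major state."""
    return major == 1 and minor == 3

def increment_pc(major, minor):
    """PC increments while the instruction word is being fetched."""
    return major == 1 and minor == 2

def mbr_latch(major, minor, opcode):
    """MBR latches on a memory response in Fetch, Defer, or a memory-reading Execute."""
    return minor == 2 and (major == 1 or major == 2
                           or (major == 3 and opcode < 4))

assert increment_pc(1, 2) and not increment_pc(3, 2)
assert mbr_latch(3, 2, opcode=1)          # TAD reads memory during Execute
assert not mbr_latch(3, 2, opcode=5)      # JMP does not
```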

Microcode

In many CISC architectures the control unit can feed back to the major (and minor) states and has internal registers and a ROM. It can thus cause portions of an instruction to be extended or repeated as necessary. The minor states, registers, additional control logic, and ROM form a finite state machine called a microcode engine. In the microcode engine, the op-code from the IR becomes a jump address into the ROM. A micro-PC can be used to step through a series of fetches from the ROM starting at that point, with each fetch resulting in control signals being sent out and providing feedback to the major and minor state values.

An alternative to using a micro-PC is to have each instruction explicitly specify the address of its successor. Thus, one of the fields in the micro-op may be the address of the next instruction. This allows jumps to be used anywhere in the microcode with no time penalty -- consider that if a separate instruction had to be employed for a jump, it would add a cycle to the execution of the ISA instruction.

The microcode instruction set can contain a subroutine jump so that common sequences of control outputs can be reused. This is typically present in systems that employ a micro-PC rather than a next-instruction field. There is typically just one subroutine return register, so nesting of subroutines is not allowed. The subroutine jump may thus be implemented as a normal jump with a signal issued that stores the current micro-PC value into the return register, and the subroutine ends with a Return operation rather than a jump to another location. The Return operation issues a signal that loads the return register back into the micro-PC. In a system with an explicit next-instruction field, the current address plus one is stored into the return register, and it is implicit that the next location holds the instruction following the subroutine call. (An alternative is to have both a subroutine jump address and a next-instruction address, and simply return to the same instruction, but this requires the instruction format to be larger than necessary for most operations.)
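The next-address-field sequencing and one-deep subroutine mechanism can be sketched as follows. The ROM contents, field layout, and operation names here are entirely invented; the point is only the control flow: every word names its successor, a call saves the caller's successor in the single return register, and Return reloads the micro-PC from it.

```python
# Hypothetical microcode ROM: address -> (control-signal bundle, next address).
ROM = {
    0:  ("signals_A", 10),
    10: ("call", 20),          # subroutine jump: save return, go to 20
    11: ("signals_C", 0),
    20: ("signals_sub", "ret"),  # shared sequence ending in a Return
}

def run(start, steps):
    upc, ret_reg, trace = start, None, []
    for _ in range(steps):
        op, nxt = ROM[upc]
        trace.append(op)
        if op == "call":
            ret_reg = upc + 1    # next-address-field style: resume at caller + 1
            upc = nxt
        elif nxt == "ret":
            upc = ret_reg        # Return: reload micro-PC from return register
        else:
            upc = nxt
    return trace

assert run(0, 4) == ["signals_A", "call", "signals_sub", "signals_C"]
```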

Microinstruction Format Design

From the preceding discussion of how sequential execution, jumps, and calls are executed we can gather that the microinstruction needs to include control information for the microengine itself. But what else does it store?

In the simplest microinstruction formats, each bit represents a control signal that is sent out to the datapath. Thus, when a microinstruction is fetched, the bits in the instruction are connected to control wires and cause actions to occur in the datapath.

In our PDP-8 example, there are 14 control signals that are driven by a single bit, and three others that each require multiple bits, for a total of 23 bits. Thus, we might use a microinstruction format such as:

Each of the first 20 bits of this microinstruction corresponds to a control signal in the PDP-8. For example, bit 0 might be the Halt, bit 1 the No defer, bit 2 the Max St., bit 3 Acc. Load, bit 4 Acc. clear, etc. This simple representation is effective but inefficient. For example, it uses 4 bits to select the ALU function but only one of those bits should be asserted at any time. Thus, we can save some memory by storing the number of the active bit and using an external "decoder" circuit to translate the two bits into the four lines that control the ALU.
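The encoding trick is just a one-hot-to-binary compression with an external decoder restoring the one-hot lines. A minimal sketch:

```python
def encode_onehot(lines):
    """Store the number of the single asserted line (2 bits instead of 4)."""
    assert lines.count(1) == 1
    return lines.index(1)

def decode_field(n, width=4):
    """External decoder: expand the compact code back into one-hot lines."""
    return [1 if i == n else 0 for i in range(width)]

# Round trip: the four ALU-function lines survive the compression intact.
assert decode_field(encode_onehot([0, 0, 1, 0])) == [0, 0, 1, 0]
```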

In a simple design such as the PDP-8, it may seem that this (two bit) savings is trivial, but in a CISC ISA, the number of control signals can be quite large and it is important to minimize the number of bits in the microinstruction format. Every location in the microcode memory has to store the same set of bits, so any waste of bits is multiplied by the number of words.

Microinstruction format designs are often classified by the width of the word employed. One approach is to have a very wide word that contains all of the control signals necessary to drive the system. Such a design is referred to as a "horizontal" microinstruction format.

Another approach is to use a narrower word, with a sequence of microinstructions being required to drive all of the control signals. This is called a "vertical" microinstruction format.

At first glance, it appears that a vertical microinstruction is inherently slower -- it takes a sequence of operations to accomplish what the horizontal microinstruction can do in a single cycle. But consider that in many cases, individual control lines are asserted only in certain minor states. If all of these are grouped together by minor state, then they can reuse some of the bits of the microword by having their outputs first fed to a "demultiplexer" that steers them to the proper signal lines according to the current minor state (which may itself be part of the instruction). In effect the microinstruction format is using multiple instruction formats to reduce redundancy.

Horizontal microcode has been employed in massively parallel array processors where every processor in the system shares a single controller. Often the controller is itself a full-fledged computer, and so the microinstruction both contains traditional machine code for the controller itself as well as the control signals that are distributed to the processors that make up the array. A typical horizontal microinstruction for this type of machine is 128 bits wide (16 bytes). Thus, every effort is made to reduce the number of words required. It is important to note, however, that this is an unusual application of microcode.

One other problem with horizontal microcode is that it is difficult to drive such a large number of signals simultaneously. The switching of so many drivers at once can cause the power supply voltage to sag momentarily (the same as when your lights dim as you turn on a big appliance). This in turn causes noise to appear on signal lines that can cause erroneous behavior in other parts of the computer. Avoiding this requires careful circuit design and sometimes clever tricks, such as ensuring that the signals are asserted in a series of slightly offset time steps.

Why Microcode?

Of course, the whole reason for using microcode is to manage the complexity of a CISC ISA's control unit. Most RISC designs, even those that have fairly complex implementations, are still sufficiently regular that their control units can be directly constructed from a FSM built with combinational logic. (The advantage of using combinational logic is that it is easier to build a fast decoder with it than with a microcode ROM.)

However, for a CISC ISA, the speed decrease resulting from the use of microcode is often outweighed by the need to manage the complexity of controlling the architecture (a slow processor is, after all, more useful than a faster one that doesn't work). In a CISC architecture there may be a large number of instruction types, each with different fields referring to a wide range of registers that have asymmetrical functions, or referring to one or more memory operands with as many as 20 different addressing modes. (The DEC VAX, Intel 80X86/Pentium, and Intel iAPX 432 are prime examples.)

A control unit for a CISC ISA can thus have to deal with instructions that involve tens and even hundreds of minor states. There would thus be thousands of logic expressions to generate the control signals. CAD tools can simplify and minimize these and even lay them out on the silicon automatically, but it is still a large block of irregular logic on the chip.

More importantly, if a mistake is discovered later (i.e., one of the logic expressions is wrong), then it may be necessary to resimplify the entire design and lay it out again, which could mean a redesign of the rest of the chip to accommodate a change in the size of the control unit. This is obviously a very costly error. Unfortunately, it is also common. Even with the best design and simulation tools, several commercial chips have gone into production with errors that were discovered later. An early 68000 design had a bug that would cause the processor to hang in certain cases, and Intel shipped half a million Pentium processors with an error in the floating-point division instruction before it was caught, having had bugs in earlier processors as well.

Since errors do occur, manufacturers of CISC processors use microcode to reduce the cost of correcting the errors (and to help simplify the initial design, which in itself helps to reduce errors). A bug-fix in a microcoded controller is just a matter of changing the ROM, which does not affect the size of the controller at all. It is a very low-cost correction.

In addition, a microcoded design is easier to enhance because unused op-codes can be turned into new machine instructions by simply extending the microcode. This simplicity of extension may be another factor that has led to the increasingly complex ISAs produced by CISC manufacturers.

At one point, it was even thought that allowing the user to add to the microcode was a good idea. If a user has a particular operation that they want to accelerate, they can code it up as a new instruction in the microcode. Such machines were said to have "writeable control stores." The VAX 11/780, the first model in the VAX line, was one of the most widely sold of these machines. However, the feature was rarely used for two reasons: first, the compilers could not take advantage of the custom instructions so the user had to program in assembly language; second, the microcode is tied to the machine implementation (it refers to the particular control signals in the design) so it is not even portable to another model of the same architecture.

Nanocode

In some cases, such as the Motorola 68000, there is also a nanocode engine. The 68000 uses 544 17-bit words in its microengine and 336 68-bit words in its nanocode engine. It thus has 32,096 bits of ROM. If everything had been done with 68-bit words, it would have required 36,992 bits.
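The ROM-size claim is simple arithmetic, worked here for checking:

```python
micro = 544 * 17          # microcode ROM: 9,248 bits
nano = 336 * 68           # nanocode ROM: 22,848 bits
assert micro + nano == 32096       # two-level design
assert 544 * 68 == 36992           # single-level, all 68-bit words
```

The two-level split saves 4,896 bits of ROM.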

The M68000 microcode is very unusual in that the microcode implicitly calls the nanocode. Each microcode instruction causes a corresponding nanocode engine instruction to be fetched automatically. The nanocode bits are actually the control signals that get distributed across the machine. The microcode instructions thus have only to determine what the next instruction will be. They have two formats: one for an unconditional jump (perhaps just to the next location) and one for a conditional jump (two bits of the jump address are reserved for the result of the conditional test). This would seem to imply that there are as many nanocode instructions as microcode instructions. Yet we can see that there are 208 fewer. This is accomplished by carefully assigning the addresses so that common nanocode operations can have multiple microcode locations corresponding to them. The address space allows for 1024 instructions, and they are arranged so that if a bit (or several) is ignored, the same nanocode address is produced. Essentially the engineers mapped the microcode operations into locations so that certain of the address bits are "don't cares" and all of those locations are then mapped to the same nanocode address. The don't cares are achieved by removing selected transistors in the address decoders of the nanocode ROM.
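The address-folding idea can be sketched with a bit mask: microcode addresses that differ only in "don't care" bits select the same nanocode word. The masks and addresses here are invented; the real 68000 implements the masking by omitting transistors in the nanocode ROM's address decoder.

```python
def nano_address(micro_addr, dont_care_mask):
    """Bits set in the mask are ignored by the nanocode address decoder."""
    return micro_addr & ~dont_care_mask

# With bit 4 ignored, two distinct microcode locations share one nanoword:
a = nano_address(0b0110110, 0b0010000)
b = nano_address(0b0100110, 0b0010000)
assert a == b == 0b0100110
```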

Comparison of Some Microcode Engines

Motorola 68000: 544 17-bit microwords, plus 336 68-bit nanowords

DEC LSI-11: 2048 22-bit microwords

IBM 3033 Mainframe: 2048 108-bit microwords plus 2048 126-bit microwords

Texas Instruments 8800: 32K 128-bit microwords in user-programmable RAM

UMass/Hughes IUA-2: 64K 128-bit microwords in RAM

 


© Copyright 1995, 1996, 2001 Charles C. Weems Jr. All rights reserved.

