CmpSci 535 Discussion 7

Number Representations

As we saw in the discussion of Boolean logic, we can perform arithmetic on binary numbers using combinations of logical operations. Now we'll take a somewhat closer look at the specifics of arithmetic operations.

Recall from assembly language that binary numbers have a power of two associated with each of their digits. If we use this to number the digits as we write them down, we get a number scheme in which a 32-bit number has bit 31 on the left and bit 0 on the right.

In decimal notation we represent a negative number by writing a minus in front of a positive number. With binary numbers we do not have the convenience of such a notation, although we could declare one bit (say bit 31) to be a sign bit and the rest of the number to be the magnitude. Such a notation is called signed-magnitude representation. Suppose we were using a four-bit word. Then 6 would be represented as

0110

and -6 as

1110

This is easy to read, but it has the disadvantage that if we add the numbers in this form, we do not get 0.




  0110   (6)
+ 1110   (-6)
------
1 0100

so the result of 6 + (-6) appears to be 4 with a 1 carried out (overflow).

We don't particularly care about readability of binary numbers in the computer because we rarely look at them anyway (we write software to translate them into decimal notation). So we should really be more concerned with what representation is most useful for the computer. A system that allows us to use an adder to sum positive and negative numbers will simplify the design of the ALU. There are two such systems of representation and they both depend on the complement of a number.

The complement of a number in a given base can be defined as the difference between each digit of the number and the maximum digit value for the base. Thus, in base 10 the complement of 26 would be 73, because the maximum digit value is 9. In base two, the complement is easy to compute -- it is just the bitwise inverse of a number. Thus the complement of 11010 would be 00101.

This scheme is known as ones-complement and was used on machines designed by Seymour Cray at Control Data Corporation. They used ones complement because it made negating a number as fast and simple as passing it through a word-wide bank of inverters. It is also a symmetric number system in that, for any given word size, there are an equal number of positive and negative numbers. The problem is that one of those numbers is -0. So, a ones complement machine has to recognize the presence of all 1's in a word (with a giant AND gate) and invert the bits again whenever -0 appears. In ones complement, the above addition would be represented by:




  0110   (6)
+ 1001   (-6)
------
  1111   (-0) => 0000


Similarly 26 + 73 = 99 (-0) => 0

We can avoid the problem of -0 by using two's complement, in which we add 1 to the complement of a number to negate it.




  0110   (6)
+ 1010   (-6)
------
1 0000  => 0000 with overflow

The 10's complement of 26 would be 74.


So 26 + 74 = 100 => 0 with overflow

Thus we never have negative 0 to worry about, but we do have a minor asymmetry -- there is one more negative number than positive number in our system. The unique 0 is essentially taken from the positive side of the number line in a two's complement system.
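The two's complement arithmetic above can be sketched in a few lines of Python. This is a minimal illustration for a four-bit word; the names (`negate`, `MASK`) are illustrative, not from any library.

```python
BITS = 4
MASK = (1 << BITS) - 1  # 0b1111: keeps results within the word size

def negate(x):
    """Two's complement negation: invert the bits, then add 1."""
    return (~x + 1) & MASK

six = 0b0110
minus_six = negate(six)            # 0b1010
result = (six + minus_six) & MASK  # 0b10000; masking discards the carry out of bit 3
# result is 0, with no -0 case to handle
```

Note that masking with `MASK` plays the role of the fixed word width: the overflow bit simply falls off the end, just as it does in the adder.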

As we saw before, we can perform the +1 part of a two's complement by using the carry-in to the low-order bit of an adder, so there is almost no cost associated with this operation.

In two's complement, a high-order bit of 1 always signifies a negative number (in ones complement, we have to check for -0 when the high order bit is 1). Thus, the high order bit is usually called the sign bit.

If we need to convert a two's complement number to a larger word size, we simply extend the sign bit into the high order digits to produce a number of the correct form. That is, if the number is positive, all of the higher order bits in the wider word will be 0. If it is negative, they are all 1.
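Sign extension can be sketched as follows; `sign_extend` is an illustrative helper name, assuming values are held in Python integers.

```python
def sign_extend(value, from_bits, to_bits):
    """Widen a two's complement value by copying its sign bit
    into all of the new high-order bit positions."""
    sign_bit = 1 << (from_bits - 1)
    if value & sign_bit:  # negative: fill the upper bits with 1s
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value

# 0b1010 is -6 in 4 bits; in 8 bits it becomes 0b11111010
```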


Shifting

Shifting a word right or left (which is equivalent to multiplying or dividing by a power of 2) is used in multiplication and division and also to align data on byte or word boundaries. It can be implemented in several ways. The least expensive is with a shift register.

Basically a chain of D flip-flops is arranged with gates that connect their inputs to an external source or to each other.

We can make this bidirectional by adding a third input that takes its data from the flip-flop on the other side.

The drawback of this approach is that a shift of n places takes n clock cycles. We can instead use a logic circuit that selects any of the 32 input bits to go to a given output bit, with all of the other output bits receiving a parallel selection.

For example, a shift of three to the right would connect input bit i+3 to output bit i at each position.

One bit of a 32-bit shifter of this type consists of 32 AND gates feeding a 32-bit OR gate, and every output bit has this same arrangement. What is not as obvious from the logic diagram is that this results in 32 x 32 = 1024 wires leading from the inputs to the AND gates. An actual implementation would use several stages of gates to avoid the 32-input OR gate and the requirement for the inputs to drive 32 AND gates at once. This arrangement is clearly more expensive than a shift register, but it can compute any shift in just one cycle (or even less, since it is not a clocked circuit). It is called a barrel shifter. An alternative to supporting every shift distance, used for shifters that do data alignment (for example, loading a specific byte into the low-order byte of a register), is to provide just the few shift distances required by the particular application.
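The selection pattern of the barrel shifter can be sketched in Python: each output bit independently picks one input bit, exactly as the AND/OR multiplexer array does. The function name and loop are illustrative; hardware computes all 32 output bits in parallel rather than in a loop.

```python
def barrel_shift_right(word, dist, width=32):
    """Combinational right shift with zero fill:
    output bit i selects input bit i + dist."""
    out = 0
    for i in range(width):       # one OR-gate's worth of selection per output bit
        src = i + dist
        if src < width:          # bits shifted past the end become 0
            out |= ((word >> src) & 1) << i
    return out
```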

Carry Lookahead Addition

The book does a nice job of showing how an adder/subtractor/AND/OR/less-than unit can be built, so that won't be repeated here. Of particular interest is how zero and overflow are detected in Figure 4.17.

We'll just briefly go over the operation of the carry lookahead since we've discussed the concept before. The basic idea behind its implementation is that some clever algebra has been used to simplify the logic functions by factoring out some common terms in each of the carry expressions.

These terms depend only on the bits being input at a given position -- not on any of the other positions. They are called:

generate = A AND B (because if A and B are both 1, then a carry is generated regardless of the carry in to that position)

propagate = A OR B (because if either is 1, then a carry in will be propagated to the next stage; if both are 1, the carry in doesn't matter -- it is an inclusive propagate)

So we have

c1 = g0 + (p0 c0)  that is, we get a carry if we have either a generate, or a propagate with a carry in

c2 = g1 + (p1 c1) = g1 + (p1 g0) + (p1 p0 c0)

c3 = g2 + (p2 c2) = g2 + (p2 g1) + (p2 p1 g0) + (p2 p1 p0 c0)

c4 = g3 + (p3 c3) = g3 + (p3 g2) + (p3 p2 g1) + (p3 p2 p1 g0) + (p3 p2 p1 p0 c0)

As you can see, we can build each of the carry computations from a circuit with just two stages of gates, but the circuit for the n-th carry requires an (n+1)-input OR gate and n AND gates whose input counts range from 2 to n+1.
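The carry equations above can be checked with a short Python sketch. Here the recurrence c_{i+1} = g_i + p_i c_i is written as a loop for clarity; hardware instead flattens each carry into the two-level AND/OR form shown above. The function name is illustrative.

```python
def lookahead_carries(a, b, c0, n=4):
    """Compute carries c0..cn from per-bit generate and propagate signals."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]    # generate = A AND B
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]  # propagate = A OR B
    c = [c0]
    for i in range(n):
        c.append(g[i] | (p[i] & c[i]))                 # c_{i+1} = g_i + p_i c_i
    return c
```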

For an 8-bit adder this works OK, but for 32 bits the fan-in is too high. So a hierarchical scheme is used instead -- it will usually consist of 4- or 8-bit carry lookahead adders that feed another level of lookahead circuitry, which produces the carry in to each of the blocks at the same time.


Multiplication

The basic idea behind multiplication is that we shift (leftwards) and add the multiplicand to a result register repeatedly. At each shift position we look at a corresponding digit of the multiplier and if it is 0 we skip the addition and if it is 1 we carry out the addition. The multiplier is thus placed in a register that shifts right, with the least significant bit being used to control the adder. Thus the multiplicand and the result are stored in double-length registers (the multiplicand to enable the shifting, and the result to accommodate the fact that it naturally grows to double the length as a result of multiplication).
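The shift-and-add process can be sketched in Python. The loop index plays the role of the shift position; in hardware, the multiplicand register would shift left while the multiplier register shifts right. The function name is illustrative.

```python
def shift_add_multiply(multiplicand, multiplier, bits=8):
    """Shift-and-add multiplication: examine the multiplier one bit at a
    time, adding the left-shifted multiplicand whenever the bit is 1."""
    product = 0
    for i in range(bits):
        if (multiplier >> i) & 1:         # low-order bit of the shifting multiplier
            product += multiplicand << i  # multiplicand shifted i places left
    return product
```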

More clever schemes recognize that as the data shifts out of the multiplier, space is left in its register that we can take advantage of to avoid having to use a double-length  register for the result. So some of the bits recirculate to fill the space being vacated by the departing bits of the multiplier as they shift across the register.

Booth's algorithm is another approach that tries to speed up multiplication by taking advantage of the patterns of bits in the operands. The algorithm notes that if the multiplier contains a string of ones, for example

0000111111110000

then the partial sum that results from this string is equal to the multiplicand times a 1 in the next higher bit, minus the multiplicand times the 1 in the lowest order bit of the string.

That is, 0000111111110000 = 0001000000000000 - 0000000000010000.

So if a number contains long strings of 1 bits, we can avoid all of the intermediate shifts and adds and replace them with one add and one subtract. The problem is that this algorithm is actually slower for patterns of isolated ones like

0101010101010101

where an add and a subtract would be done for each 1 in the multiplier.
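Booth's recoding can be sketched by scanning adjacent multiplier bits: the start of a string of 1s triggers a subtract and the position just past its end triggers an add. This minimal version assumes a nonnegative multiplier that fits in the given width; the function name is illustrative.

```python
def booth_multiply(multiplicand, multiplier, bits=16):
    """Booth's algorithm: each 0-to-1 boundary in the multiplier (scanning
    upward from the low-order bit) subtracts the shifted multiplicand, and
    each 1-to-0 boundary adds it."""
    product = 0
    prev = 0
    for i in range(bits):
        cur = (multiplier >> i) & 1
        if cur == 1 and prev == 0:       # start of a string of 1s: subtract
            product -= multiplicand << i
        elif cur == 0 and prev == 1:     # just past the end of a string: add
            product += multiplicand << i
        prev = cur
    return product
```

For the multiplier 0000111111110000 this performs exactly one subtract (at bit 4) and one add (at bit 12), replacing eight shift-and-add steps.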

We can build a really fast multiplier by creating a chain of adders that each take the multiplicand as one input (controlled by the appropriate bit of the multiplier) and take the output of the preceding stage as the other input.

This scheme produces a result in the time that it takes the data to propagate through all of the adder stages. It has as many adders as there are bits in the multiplier, and the adders are as wide as the number of bits in the multiplicand. Thus, for N-bit words, on the order of N-squared gates are required for a ripple carry based circuit, and a CLA based circuit is even more expensive.

One point to note about this approach is that if we put a register between each pair of adders and clock the data through the registers, then we can start a new multiply operation in this circuit on each clock cycle. After N clock ticks the first result emerges, and subsequent results emerge on each additional tick. Such an approach is called a pipelined multiplier and is useful in machines that are doing vector operations.

Wallace proposed an alternative scheme that uses fewer gates. He recognized that we don't have to complete all of the carry operations at each stage before passing a value on to the next stage. As long as the carries get added in before the final result is produced, we will have a valid sum of all of the partial products.

Thus, in Wallace's scheme, all of the partial products are generated in parallel by gating N copies of the N-bit multiplicand by the bits of the multiplier into the inputs of a tree of "pseudo-adders". (Much as in the diagram above, except that the pipelined stages are replaced by the adder tree.)

Each pseudo-adder is a bank of full adder circuits. But instead of propagating the carry output to an adjacent bit within the pseudo-adder, the carry is simply another output number:

A pseudo-adder block can be treated as a 3-input 2-output combinational circuit that produces two sums from three inputs. If we add the two sums together, then we get the actual binary sum of the three inputs.
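A 3:2 pseudo-adder (also called a carry-save adder) can be sketched directly from the full-adder equations; the point is that the carry word is produced as a separate output instead of rippling sideways. The function name is illustrative.

```python
def pseudo_add(x, y, z):
    """3:2 pseudo-adder: returns (sum_word, carry_word) such that
    sum_word + carry_word == x + y + z, with no carry propagation
    between bit positions inside the block."""
    s = x ^ y ^ z                              # per-bit sum, ignoring carries
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit carries, moved up one place
    return s, c
```

Adding the two outputs with an ordinary adder recovers the true sum, which is exactly what the carry lookahead adder at the root of the Wallace tree does.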

Wallace showed how to construct a tree of these adders. The following diagram shows how 20 partial products could be summed together with pseudo-adders. At the final stage, a carry lookahead adder is used to sum the last carry and sum output pair.

Wallace tree adders can be constructed from larger adder blocks that take more inputs and produce more outputs -- all that is required of the outputs of a pseudo adder is that if you sum them up you get the sum of the inputs. Thus, for example, one might find a convenient Boolean function that takes 15 inputs and generates 4 outputs for which the sum is the sum of the 15 inputs.


Division

The same basic approach can be used for division. Start with the dividend and divisor aligned. Shift the dividend left and subtract the divisor. If the result is positive, then the quotient gets a 1 and the remainder replaces the dividend. If the result is negative, then the quotient gets a 0 and the dividend is restored except that it is still shifted by one.
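The restoring shift-and-subtract scheme can be sketched in Python for nonnegative integers; the "restore" is implicit here because the trial subtraction is only committed when it would not go negative. The function name is illustrative.

```python
def restoring_divide(dividend, divisor, bits=8):
    """Restoring division: shift in one dividend bit per step and
    attempt to subtract the divisor from the partial remainder."""
    remainder = 0
    quotient = 0
    for i in range(bits - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # shift in next bit
        if remainder >= divisor:      # trial subtraction succeeds: quotient bit 1
            remainder -= divisor
            quotient = (quotient << 1) | 1
        else:                         # would go negative: restore, quotient bit 0
            quotient <<= 1
    return quotient, remainder
```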

Given a fast multiplier, however, there are faster methods that can be applied based on a Newton-Raphson iterative solver. Flynn showed that a first-order Newton-Raphson formula can start with a 4-bit approximation and in just 3 iterations achieve 32-bit precision. Each of the iterations takes 2 multiplies and some shifts and complements.

The initial value can be obtained with a table lookup that uses the high order bits of the divisor as the address and produces the 4-bit approximation to the quotient. Thus, division of this form takes only about 6 times as long as multiplication. If the multiplier is like the Wallace Tree, it may be several times faster than the shift-and-subtract approach.
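The Newton-Raphson iteration for the reciprocal, x = x(2 - dx), can be sketched in Python. This assumes the divisor has been scaled into [0.5, 1.0), and uses a standard linear formula as a stand-in for the table lookup that produces the initial approximation; each iteration costs the two multiplies mentioned above and roughly doubles the number of correct bits.

```python
def reciprocal(d, iterations=3):
    """Newton-Raphson reciprocal of d, for d scaled into [0.5, 1.0)."""
    x = 48/17 - (32/17) * d        # initial approximation (stand-in for a table lookup)
    for _ in range(iterations):
        x = x * (2.0 - d * x)      # two multiplies per iteration; error is squared
    return x
```

Dividing n by d then costs one more multiply: n * reciprocal(d).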

Elementary Functions

Hardware for computing sines, cosines, and other elementary functions is actually included in many floating point units. One popular method that involves only simple hardware (and is used, for example, in most calculators) is the CORDIC (COordinate Rotation DIgital Computer) approach developed by Volder.

He observed that if one rotates a vector by a series of positive or negative angles of decreasing magnitude, the vector eventually converges on a desired angle. Further, if the angles are selected appropriately, and at each stage either a positive or negative rotation is applied (whichever gives the closer result), then the length of the resulting vector is a constant. Thus, if the final X and Y values are multiplied by an appropriate pair of constants to scale the vector to unit length, then, as we know from trigonometry, the X and Y values of the unit vector at that angle are the cosine and sine.

The CORDIC method uses a table of constants. For each iteration it compares the current angle to the desired angle, and either adds or subtracts that iteration's constants to or from the angle, X, and Y values. The precision of the result increases by one digit per iteration. At the end of this series of add/subtract operations, one multiply is required to produce the desired function.
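The rotation-mode iteration can be sketched in Python. Per step it uses only shifts (multiplication by 2^-i), adds/subtracts, and a small table of arctangents, with one scaling multiply per output at the end; the function name is illustrative and the sketch assumes |angle| < pi/2.

```python
import math

def cordic_sin_cos(angle, iterations=24):
    """Rotation-mode CORDIC: returns (sin, cos) of angle."""
    atan_table = [math.atan(2.0 ** -i) for i in range(iterations)]
    # length-correction constant: product of 1/sqrt(1 + 2^-2i)
    k = 1.0
    for i in range(iterations):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0   # rotate toward the remaining angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i  # shift-and-add rotation
        z -= d * atan_table[i]
    return y * k, x * k                 # final scaling multiplies
```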

Floating Point

As you probably recall from assembly language programming, the basic concept of floating point representation is that a portion of a word represents a mantissa, a part represents an exponent, and one bit holds the sign. The sign of the exponent is usually subsumed by a bias value that must be subtracted from the stored value to obtain the correct exponent value. The advantage of the bias is that it makes the stored exponents all positive, which saves an explicit bit for the exponent sign and also means that the arithmetic (such as comparisons) does not have to treat positive and negative exponents differently. However, after adding two exponents (as in multiplication), the extra bias amount must be subtracted. Biased exponent representations also have the advantage that zero is represented by a word of all zeros.

The mantissa for a floating point number is usually normalized -- that is, it is shifted left until the first one bit is in the high order position. The exponent is adjusted by a corresponding amount to compensate for the shifting. Because a normalized representation always has a one in the leading position, we need not represent it explicitly. Thus, one more bit can be used to extend the precision of the mantissa.

Floating Point Operations

Addition involves aligning the mantissas of the two numbers (determining the difference between their exponents and then shifting the smaller of the two numbers right by that many positions, with the implicit one made explicit), then adding the mantissas, and, if there is an overflow, renormalizing by shifting the result one place to the right.
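The align-add-renormalize sequence can be sketched with a toy representation in which a value is a (mantissa, exponent) pair, the value is mantissa * 2^exponent, and a normalized mantissa lies in [2^23, 2^24) with the implicit bit made explicit. The function name and representation are illustrative, not IEEE bit layouts.

```python
def fp_add(a, b):
    """Toy floating point addition on (mantissa, exponent) pairs."""
    (m1, e1), (m2, e2) = a, b
    if e1 < e2:                  # make the first operand the larger-exponent one
        (m1, e1), (m2, e2) = (m2, e2), (m1, e1)
    m2 >>= (e1 - e2)             # align: shift the smaller operand right (bits lost)
    m, e = m1 + m2, e1
    if m >= 1 << 24:             # mantissa overflow: renormalize one place right
        m >>= 1
        e += 1
    return m, e

one = (1 << 23, -23)             # 1.0
one_and_half = (3 << 22, -23)    # 1.5
```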

Subtraction is the same as addition except that one mantissa is subtracted from the other after alignment, and the renormalization may involve an arbitrary left shift of the result.

Multiplication does not require the mantissas to be aligned. They are multiplied, the exponents are added, and the result is renormalized if necessary (a shift of one position). Of course, the result is double the size of the original values, so it must be shortened either through truncation or rounding. (The other operations require this as well whenever a value is renormalized and bits of precision are lost.)

Division similarly involves no prealignment of the mantissas. When the division is performed, one exponent is subtracted from the other. Unlike multiplication, the mantissa result of the division can require arbitrary shifting to be renormalized.

IEEE Floating Point Format

The majority of computer systems today use the IEEE 754 standard, first adopted in 1985. The adoption of this standard was a rare show of apparent cooperation among computer manufacturers. However, it was partly a matter of pragmatics: handling floating point properly is complicated, and adopting the standard saves both design cost and potentially embarrassing errors. It was partly competition, since it became a positive marketing point to be able to claim IEEE compatibility. It was partly the effect that the microprocessor generation of manufacturers had a clean slate to start from -- most of them did not have existing mainframe-based formats, and those that did were the ones, like DEC and IBM, who entered the microprocessor business late. And it was partly that the manufacturers uniformly underestimated the difficulty of implementing the IEEE standard, so they publicly jumped onto the bandwagon before they knew how much it would really cost.

Because the standard is well designed, its widespread adoption has been a very positive experience for the software community. However, even 15 years after it was introduced, there was very little language support to take advantage of the more sophisticated features of the standard, and some of the features are either unimplemented or very costly to employ on some machines.

The standard defines four formats for floating point numbers, of which two are commonly used (single and double). In single-precision numbers there is a one-bit sign, 23 stored bits in the mantissa (an implicit 24th bit, the leading 1 of every normalized mantissa, is not stored), and an 8-bit exponent that ranges from -126 to 127 with a bias of 127. In double-precision numbers there is the one-bit sign, 52 stored mantissa bits (53 bits of precision with the implicit leading 1), and an 11-bit exponent that ranges from -1022 to 1023 with a bias of 1023.

The bias is a number that must be subtracted from the stored value of the exponent to get its value for purposes of computation. Thus, the minimum value that can appear in an exponent field of a proper single-precision floating point number is 1 and the maximum value is 254. This leaves 0 and 255 for representing special cases. In double-precision, the exponent can be 1 through 2046, with 0 and 2047 reserved for special cases.
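The field layout can be inspected directly by reinterpreting a float's bits, using Python's standard struct module (Python floats are doubles, so packing with ">f" first rounds to single precision):

```python
import struct

def decode_single(f):
    """Unpack an IEEE 754 single-precision value into its sign,
    stored (biased) exponent, and stored 23-bit mantissa fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", f))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # biased by 127
    mantissa = bits & 0x7FFFFF       # implicit leading 1 is not stored
    return sign, exponent, mantissa

# 1.0 = +1.0 x 2^0: stored exponent 0 + 127 = 127, stored mantissa 0
# -6.0 = -1.5 x 2^2: sign 1, stored exponent 129, mantissa 0x400000
```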

The other two formats are "extended" versions of the two common representations. They assume that a machine will devote an entire word (32 or 64 bits) to the mantissa and at least 11 (single) or 15 (double) bits to the exponent field. The extended formats do not have defined biases.

Special Case Numbers

When the exponent of an IEEE FP number is all ones (255 in single precision, 2047 in double) and the mantissa is all zeroes, the number is explicitly equal to infinity (+ or -, depending on the sign bit). The standard requires arithmetic such as division by 0 to produce infinity as a result and, inversely, division by infinity to produce 0. Overflow can also produce an infinite result.

If the exponent is all ones and the mantissa is nonzero, then the value is called "Not a Number" or NaN. The NaN values can be used to represent results such as the square root of a negative. Arithmetic on NaN values should also produce NaN values as output so that if an improper floating point value enters a string of calculations, it cannot result in a valid floating point result.

When the exponent field is zero and the mantissa is nonzero, the mantissa represents an unnormalized or "subnormal" number. These are numbers that are smaller in magnitude than 1.0 x 2^-126 (in single precision). That is, the implicit leading 1 disappears and the number of significant bits in the mantissa can decrease to fewer than 24. These numbers are useful when we consider that any two distinct numbers with exponents of -126 differ by an amount that is smaller than we can otherwise represent. So if we try to compare the numbers by subtracting them, we get a subnormal number. In some systems, subnormal numbers are not allowed and the difference would simply be zero. Thus, it would be possible to have two very small but different numbers that appear to be equal when we compare them.

While everyone agrees that having subnormal numbers enhances the accuracy of calculations and reduces programming effort, hardware designers find them to be a very difficult part of the standard to implement. Subnormals are a special case that occurs relatively rarely, yet they require that all computations check for the special combination of a zero exponent and a nonzero mantissa and then turn off the usual prealignment and renormalization hardware appropriately. Thus, implementing subnormals can both increase the cost of a floating point unit and result in slower computation overall.

In some processors, for example the PowerPC, the programmer can disable the use of subnormal numbers. In the PowerPC, when subnormals are disabled, a subnormal result generates a normalized value and an exception is raised so that the software can handle the underflow. Of course, the exception can be ignored if speed is more desirable than accuracy. When subnormals are enabled, no underflow exception is raised and the FPU takes care of all of the special handling.


Rounding

The IEEE standard has four different rounding modes. The usual mode is to round to the nearest value, with a number that falls midway between two others being rounded to the nearest value with an even (zero) low-order digit. The other modes are round toward zero, round toward plus infinity, and round toward minus infinity.

Rounding is based on two additional bits of precision that are kept in intermediate stages of a floating point calculation. These are the guard and round bits. To see why two bits are needed, consider that adding two numbers can require an extra digit in order to know how to round them (as when one value is shifted by one position with respect to the other) and that the result of such an addition can also produce a carry out of the high order place. Thus, we need to be able to store a result that is two bits longer than normal.

One problem with the "round to even" mode is that it complicates the rounding process. For example, the result of a multiply might end with a long string of zeros and then a one. Without the one, the result would be rounded down to the nearest even value. But with the one, the result does not fall half way between the two rounded possibilities -- it actually falls closer to the odd number just above. Thus, the FPU has to keep track of whether there are any one bits beyond the guard and round -- it does this with a bit called the "sticky bit" because it is often implemented by watching the low order bits as they are shifted out to the right during normalization, and sticking to a one if any one bits are seen.
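The guard-and-sticky logic can be sketched on an integer mantissa: the first dropped bit plays the role of the guard bit and the OR of all remaining dropped bits is the sticky bit. The function name is illustrative, and the sketch assumes at least one bit is being dropped.

```python
def round_nearest_even(value, drop_bits):
    """Round-to-nearest-even: drop drop_bits low-order bits of an
    integer mantissa, rounding ties to the even result."""
    kept = value >> drop_bits
    guard = (value >> (drop_bits - 1)) & 1                 # first dropped bit
    sticky = (value & ((1 << (drop_bits - 1)) - 1)) != 0   # OR of the rest
    if guard and (sticky or (kept & 1)):  # above halfway, or a tie with odd result
        kept += 1
    return kept

# 0b10110 / 4 = 5.5: a tie, so it rounds to the even value 6
# 0b10010 / 4 = 4.5: a tie, so it rounds to the even value 4
# 0b10011 / 4 = 4.75: the sticky bit shows it is above halfway, so it rounds to 5
```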

© Copyright 1995, 1996, 2001 Charles C. Weems Jr. All rights reserved.
