Discussion 18

 

Branch Prediction

 

We’ve now seen the impact that branches can have on pipeline efficiency. We also noted that we can choose prediction policies that alter the default. For example, if an application has 70% of its branches taken, we can reduce the misprediction rate from 70% to 30% by always assigning the branch target address to the PC instead of incrementing it to point to the next instruction.
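The arithmetic behind this tradeoff can be sketched in a few lines of Python. The 70%-taken outcome stream here is a made-up illustration, not data from the text:

```python
# Sketch: misprediction rates for the two fixed policies, assuming a
# hypothetical outcome stream in which 70% of branches are taken.
outcomes = ["T"] * 7 + ["N"] * 3  # 70% taken, 30% not taken

def misprediction_rate(outcomes, prediction):
    """Fraction of branches whose outcome differs from a fixed prediction."""
    misses = sum(1 for o in outcomes if o != prediction)
    return misses / len(outcomes)

print(misprediction_rate(outcomes, "N"))  # predict not-taken: 0.7
print(misprediction_rate(outcomes, "T"))  # predict taken:     0.3
```

Flipping the default policy simply swaps which fraction of branches pays the misprediction penalty.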

 

We can be even more sophisticated and assign a policy to each branch individually. That is, we could have the compiler identify branches that are likely taken or not-taken, and tag them with a bit in the opcode. Then, when we fetch a branch instruction that is likely to be taken, we put its target in the PC, and when we fetch one that is probably not taken, we increment the PC.

 

Some branches vary their behavior in ways that are predictable. For example, a loop-control branch for a counting loop is taken on every iteration except the last. We could use a special counter register to hold the remaining iteration count, and as long as it isn’t 0, we predict the branch is taken. When it reaches 0, we predict not-taken. In this way we can completely avoid mispredictions. Such a scheme is used in the PowerPC architecture. However, it has limited applicability because the C family of languages doesn’t have a pure counting loop. The termination condition of a C for loop can be anything, and the compiler would need to do extensive analysis to ensure that it is safe to treat the loop as purely counting. Thus, the PowerPC register is used mainly with Fortran programs, and with the Algol, Pascal, Modula, Ada family of languages.
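The count-register idea can be modeled in software. This is a sketch in the spirit of the PowerPC scheme described above, not its actual hardware interface; the function name and loop model are illustrative:

```python
# Model of a count-register predictor for a pure counting loop: the
# hardware decrements the counter and predicts taken while it is nonzero.
def predict_counting_loop(trip_count):
    """Yield (prediction, outcome) pairs for a loop-closing branch that is
    taken on every iteration except the last."""
    count = trip_count                  # loaded into the special counter
    for i in range(trip_count):
        count -= 1                      # decremented on each execution
        prediction = count != 0         # predict taken while nonzero
        outcome = i < trip_count - 1    # branch is taken except on exit
        yield prediction, outcome

# Every prediction matches the actual outcome: zero mispredictions.
print(all(p == o for p, o in predict_counting_loop(10)))  # → True
```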

 

All of these prediction schemes are referred to as “static,” because they are set prior to run-time. Even though the prediction can change in the last case, it is still predetermined. Static predictions work well for branches with consistent behaviors, but many branches have complex or erratic patterns of behavior. We need predictors that can learn these behaviors and use the gathered information to predict succeeding branch outcomes. Considerable effort has been spent in trying to identify behavior patterns within the compiler, but programming language semantics are too low-level to capture the relationships that govern branches in a large application. Some branches, such as loop controls, array bounds checks, and so on are easy to identify and to predict. Others are very difficult, as they may depend on values that are passed from other procedures via a long call chain. Some are impossible to predict from the code, because they depend completely on the input data (such as a range check on an input value).

 

To enhance our prediction capability, we need to examine the behavior of branches at run-time, and learn from this observation how to predict them. Predictors that base their decisions on learned behaviors are called “dynamic.”

 

Dynamic Predictors

 

The simplest dynamic predictor would be a single bit that remembers whether a branch was first taken or not. We could envision associating a bit with every word of the instruction cache, for example, that indicates (if the associated word is a branch) the direction the branch took when it was first executed. While this sounds simple, it is in fact neither especially effective nor efficient. We would actually need another bit to remember whether the instruction has already been marked, to prevent us from continually changing its value. In addition, branches sometimes take one direction during initialization of a routine, and then go the opposite way for all subsequent executions. For such a branch, this scheme would mispredict essentially 100% of the time.

 

In truth, the simplest dynamic predictor just keeps track of the last outcome of the branch. This is called a one-bit predictor. Typically, if it is 1, then the branch was taken the last time it was executed, and a 0 indicates the branch was not taken. The prediction is that the branch will do the same thing it did last time. So if its bit is set to 1, we predict it will again be taken.

 

This predictor has some serious limitations. Consider the loop-closing conditional branch of the inner loop in a set of nested loops. Most of the time the branch is taken back to the top of the loop. But when the loop exits, the branch is mispredicted (predicted taken, and the bit is then updated to not-taken). Then, when the outer loop(s) cause the inner loop to be re-entered, it is mispredicted again (because the not-taken exit was the last recorded outcome).
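A short simulation makes the cost concrete. The trace below (an inner loop that runs 4 iterations per outer pass, for 5 outer passes) is a made-up example, not from the text:

```python
# Sketch: counting mispredictions of a one-bit predictor, which simply
# remembers the last outcome of the branch.
def one_bit_mispredictions(outcomes, initial=True):
    bit = initial           # True = last outcome was taken
    misses = 0
    for taken in outcomes:
        if bit != taken:
            misses += 1
        bit = taken         # remember only the most recent outcome
    return misses

# Inner-loop branch: taken 3 times, then not taken on exit,
# repeated for 5 outer-loop iterations.
trace = ([True] * 3 + [False]) * 5
print(one_bit_mispredictions(trace))  # → 9
```

Of the 20 branch executions, only 4 actually change direction at loop exit, yet the predictor misses twice per outer pass after the first: once on the exit and once again on re-entry.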

 

What we would like to remember is the dominant behavior of a branch. To do that, we need more than a single bit. It turns out that even adding one more bit is a significant improvement. Two bits can take the form of a saturating up-down counter, which counts up to 11 and remains there for subsequent up-counts, or counts down to 00 and remains at 00 for additional down-counts. If we assign the values 11 and 10 to mean “predict taken,” then 00 and 01 mean “predict not taken.” If a branch is taken, we increment its counter, and we decrement it if it’s not taken. Let’s look at an example. The following table traces the outcomes of a branch, showing its predictions, their correctness, and the state of the counter.

 

Outcome   Counter Before   Prediction   Correctness
   T            00              N        Incorrect
   T            01              N        Incorrect
   T            10              T        Correct
   T            11              T        Correct
   T            11              T        Correct
   N            11              T        Incorrect
   T            10              T        Correct
   T            11              T        Correct
   T            11              T        Correct
   T            11              T        Correct
   N            11              T        Incorrect
   T            10              T        Correct
   N            11              T        Incorrect
   N            10              T        Incorrect
   N            01              N        Correct

 

We can see that this predictor avoids the problem of mispredicting twice for a single transient variation in behavior. When behavior does shift, as at the end of this trace, the predictor produces two mispredictions before switching to the new behavior. It is possible to construct a degenerate code sequence for which this predictor is always wrong: the branch is taken twice and then alternates between taken and not taken. However, that behavior is far less probable than the ones the predictor captures.
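The counter is easy to model. This sketch replays the trace from the table above and then the degenerate always-wrong pattern:

```python
# Sketch of the two-bit saturating counter: states 0..3 stand for 00..11,
# states 2 and 3 (10, 11) predict taken.
def two_bit_run(outcomes, counter=0):
    """Return (predictions, correctness) for a 2-bit saturating counter."""
    predictions, correct = [], []
    for taken in outcomes:
        pred = counter >= 2
        predictions.append(pred)
        correct.append(pred == taken)
        # Saturating update: count up on taken, down on not taken.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return predictions, correct

T, N = True, False
trace = [T, T, T, T, T, N, T, T, T, T, N, T, N, N, N]
_, ok = two_bit_run(trace)
print(ok.count(False))  # → 6 mispredictions, matching the table

# Degenerate pattern: taken twice, then strictly alternating -> always wrong.
degenerate = [T, T] + [N, T] * 4
_, ok = two_bit_run(degenerate)
print(all(not c for c in ok))  # → True
```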

 

Researchers have explored using more bits in the counter, but have found that the improvement in avoiding misprediction due to transient variations in behavior is typically countered by the increased number of mispredictions that occur when there is a genuine shift in behavior. Two bits seems to be “about right.”

 

Branch History Table

 

In the discussion so far, we’ve assumed that the prediction counters are stored in the cache beside each word. Such an implementation is conceptually simple but wasteful of resources. Conditional branches appear about every 4 to 6 instructions. Thus, only 1/4 to 1/6 of our two-bit counters are being used. The rest sit idle. Most systems group 4 or 8 words together in cache to form a “line.” So, some systems attach the counters on a line-by-line basis to better match the ratio of counters to branches. But what happens when a line contains two or more branches? Then they share the counter. Such sharing is called “aliasing.”

 

Aliasing comes in two flavors: constructive and destructive. In constructive aliasing, the branches have similar behaviors, and so they reinforce each other’s predictions. The counter stays saturated at one end or the other. Even when one branch is suffering more transient behavior, the other branch tends to steady the predictions by remaining consistent (of course, if both branches start acting up, then predictions become less accurate). In destructive aliasing, the branches have opposing behaviors and cause constant misprediction. Destructive aliasing is more common than constructive aliasing, but with per-line counters it is rare for branches to be aliased at all, so occasional destructive aliasing is the penalty we pay for lower hardware cost.

 

We can extend this tradeoff of aliasing for space even further by removing the counters from the cache entirely and placing them in a separate table, called a Branch History Table (BHT), that contains fewer entries than the number of lines in the cache. Because the BHT is smaller, we need a way to map instructions into it. For example, we can use the low-order address bits of a branch as an index into the table. That way, as we generate the address to fetch the branch from the cache, we can use the same bits to access the BHT and look up the current prediction. Of course, the smaller we make the BHT, the more likely it is that branches will be aliased. But, for high-performance programs (such as scientific Fortran code, where branches may be sparse), it turns out that we can often get away with a smaller BHT. Of course, smaller also means faster, and we must be able to access the prediction quickly in order for it to be useful. (A late prediction is just old news.)
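The low-order-bits indexing scheme can be sketched as follows. The table size and addresses are illustrative assumptions, not values from the text:

```python
# Sketch: indexing a Branch History Table (BHT) with the low-order bits
# of a branch's address.
BHT_BITS = 10                       # a hypothetical 1024-entry BHT
BHT_SIZE = 1 << BHT_BITS
bht = [0] * BHT_SIZE                # each entry holds a 2-bit counter (0..3)

def bht_index(branch_addr):
    """Select an entry using the low-order bits of the word address."""
    return (branch_addr >> 2) & (BHT_SIZE - 1)  # drop the byte-offset bits

def predict_taken(branch_addr):
    return bht[bht_index(branch_addr)] >= 2     # states 10 and 11 mean taken

# Two branches 1024 words apart alias to the same counter:
print(bht_index(0x400000) == bht_index(0x401000))  # → True
```

The index is available as soon as the branch’s address is, which is what makes the lookup fast enough to be useful.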

 

Path-Based Prediction

 

Branches that have complex patterns of behavior are sometimes more predictable if we know the context under which they are operating. For example, if one branch sets a flag, and another branch later tests the flag, we could perfectly predict the second branch by observing the first one’s outcome. Extending this idea, we find branches that exhibit a mix of behaviors that depend on what happened prior to their execution. In fact, the path through which the CPU arrives at a given branch can often be used to decompose a complex behavior into a set of simple and easily predicted behaviors. We call designs that use this information path-based predictors. The encoding of the path is known as “global history,” in contrast to the “local history” that a two-bit counter records about a single branch.

 

All kinds of complex schemes may leap to mind for encoding the path leading to a given branch. But it must be simple and inexpensive to implement in hardware. So we use a bit-vector containing the outcomes of recent branches. These are kept in a shift register called the Global History Register (GHR): each time a branch executes, its outcome (1 for taken, 0 for not) is shifted into one end, and the outcome of an earlier branch shifts out the other end. The GHR may be as small as 8 bits or as large as 16 bits. Its size is determined by the number of entries in the BHT, because the GHR is used as an index into the BHT. Thus, when a branch occurs in this scheme, instead of using its address (which would map all of its behaviors to a single counter), we use the GHR. That way, each of its behaviors gets a different counter.
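The GHR update is just a masked shift. A 12-bit register is assumed here for illustration:

```python
# Sketch: updating a Global History Register after each branch outcome.
GHR_BITS = 12

def ghr_update(ghr, taken):
    """Shift the newest outcome in at the low end; the oldest bit falls off."""
    return ((ghr << 1) | (1 if taken else 0)) & ((1 << GHR_BITS) - 1)

ghr = 0
for outcome in [True, True, False, True]:   # taken, taken, not, taken
    ghr = ghr_update(ghr, outcome)
print(format(ghr, "012b"))  # → 000000001101
```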

 

It might seem that this would result in each branch using every counter in the BHT. But many branches are reached through just a few paths. So, for a given branch, only a few patterns may appear in the GHR, and it would use only as many counters as it needs. However, it is true that branches tend to cluster in the space of GHR patterns. So to avoid hot-spots in the BHT that result in unnecessary aliasing, we add a hash function to the indexing. Again, we could look at complex, optimal hashes. But in hardware we want simple and cheap solutions, so we merely XOR the GHR with the lower address bits of the branch, and it turns out that this is cost effective.
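The XOR hash itself is a one-line computation. The index width and the example address below are illustrative:

```python
# Sketch of the gshare-style index: XOR the GHR with the branch's
# low-order address bits to spread patterns across the BHT.
INDEX_BITS = 12
MASK = (1 << INDEX_BITS) - 1

def gshare_index(branch_addr, ghr):
    """Hash the global history with the (word-aligned) branch address."""
    return ((branch_addr >> 2) ^ ghr) & MASK

# The same branch reached along two different paths (different GHR
# contents) maps to two different counters:
a = gshare_index(0x40001C, 0b101010101010)
b = gshare_index(0x40001C, 0b010101010101)
print(a != b)  # → True
```

Because distinct GHR values XORed with the same address always yield distinct indices, the hash separates a branch’s behaviors while still mixing in the address to break up hot-spots.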

 

This arrangement of a GHR with a BHT is called a “two-level branch predictor.” People have proposed more complex schemes, but this one is inexpensive and works well. The version that uses the XOR hash is known as a G-share predictor (the name it was given when first published), which stands for “global history register, shared BHT.” Other designs have proposed a separate BHT for each pattern in the GHR, which completely eliminates aliasing, but is too costly to implement.


© Copyright 1995, 1996, 2001 Charles C. Weems Jr. All rights reserved.

