Memory Hierarchy and Direct-Mapped Caching
There are many different technologies for storing data. In the old days, cathode ray tubes, acoustical delay lines, magnetic drums, magnetic cores, punched cards, and open-reel magnetic tape were popular. Today, most of these technologies have been replaced by semiconductor memory, magnetic disk, optical disk, and helical-scan magnetic tape cartridges.
Each of these technologies has a place in the market because it covers a range of access times with inversely proportional cost per unit, which in turn is proportional to density. CD-ROM filled a gap in access time between disk and tape that had emerged as disks got faster and tapes did not. The price of CD-ROM is basically determined by a market analysis that places it between disk and tape -- it is actually much cheaper to produce than the retail price would indicate. DVD-ROM is the next generation of this technology, extending the storage capacity by an order of magnitude at a cost only slightly above that of the CD. The slightly lower cost of the CD maintains a place in the market for uses that do not require the extra space and that retain backwards compatibility with a large installed base of readers.
There appears to be a wide gap between the main and secondary memory, but we are used to the fact that this gap exists, and both systems and languages are structured to work around it. That is, it is part of the programming model.
The memory hierarchy is viable because programs tend to exhibit a property known as locality. There are three basic forms of locality:
Temporal locality: Recently accessed items tend to be accessed again in the near future.
Spatial locality: Accesses are clustered in the address space.
Sequential locality: Instructions tend to be accessed in sequential memory locations.
Temporal locality tends to hold under all conditions for code (even parallel or recursive codes) although it may break down in a rule-based system. For data, it is most effective in array-based applications, and with scalar values such as constants that are used widely in the code.
The presence of spatial locality implies some temporal locality (i.e., if all accesses are within a small block of memory, then it is probable that many of its locations are being accessed repeatedly). However, temporal locality does not imply spatial locality (e.g., a linked-list program might repeatedly run through a short list whose elements are scattered).
Sequential locality holds for instructions and certain array access patterns.
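These three forms can be seen in even a trivial loop. A minimal illustrative sketch (the array and loop here are invented for illustration, not taken from the discussion above):

```python
# Sketch: the three forms of locality in a simple summation loop.

data = list(range(1000))

total = 0                   # "total" is touched on every iteration: temporal locality
for i in range(len(data)):  # the loop body's instructions repeat: sequential/temporal locality
    total += data[i]        # data[0], data[1], ... lie at adjacent addresses: spatial locality

print(total)
```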
The locality property provides the opportunity for us to use a small amount of very fast memory to effectively accelerate the majority of memory accesses. Because we know that, statistically, only a small amount of the entire memory space is being accessed at any given time, and values in that subset are being accessed repeatedly, we can copy those values from slower memory to the small fast memory. We thus end up with a memory system that can hold a large amount of information (in a large, low-cost memory) yet provide nearly the same access speed as would be obtained from having all of the memory be very fast and expensive.
The small and fast memory is, of course, the cache. Virtually all machines today use some form of cache. In the past a few supercomputers would eschew caches and provide large, fast, expensive memory. Why would this be done?
Because locality is entirely program-dependent, not all programs exhibit locality. For example, list-processing programs have logical locality that does not correspond to physical address locality. Even programs that process arrays and usually exhibit locality may have sections with only weak locality. When this happens, performance is reduced to a level corresponding to the time to access the larger and slower memory. Supercomputers, on the other hand, are built for ultimate speed and so their designers avoided caches until the 1990s.
By the late 1990s, processor clock rates were reaching levels that it was simply too hard to get signals to and from a large memory in time. Also, in order to get clock rates up to those levels, it was necessary to put more of the circuitry on a single chip so that it would be close together. Crossing the boundaries between chips slows the signals considerably, so that we actually needed to move the memory onto the same chip with the CPU. Of course, the chip could not hold the entire memory for the system, so the fast memory on the chip had to be a cache.
When a memory reference finds a value that it is seeking in the cache, the event is called a hit. If a reference fails to find a sought-after value in the cache, the event is called a miss. When a miss occurs, most memory systems automatically go to successively lower levels of the hierarchy to try to find the desired value.
The hit ratio, h, is an important measure of the performance of a memory level: it is the probability that a reference finds the value it seeks at a given level of the hierarchy. The miss ratio is 1 - h and indicates the fraction of references to a level of the hierarchy that fail to find the desired data at that level (and must therefore look at a lower level).
The access frequency of a level is the product of its hit ratio with the miss ratios of all higher levels.
The effective access time of the memory system is the sum of the access frequencies times their corresponding access times. This formula is overly simplistic; for example, taken literally it assumes there is some probability that an ordinary reference must go all the way to off-line storage (such as tape) to get its data.
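These definitions can be sketched numerically. The hit ratios and access times below are invented for a hypothetical three-level hierarchy:

```python
# Sketch of effective access time for a three-level hierarchy
# (cache, main memory, disk). All numbers are invented.

levels = [
    # (hit ratio at this level, access time in ns)
    (0.95, 2),           # cache
    (0.99, 60),          # main memory
    (1.00, 10_000_000),  # disk (assume the data is always found here)
]

effective = 0.0
miss_product = 1.0  # product of the miss ratios of all higher levels
for hit_ratio, access_time in levels:
    frequency = miss_product * hit_ratio  # access frequency of this level
    effective += frequency * access_time
    miss_product *= (1.0 - hit_ratio)

print(effective)  # effective access time in ns
```

Note how the lowest level dominates the total even though it is rarely accessed, because its access time is so much larger.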
Some systems do not automatically seek values at lower levels of the hierarchy. Instead, the program is responsible for explicitly loading and storing data in the cache, just as for registers. This sort of cache is said to be explicitly managed. When would such an organization be useful?
When it is possible to program the use of the cache and it is undesirable to trust the automatic system to cache the proper data. An example is in digital signal processing, where all the actions of the processor are known in advance and variations in access time due to misses might cause the processor to fail to keep up with the incoming signal. The programmer of a DSP architecture can explicitly move data and instructions to and from the cache and thereby account for every instruction cycle. In a system with an automatic cache, the timing is not as controllable and hence not as predictable.
One of the implications of the differences in locality that we may encounter between data and instructions is that there is benefit in having separate instruction and data caches. The main benefit of separate caches, however, is that instructions and operands can be fetched simultaneously -- a design known as a Harvard architecture, after the Harvard Mark series of machines.
Cache Memory Design
Basic Cache Structures
A cache consists of lines of data, usually containing values from two or more consecutive addresses of main memory. Placing multiple words on a line helps the cache to exploit spatial and sequential locality. When we fetch any word on a line from main memory we fetch all of the surrounding words at the same time. So if the CPU subsequently requests any of those words, it does not have to suffer the time for another miss. Each line has associated with it a tag that stores the high order bits of the address for the data in the line.
When a memory reference occurs, the address from the processor is presented to the comparator in the cache. The high order bits are checked against the tags stored in the cache. If there is a match, then the line of data is read from the cache and the low order bits of the address select the appropriate word from the line to be passed to the CPU.
If there is no match among the tags, then a miss occurs and the address is sent to the main memory to fetch the line containing the desired word. The line is loaded into the cache and the desired value is simultaneously passed to the CPU. Hopefully, the next reference will exhibit locality and be to another word in the line just fetched, so that it is a hit.
At first glance, it might seem that the smart thing to do would be to fetch a longer line of data to increase the number of subsequent hits. However, it must be kept in mind that the longer the cache line, the longer it takes to read from main memory and thus the greater the time penalty for a miss.
If a program is working in a scattered locality, or in a very small locality, many of those fetched words may be wasted, and so performance decreases. The best length for the line is usually determined by software simulations of caches with different line lengths, running a wide range of benchmark programs.
Also, if we make the lines longer, then we must reduce the number of lines in the cache. Fewer lines permit fewer distinct places in memory to reside in the cache, so we may begin to suffer more temporal misses even as we better exploit spatial locality. In the extreme, the cache would consist of one huge line that takes a long time to load, and as soon as we referenced a location in memory outside the corresponding region, we would have to replace the entire content of the cache with the region surrounding the new location.
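The capacity trade-off is easy to see numerically. A sketch with an invented cache size:

```python
# Sketch: for a fixed cache capacity, longer lines mean fewer lines.
# The 4 KB capacity is invented for illustration.

cache_size = 4096  # bytes

for line_length in (16, 32, 64, 128):
    num_lines = cache_size // line_length
    print(f"{line_length}-byte lines -> {num_lines} lines")
```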
The manner in which the comparison is implemented has a major effect on the cache. There are three basic implementation schemes consisting of two extremes and any design that falls between them. The extremes are fully associative and direct-mapped (or non-associative). Between these fall various set-associative implementations.
In a fully associative cache, there is a comparator for each tag, and any line can contain any block of memory. This allows a value from main memory to be placed anywhere in the cache. If there is an empty space anywhere in the cache, it can be filled by a miss.
A fully associative cache has the advantage that it allows complete utilization of the cache memory. Its disadvantage is that having one comparator per line is very expensive. In addition, the circuit that identifies the matching tag among all of the comparisons has a high fan-in degree and thus it is difficult to make it fast.
In a direct-mapped cache, a portion of the reference address specifies a block number or line number (different manufacturers use different terms for marketing and patent reasons) in addition to the tag field of the address. The block number specifies a particular line in the cache, and the tag field of the address is compared with the tag field attached to that line. That is, when we go to fetch a word, we use the block (line) number portion of the address to select the corresponding line in the cache, read out the tag that is stored with that line and compare it to the tag portion of the address.
If the tags are equal, then it means that this line of the cache currently contains the values stored in the block of memory that we’re attempting to access. Thus, there is a hit. We then read out the particular word within the cache line, as specified by the low-order bits of the address (called the offset).
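The decomposition of an address into tag, line number, and offset can be sketched as follows; the line size and line count here are invented for illustration:

```python
# Sketch: splitting a byte address for a hypothetical direct-mapped
# cache with 64 lines of 16 bytes each.

LINE_BYTES = 16  # 4 offset bits
NUM_LINES = 64   # 6 line-number bits

def split_address(addr):
    offset = addr % LINE_BYTES                 # which word/byte within the line
    line = (addr // LINE_BYTES) % NUM_LINES    # which cache line to check
    tag = addr // (LINE_BYTES * NUM_LINES)     # compared against the stored tag
    return tag, line, offset

# Address 0x1234 = 4660: tag 4, line 35, offset 4
print(split_address(0x1234))
```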
In a direct mapped cache, there is a fixed mapping between addresses in memory and lines in the cache -- each location in memory maps to just one location in the cache. Because the cache is much smaller, there are many memory locations that map to each cache line. Even if the entire cache is empty except for one line, if the memory reference requests a word from a different block that maps to that same line, the value in the cache has to be displaced to main memory to make room for the requested value.
One advantage of direct mapping is that it requires just a single comparator, so it is very inexpensive to build. Direct mapping is also very fast, but it suffers from the problem that a collision between two main memory lines that map to the same line in cache (called a conflict) can cause the lines to thrash in and out, even when there is plenty of space left in the cache to hold them both. With increased cache size, the probability of this occurring is reduced and thus a few modern machines actually do use direct-mapped caches.
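The thrashing behavior can be demonstrated with a few lines of simulation; all parameters here are invented:

```python
# Sketch: conflict thrashing in a direct-mapped cache. Two blocks that
# map to the same line evict each other on every access, even though
# the rest of the cache is empty.

NUM_LINES = 64
LINE_BYTES = 16

tags = [None] * NUM_LINES  # tag stored with each line (None = empty)
misses = 0

def access(addr):
    global misses
    line = (addr // LINE_BYTES) % NUM_LINES
    tag = addr // (LINE_BYTES * NUM_LINES)
    if tags[line] != tag:   # miss: load the block, displacing the old one
        tags[line] = tag
        misses += 1

# Addresses 0 and 1024 both map to line 0 (1024 = LINE_BYTES * NUM_LINES).
for _ in range(10):
    access(0)
    access(1024)

print(misses)  # every one of the 20 accesses misses
```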
In order to characterize the activity in a cache, architects divide hits and misses up into different categories. When we get a hit in the cache, there are two potential reasons for the hit that are related to the kind of reuse of the data that we are experiencing. If the specific word has been previously accessed, then subsequent accesses are examples of temporal reuse and we call the hit a temporal hit. If the word is a hit but has not been previously accessed, then the hit came as a result of having fetched a word in the same block as the one we’re attempting to read. Thus, we are seeing spatial or sequential locality, and so we call this a spatial hit.
Misses are identified as either compulsory or conflict. A compulsory miss is one that fetches a memory block that has never been accessed before; thus, it must be loaded into the cache. A conflict miss is one that fetches a line that has been in the cache before but was evicted because another line conflicted with it.
These labels are useful for analyzing the statistical behavior of the cache. For example, if a cache has few spatial hits, then it indicates that the line length is not optimal. If there are few temporal hits, and many conflict misses, it may indicate that the cache has too few lines.
When a line is displaced from the cache (evicted), then we have to consider that the data in the line may differ from the data in memory, because the CPU may have written values back into the cache. We don’t want to lose this data, so we need to write it back out to memory.
Many caches keep an extra bit with each line, called a dirty bit, that records whether any words in the line have changed. When a line is first loaded, the dirty bit is cleared. If the CPU writes into any word of the line, then the dirty bit is set. On eviction, the memory system checks the dirty bit and only writes back if the bit is set. This saves writing back unchanged lines.
When we write to the cache there are two possible situations to consider. The write is to a line that is in the cache (called a write hit), or to a line that is not in the cache (a write miss). On a write hit, we simply overwrite the word in the cache (using the line number and offset portions of the address to select the location).
On a write miss, we must evict a line from cache to make room for the data to be written. (In the next discussion we look at some alternatives to this.) If there is a dirty bit and it is set, then we must write back the values in the line before overwriting the data in the cache. We must also consider that the CPU has only written one word in the line, and the other words are invalid. We could add status bits to identify them as invalid, or we can simply load the line from memory before writing the new word.
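The write path just described (on a miss, write back the victim if its dirty bit is set, load the line, then write the word) can be sketched as follows; the structures and sizes are invented for illustration:

```python
# Sketch: write handling with a dirty bit in a small direct-mapped cache.

NUM_LINES = 4
LINE_BYTES = 16

lines = [{"tag": None, "dirty": False} for _ in range(NUM_LINES)]
writebacks = 0

def write(addr):
    global writebacks
    line = (addr // LINE_BYTES) % NUM_LINES
    tag = addr // (LINE_BYTES * NUM_LINES)
    entry = lines[line]
    if entry["tag"] != tag:      # write miss
        if entry["dirty"]:       # victim line holds modified data
            writebacks += 1      # write it back to memory first
        entry["tag"] = tag       # load the new line from memory
        entry["dirty"] = False   # freshly loaded line matches memory
    entry["dirty"] = True        # the CPU has now modified the line

write(0)    # miss, clean victim: no write-back
write(4)    # hit in the same line
write(64)   # miss, dirty victim: one write-back (64 = LINE_BYTES * NUM_LINES)
print(writebacks)
```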
Note that this means a write miss is twice as expensive as any other memory operation because we are writing from the cache to memory and then reading from memory back into the cache. It should also be noted that the CPU can proceed without waiting for the line to be read, and in many cases there will be little or no actual delay in the CPU due to the read. Caches are designed with various mechanisms to avoid this extra cost, such as deferring the read until the CPU attempts to read back a word that it hasn’t written in the line.
© Copyright 1995, 1996, 2001 Charles C. Weems Jr. All rights reserved.