Discussion 19
Memory Hierarchy and Direct-Mapped Caching
Memory Hierarchy
There are many different technologies for storing
data. In the old days, cathode ray tubes, acoustical delay lines, magnetic
drums, magnetic cores, punched cards, and open-reel magnetic tape were popular.
Today, most of these technologies have been replaced by semiconductor memory,
magnetic disk, optical disk, and helical-scan magnetic tape cartridges.
Each of the technologies has a place in the market
because it covers a range of access times with inversely proportional cost per
unit, which in turn is proportional to density. CD-ROM filled a gap in access
time between disk and tape that had emerged as disks got faster and tapes did
not. The price of CD-ROM is basically determined by a market analysis that
places it between disk and tape -- it is actually much cheaper to produce than
the retail price would indicate. DVD-ROM is the next generation of this
technology, which extends the storage capacity by an order of magnitude for
cost that is only slightly above the CD. The slightly lower cost of the CD
maintains a place in the market for uses that do not require the extra space,
and which also retain backwards compatibility for a large installed base of
readers.

There appears to be a wide gap between the main and
secondary memory, but we are used to the fact that this gap exists, and both
systems and languages are structured to work around it. That is, it is part of
the programming model.
The memory hierarchy is viable because programs tend
to exhibit a property known as locality. There are three basic forms of
locality:
Temporal locality: Recently accessed items tend to be
accessed again in the near future.
Spatial locality: Accesses are clustered in the
address space.
Sequential locality: Instructions tend to be accessed
in sequential memory locations.
Temporal locality tends to hold under all conditions
for code (even parallel or recursive codes) although it may break down in a
rule-based system. For data, it is most effective in array-based applications,
and with scalar values such as constants that are used widely in the code.
The presence of spatial locality implies some temporal
locality (i.e. if all accesses are within a small block of memory, then it is
probable that many of them are being accessed repeatedly). However, temporal
locality does not imply spatial locality (e.g. a linked-list program might
repeatedly run through a short list who's elements are scattered).
Sequential locality holds for instructions and certain
array access patterns.
The locality property provides the opportunity for us
to use a small amount of very fast memory to effectively accelerate the
majority of memory accesses. Because we know that, statistically, only a small
amount of the entire memory space is being accessed at any given time, and
values in that subset are being accessed repeatedly, we can copy those values
from slower memory to the small fast memory. We thus end up with a memory
system that can hold a large amount of information (in a large, low-cost
memory) yet provide nearly the same access speed as would be obtained from
having all of the memory be very fast and expensive.
The small and fast memory is, of course, the cache.
Virtually all machines today use some form of cache. In the past a few
supercomputers would eschew caches and provide large, fast, expensive memory.
Why would this be done?
Because locality is entirely program-dependent, not
all programs exhibit locality. For example, list-processing programs have
logical locality that does not correspond to physical address locality. Even
programs that process arrays and usually exhibit locality may have sections
with only weak locality. When this happens, performance is reduced to a level
corresponding to the time to access the larger and slower memory.
Supercomputers, on the other hand, are built for ultimate speed and so their
designers avoided caches until the 1990s.
By the late 1990s, processor clock rates were reaching
levels that it was simply too hard to get signals to and from a large memory in
time. Also, in order to get clock rates up to those levels, it was necessary to
put more of the circuitry on a single chip so that it would be close together.
Crossing the boundaries between chips slows the signals considerably, so that
we actually needed to move the memory onto the same chip with the CPU. Of
course, the chip could not hold the entire memory for the system, so the fast
memory on the chip had to be a cache.
When a memory reference finds a value that it is
seeking in the cache, the event is called a hit. If a reference fails to find a sought-after value in
the cache, the event is called a miss.
When a miss occurs, most memory systems automatically go to successively lower
levels of the hierarchy to try to find the desired value.
The hit ratio is an important measure of the performance of a memory level and is
the probability that a reference is to a value already in a given level of the
hierarchy. The miss ratio is 1 - h and indicates the percentage of references
to a level of the hierarchy that fail to find the desired data at that level
(and must therefore look at a lower level).
The access frequency is the product of the hit ratio
for the given level with the miss ratios of all higher levels.
The effective access time of the memory system is the
sum of the access frequencies times their corresponding access times. This
formula is overly simplistic, as it assumes there is some probability that you
have to go to off-line storage (such as tape) to get data, for example.
Some systems do not automatically seek values at lower
levels of the hierarchy. Instead, the program is responsible for explicitly
loading and storing data in the cache, just as for registers. This sort of
cache is said to be explicitly managed. When would such an organization be useful?
When it is possible to program the use of the cache
and it is undesirable to trust the automatic system to cache the proper data.
An example is in digital signal processing, where all the actions of processor
are known in advance and variations in access time due to misses might cause
the processor to fail to keep up with the incoming signal. The programmer of a
DSP architecture can explicitly move data and instructions to and from the
cache and thereby account for every instruction cycle. In a system with an
automatic cache, the timing is not as controllable and hence not as
predictable.
One of the implications of the differences in locality
that we may encounter between data and instructions is that there is benefit in
having separate instruction and data caches. The main benefit of separate
caches, however, is that instructions and operands can be fetched
simultaneously -- a design known as a Harvard architecture, after the Harvard
Mark series of machines.
Cache Memory Design
Basic Cache Structures
A cache consists of lines of data, usually containing
values from two or more consecutive addresses of main memory. Placing multiple
words on a line helps the cache to exploit spatial and sequential locality.
When we fetch any word on a line from main memory we fetch all of the
surrounding words at the same time. So if the CPU subsequently requests any of
those words, it does not have to suffer the time for another miss. Each line
has associated with it a tag that stores the high order bits of the address for
the data in the line.

When a memory reference occurs, the address from the
processor is presented to the comparator in the cache. The high order bits are
checked against the tags stored in the cache. If there is a match, then the
line of data is read from the cache and the low order bits of the address
select the appropriate word from the line to be passed to the CPU.
If there is no match among the tags, then a miss occurs
and the address is sent to the main memory to fetch the line containing the
desired word. The line is loaded into the cache and the desired value is
simultaneously passed to the CPU. Hopefully, the next reference will exhibit
locality and be to another word in the line just fetched, so that it is a hit.
At first glance, it might seem that the smart thing to
do would be to fetch a longer line of data to increase the number of subsequent
hits. However, it must be kept in mind that the longer the cache line, the
longer it takes to read from main memory and thus the greater the time penalty
for a miss.
If a program is working in a scattered locality, or in
a very small locality, many of those fetches may wasted and so performance
decreases. The best length for the line is usually determined by software
simulations of caches with different line lengths, running a wide range of
benchmark programs.
Also, if we make the lines longer, then we must reduce
the number of lines in the cache. Fewer lines permit fewer distinct place in
memory to reside in the cache. So we may begin to suffer more temporal misses,
while we are better exploiting spatial locality. It the extreme, you can see
that we would have one huge line in cache that would take a long time to load.
And as soon as we wanted to reference a location in memory that is outside of
the corresponding region, we would have to replace the entire content of the
cache with region surrounding the new location.
The manner in which the comparison is implemented has
a major effect on the cache. There are three basic implementation schemes
consisting of two extremes and any design that falls between them. The extremes
are fully associative and direct-mapped (or non-associative). Between these
fall various set-associative implementations.
In a fully associative cache, there is a comparator
for each tag, and any line can contain any block of memory. This allows a value
from main memory to be placed anywhere in the cache. If there is an empty space
anywhere in the cache, it can be filled by a miss.

A fully associative cache has the advantage that it
allows complete utilization of the cache memory. Its disadvantage is that
having one comparator per line is very expensive. In addition, the circuit that
identifies the matching tag among all of the comparisons has a high fan-in
degree and thus it is difficult to make it fast.
In a direct-mapped cache, a portion of the reference
address specifies a block number
or line number (different
manufacturers use different terms for marketing and patent reasons) in addition
to the tag field of the address. The block number specifies a particular line
in the cache, and the tag field of the address is compared with the tag field
attached to that line. That is, when we go to fetch a word, we use the block
(line) number portion of the address to select the corresponding line in the
cache, read out the tag that is stored with that line and compare it to the tag
portion of the address.
![]()
If the tags are equal, then it means that this line of
the cache currently contains the values stored in the block of memory that
we’re attempting to access. Thus, there is a hit. We then read out the
particular word within the cache line, as specified by the low-order bits of
the address (called the offset).
In a direct mapped cache, there is a fixed mapping between addresses in memory and lines in the cache -- each location in memory maps to just one location in the cache. Because the cache is much smaller, there are many memory locations that map to each cache line. Even if the entire cache is empty except for one line, if the memory reference requests a word from a different block that maps to that same line, the value in the cache has to be displaced to main memory to make room for the requested value.

One advantage of direct mapping is that it requires
just a single comparator, so it is very inexpensive to build. Direct mapping is
also very fast, but it suffers from the problem that a collision between two
main memory lines that map to the same line in cache (called a conflict) can cause the lines to thrash in and out, even when
there is plenty of space left in the cache to hold them both. With increased
cache size, the probability of this occurring is reduced and thus a few modern
machines actually do use direct-mapped caches.
In order to characterize the activity in a cache,
architects divide hits and misses up into different categories. When we get a
hit in the cache, there are two potential reasons for the hit that are related
to the kind of reuse of the data that we are experiencing. If the specific word
has been previously accessed, then subsequent access are examples of temporal
reuse and we call the hit a temporal hit. If the word is a hit but
has not been previously accessed, then the hit came as a result of having
fetched a word in the same block as the one we’re attempting to read.
Thus, we are seeing spatial or sequential locality, and so we call this a spatial
hit.
Misses are identified as either compulsory or conflict.
A compulsory miss is one that fetches a memory block that has never been
accessed before. Thus, it must be loaded into the cache. A conflict miss is one
for which a line is being fetched that has been in cache before. Thus, the line
has been evicted due to another line conflicting with it.
These labels are useful for analyzing the statistical
behavior of the cache. For example, if a cache has few spatial hits, then it
indicates that the line length is not optimal. If there are few temporal hits,
and many conflict misses, it may indicate that the cache has too few lines.
When a line is displaced from the cache (evicted),
then we have to consider that the data in the line may differ from the data in
memory, because the CPU may have written values back into the cache. We
don’t want to lose this data, so we need to write it back out to memory.
Many caches keep an extra bit with each line, called a
dirty bit, that records whether
any words in the line have changed. When a line is first loaded, the dirty bit
is cleared. If the CPU writes into any word of the line, then the dirty bit is
set. On eviction, the memory system checks the dirty bit and only writes back
if the bit is set. This saves writing back unchanged lines.
When we write to the cache there are two possible
situations to consider. The write is to a line that is in the cache (called a write
hit), or to a line that is not in the
cache (a write miss). On a write
hit, we simply overwrite the word in the cache (using the line number and
offset portions of the address to select the location).
On a write miss, we must evict a line from cache to
make room for the data to be written. (In the next discussion we look at some
alternatives to this.) If there is a dirty bit and it is set, then we must
write back the values in the line before overwriting the data in the cache. We
must also consider that the CPU has only written one word in the line, and the
other words are invalid. We could add status bits to identify them as invalid,
or we can simply load the line from memory before writing the new word.
Note that this means a write miss is twice as
expensive as any other memory operation because we are writing from the cache
to memory and then reading from memory back into the cache. It should also be
noted that the CPU can proceed without waiting for the line to be read, and in
many cases, there will be little or no actual delay in CPU due to the read.
Caches are designed with various mechanisms to avoid this extra cost, such as
deferring the read until the CPU attempts to read back a word that it
hasn’t written in the line.
© Copyright 1995, 1996, 2001 Charles C. Weems Jr.
All rights reserved.
Back to Chip Weems' home page.
Back to Computer Science Department home page.