### Prefetching: Getting ahead of the cache miss curve



### Prefetch

- Accesses often follow simple patterns
- If the pattern can be identified, lines can be fetched before they are actually needed, avoiding misses
- Danger of fetching too soon: the line may be evicted before it is used
- Can evict useful data and force additional misses
- Can generate excessive memory traffic

# Wang ISCA 2003 Approach



## L2 Cache Prefetch

- RAM access time is more costly than L1 or L2
- L2 has more idle time, allowing more costly prefetch
- Longer lines carry more information to work from
- L2-to-RAM prefetch in a multiprocessor can impede shared memory access

## Scheduled Region Prefetch

- Aggressive prefetch of 4KB blocks on L2 miss
- Prefetch engine queues prefetch requests
- Prioritizer prevents interference with L2 misses
- More recent requests take priority
- Prefetch replaces LRU block
- Increases memory traffic by 180%
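The queueing behavior listed above can be sketched as a toy model. This is only an illustration under assumptions, not the mechanism from the original work: the 64-byte line size, the class name `RegionPrefetchScheduler`, and the use of a plain stack for "more recent requests take priority" are all hypothetical choices, and LRU replacement is not modeled.

```python
from collections import deque

LINE_BYTES = 64        # assumed line size (not given in the slides)
REGION_BYTES = 4096    # 4KB region, as stated in the slide


class RegionPrefetchScheduler:
    """Toy model: demand misses always go first, prefetches newest-first."""

    def __init__(self):
        self.demand_queue = deque()   # pending L2 demand misses (FIFO)
        self.prefetch_stack = []      # pending prefetch lines (LIFO)

    def on_l2_miss(self, addr):
        miss_line = addr - addr % LINE_BYTES
        self.demand_queue.append(miss_line)
        region_base = addr - addr % REGION_BYTES
        lines = [a for a in range(region_base, region_base + REGION_BYTES, LINE_BYTES)
                 if a != miss_line]
        # Push in reverse so the newest region's lowest line is on top.
        self.prefetch_stack.extend(reversed(lines))

    def next_request(self):
        # The prioritizer: a prefetch is issued only when no demand miss waits.
        if self.demand_queue:
            return ("demand", self.demand_queue.popleft())
        if self.prefetch_stack:
            return ("prefetch", self.prefetch_stack.pop())
        return None
```

Each miss queues the other 63 lines of its region, which gives a feel for why such an aggressive scheme adds so much memory traffic.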

# Guided Region Prefetch

- Use compiler hints to improve accuracy and reduce number of generated prefetches
- Constrain size of prefetch using loop bounds info
- Add pointer prefetching: assume any value that falls within the address range of the heap is a pointer and prefetch that address if not already in cache
- Recursively prefetch "pointer" values found within a prefetched line to a depth of six
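The pointer prefetching idea on this slide (treat any value in the heap's address range as a pointer, and chase such values recursively to a depth of six) can be sketched roughly as follows. The heap bounds, the 16-byte line size, the dictionary-based memory model, and the function name `pointer_prefetch` are assumptions made for illustration only.

```python
HEAP_LO, HEAP_HI = 0x1000, 0x2000   # hypothetical heap address range
LINE_BYTES = 16                     # hypothetical line size
MAX_DEPTH = 6                       # recursion depth from the slide


def pointer_prefetch(memory, line_addr, cache, depth=1, issued=None):
    """Scan one line for heap-like values and prefetch what they point to.

    memory: dict mapping line address -> list of word values in that line
    cache:  set of line addresses currently cached (updated in place)
    Returns the list of line addresses prefetched, in issue order.
    """
    issued = [] if issued is None else issued
    if depth > MAX_DEPTH:
        return issued
    for value in memory.get(line_addr, []):
        if HEAP_LO <= value < HEAP_HI:            # looks like a heap pointer
            target = value - value % LINE_BYTES
            if target not in cache:               # skip lines already cached
                cache.add(target)
                issued.append(target)
                # Scan the newly prefetched line one level deeper.
                pointer_prefetch(memory, target, cache, depth + 1, issued)
    return issued


# A 10-node linked list: each line holds [next pointer, small payload].
nodes = [0x1000 + LINE_BYTES * i for i in range(10)]
memory = {a: [nodes[i + 1] if i < 9 else 0, i] for i, a in enumerate(nodes)}
prefetched = pointer_prefetch(memory, nodes[0], cache={nodes[0]})
```

In this example a miss on the first node prefetches the next six nodes and then stops, because the depth limit is reached. Any integer that merely happens to fall in the heap range would also be chased, which is the inaccuracy the compiler hints are meant to reduce.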

# Compiler Hints

- Spatial: tags a load as likely to need spatial prefetch (leaves many loads untagged)
- Size: indicates range to prefetch
- Indirect: assume miss location is part of an index array and prefetch using the line's values as indexes
- Pointer: the miss is getting data in a structure that contains pointers, so use those to prefetch
- Recursive pointer: the miss is part of a pointer chain

# Results: Number of Hints

| Benchmark   | mem insts | spatial | pointer | recursive | ratio(%) | indirect |
|-------------|-----------|---------|---------|-----------|----------|----------|
| 164.gzip    | 1873      | 433     | 268     | 0         | 37.1     | 9        |
| 168.wupwise | 507       | 152     | 0       | 0         | 30.0     | 0        |
| 171.swim    | 250       | 115     | 0       | 0         | 46.0     | 0        |
| 172.mgrid   | 314       | 232     | 0       | 0         | 73.9     | 3        |
| 173.applu   | 1491      | 858     | 0       | 0         | 57.5     | 0        |
| 175.vpr     | 4230      | 1001    | 682     | 74        | 33.8     | 84       |
| 177.mesa    | 26777     | 4532    | 4419    | 76        | 32.8     | 9        |
| 179.art     | 1016      | 732     | 278     | 0         | 77.6     | 0        |
| 181.mcf     | 845       | 168     | 287     | 201       | 60.8     | 0        |
| 183.equake  | 1679      | 597     | 473     | 0         | 51.3     | 7        |
| 186.crafty  | 11702     | 1994    | 736     | 0         | 21.6     | 5        |
| 188.ammp    | 6271      | 1043    | 1158    | 0         | 33.2     | 5        |
| 197.parser  | 4090      | 915     | 932     | 1263      | 70.2     | 2        |
| 254.gap     | 29781     | 5102    | 11243   | 0         | 52.6     | 36       |
| 256.bzip2   | 698       | 279     | 59      | 0         | 48.3     | 14       |
| 300.twolf   | 12397     | 2080    | 2577    | 1398      | 45.1     | 38       |
| 301.apsi    | 3225      | 1001    | 0       | 0         | 31.0     | 0        |
| sphinx      | 6335      | 2211    | 1129    | 364       | 46.8     | 106      |

## Results: Pointer Prefetch



# Comparison: Stride, SRP



### Stride prefetcher has more expensive hardware









# Discussion

### Jain 2018: Rethinking Belady's Algorithm to Accommodate Prefetching

# Belady's Algorithm (MIN)

- Originally for OS page replacement
- Peer into the future
- Replace page that will not be used for the longest time
- Because page mapping is fully associative, there is no specification for what to do when there are multiple lines that are no longer needed
- Obviously not practical
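As a concrete reference point, MIN can be simulated offline when the whole trace is known in advance. The sketch below assumes a fully associative cache; the function name `min_misses` is hypothetical. Ties among lines that are never used again are broken arbitrarily, which does not change the miss count.

```python
def min_misses(trace, capacity):
    """Count misses under Belady's MIN for a fully associative cache."""
    cache = set()
    misses = 0
    for now, line in enumerate(trace):
        if line in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(candidate):
                # Position of the next reference, or infinity if none.
                try:
                    return trace.index(candidate, now + 1)
                except ValueError:
                    return float("inf")
            # Evict the line whose next use is furthest in the future.
            cache.remove(max(cache, key=next_use))
        cache.add(line)
    return misses


trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(min_misses(trace, capacity=3))
```

The function needs the complete future trace, which is exactly why the policy is an oracle rather than something hardware can implement directly.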

# Terminology

- Demand: the processor requests a location
- Demand miss: the request misses; Demand hit: the request hits
- Prefetch: a separate predictor tries to anticipate a demand
- Accurate prefetch: one that is demanded before eviction
- Shadowed: a prefetch followed by another without a demand between
- Prefetch friendly: a successful prefetch following a demand
- Dead: a demand load followed by an inaccurate prefetch

## Basic Idea

- Optimizing demand misses ignores effect of prefetch
- Prefetches can be inaccurate, and they can come too early
- Change policy to first evict the line that will be prefetched furthest in the future, then fall back to evicting the line with a demand request furthest in the future
- The idea is to evict things that will be prefetched later to make more room for demand fetches
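The changed eviction rule can be compared against plain MIN on a small trace. This is a minimal sketch under assumptions: the trace format (`"D"` for demand, `"P"` for prefetch), the function names, and the choice to evict lines that are never referenced again before anything else are illustrative and not taken from the paper.

```python
INF = float("inf")


def next_access(trace, line, start):
    """Return (time, kind) of the next access to line at or after start."""
    for t in range(start, len(trace)):
        if trace[t][1] == line:
            return t, trace[t][0]
    return INF, None


def choose_victim(trace, cache, now, demand_min):
    info = {c: next_access(trace, c, now + 1) for c in cache}
    # Assumption: lines never referenced again are free to evict.
    dead = [c for c in cache if info[c][0] == INF]
    if dead:
        return sorted(dead)[0]
    if demand_min:
        # First choice: the line whose next access is a prefetch,
        # picking the one prefetched furthest in the future.
        prefetched = [c for c in cache if info[c][1] == "P"]
        if prefetched:
            return max(prefetched, key=lambda c: info[c][0])
    # Fallback (and plain MIN): next access furthest in the future.
    return max(cache, key=lambda c: info[c][0])


def simulate(trace, capacity, demand_min):
    """Return (demand misses, lines fetched from memory)."""
    cache, demand_misses, fetches = set(), 0, 0
    for now, (kind, line) in enumerate(trace):
        if line in cache:
            continue
        fetches += 1
        if kind == "D":
            demand_misses += 1
        if len(cache) >= capacity:
            cache.remove(choose_victim(trace, cache, now, demand_min))
        cache.add(line)
    return demand_misses, fetches


trace = [("D", "A"), ("D", "B"), ("D", "C"), ("P", "A"), ("D", "A"), ("D", "B")]
min_result = simulate(trace, capacity=2, demand_min=False)
demand_min_result = simulate(trace, capacity=2, demand_min=True)
```

On this trace with a two-line cache, MIN keeps A because its next access is soonest, and later misses on B. The prefetch-aware rule evicts A instead, since the prefetch will bring it back anyway, and saves one demand miss for the same number of fetches.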





# Line Usage Intervals

| Usage Interval                         | Interval Definition    | Cacheable by MIN | Cacheable by Demand-MIN |
|----------------------------------------|------------------------|------------------|-------------------------|
| D-D                                    | Demand Reuse           | $\checkmark$     | $\checkmark$            |
| P-D                                    | Accurate Prefetch      | $\checkmark$     | $\checkmark$            |
| D-open                                 | Scan                   | $\times$         | $\times$                |
| P-open                                 | Inaccurate Prefetch    | $\times$         | $\times$                |
| $P-P_{accurate}$, $P-P_{inaccurate}$   | Shadowed Line          | $\checkmark$     | $\times$                |
| $D-P_{accurate}$                       | Prefetch-Friendly Line | $\checkmark$     | $\times$                |
| $D-P_{inaccurate}$                     | Dead Line              | $\checkmark$     | $\times$                |

Table I. Unlike MIN, Demand-MIN evicts all \*-P intervals.

Avoids wasting space on prefetches that aren't needed or will recur

# Demand-MIN

- Still requires an oracle
- Produces a schedule that minimizes demand misses with prefetch
- However, it can produce more memory traffic because it evicts more prefetched values which then need to be prefetched again

### Examples



# Need Something Adjustable

- Flex-MIN: Evict the line that is prefetched furthest in the future and is not protected, otherwise fall back to MIN
- Protected lines: At the beginning of \*-P intervals, and if evicted would increase traffic with little improvement in demand hit rate
- Example: Evicting a long chain of short-duration prefetched values to keep a demand value in cache, vs. evicting a smaller number of long-duration prefetched values

# Hawkeye cache

- Approximates optimal (OPT) by looking at the next reuse: lines reused later are better eviction candidates, which is what OPT would have chosen
- Learns behaviors of references based on PC value
- Associates the learned prediction with the PC value
- Claim it can be implemented efficiently (it hasn't been implemented)
- 2K 5-bit counters per core
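A PC-indexed table of saturating counters, sized as in the slide (2K entries of 5 bits), might look roughly like the sketch below. The modulo indexing, the initial counter value, the midpoint threshold, and the class name `PCPredictor` are assumptions for illustration rather than details confirmed by the source.

```python
NUM_COUNTERS = 2048    # "2K" counters, from the slide
COUNTER_MAX = 31       # largest value of a 5-bit counter


class PCPredictor:
    """Toy predictor: does a load at this PC tend to be cache-friendly?"""

    def __init__(self):
        # Assumption: start just above the midpoint (weakly friendly).
        self.counters = [COUNTER_MAX // 2 + 1] * NUM_COUNTERS

    def _index(self, pc):
        return pc % NUM_COUNTERS          # hypothetical hash of the PC

    def train(self, pc, opt_would_hit):
        # Move the counter toward what the optimal policy would have done.
        i = self._index(pc)
        if opt_would_hit:
            self.counters[i] = min(COUNTER_MAX, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

    def is_cache_friendly(self, pc):
        return self.counters[self._index(pc)] > COUNTER_MAX // 2
```

Lines loaded by a PC predicted as unfriendly would be preferred as eviction victims. The training signal `opt_would_hit` has to come from replaying recent accesses against the optimal policy, which is not modeled here.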

## Harmony

- Changes Hawkeye to learn from Flex-MIN instead of MIN
- Distinguishes between \*-P and \*-D intervals
- Only allows \*-P caching if length is less than a threshold (keep the short-duration prefetches to avoid extra memory traffic)
- Threshold is ratio of D-D intervals that hit to \*-P intervals that are cache friendly (Lines Evicted Per Demand Hit or LED)
- Keeps four counters for Demand Miss and cache-friendly \*-P statistics

# Methodology

- ChampSim
- Trace-based, 6-wide OoO, RoB, 3-level cache, reasonably sophisticated memory simulation, perceptron branch predictor, not validated
- L1 uses next-line prefetch (49% accurate), L2 uses stride prefetch (63%)
- Use SimPoint to select traces, at least 1B instructions with 200M warmup

# Configuration

| Component    | Configuration                                                                                              |
|--------------|------------------------------------------------------------------------------------------------------------|
| L1 I-Cache   | 32 KB 8-way, 4-cycle latency                                                                               |
| L1 D-Cache   | 32 KB 8-way, 4-cycle latency                                                                               |
| L2 Cache     | 256KB 8-way, 8-cycle latency                                                                               |
| LLC per core | 2MB, 16-way, 20-cycle latency                                                                              |
| DRAM         | 13.5ns for row hits, 40.5ns for row misses, 800MHz, 3.2 GB/s for single-core, and 12.8 GB/s for multi-core |
| Single-core  | 2MB shared LLC                                                                                             |
| Four-core    | 8MB shared LLC                                                                                             |
| Eight-core   | 16MB shared LLC                                                                                            |

Table II. Baseline configuration.

1B instructions won't give statistical significance on 2MB LLC accesses



## Results



Figure 11. Flex-MIN achieves a good tradeoff between MIN and Demand-MIN.







Figure 19. Harmony outperforms top submissions to the Cache Replacement Championship 2017. An older version of Harmony won the championship.

### But it won a contest!




# Discussion