









## Cache Example

- 8 blocks, 1 word/block, direct mapped
- Initial state

| Index | V | Tag | Data |
|-------|---|-----|------|
| 000   | N |     |      |
| 001   | N |     |      |
| 010   | N |     |      |
| 011   | N |     |      |
| 100   | N |     |      |
| 101   | N |     |      |
| 110   | N |     |      |
| 111   | N |     |      |

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 9

## Cache Example

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 22        | 10 110      | Miss     | 110         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | N |     |            |
| 001   | N |     |            |
| 010   | N |     |            |
| 011   | N |     |            |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Y | 10  | Mem[10110] |
| 111   | N |     |            |

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 26        | 11 010      | Miss     | 010         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | N |     |            |
| 001   | N |     |            |
| 010   | Y | 11  | Mem[11010] |
| 011   | N |     |            |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Y | 10  | Mem[10110] |
| 111   | N |     |            |

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 22        | 10 110      | Hit      | 110         |
| 26        | 11 010      | Hit      | 010         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | N |     |            |
| 001   | N |     |            |
| 010   | Y | 11  | Mem[11010] |
| 011   | N |     |            |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Y | 10  | Mem[10110] |
| 111   | N |     |            |

## Cache Example

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 16        | 10 000      | Miss     | 000         |
| 3         | 00 011      | Miss     | 011         |
| 16        | 10 000      | Hit      | 000         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | Y | 10  | Mem[10000] |
| 001   | N |     |            |
| 010   | Y | 11  | Mem[11010] |
| 011   | Y | 00  | Mem[00011] |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Y | 10  | Mem[10110] |
| 111   | N |     |            |

## Cache Example

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 18        | 10 010      | Miss     | 010         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | Y | 10  | Mem[10000] |
| 001   | N |     |            |
| 010   | Y | 10  | Mem[10010] |
| 011   | Y | 00  | Mem[00011] |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Y | 10  | Mem[10110] |
| 111   | N |     |            |
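The access sequence in the tables above (22, 26, 22, 26, 16, 3, 16, 18) can be replayed with a small simulator. This is a minimal sketch, assuming one word per block and a power-of-two block count. The function name `simulate_direct_mapped` and its return format are illustrative and not part of the slides.

```python
def simulate_direct_mapped(word_addrs, num_blocks=8):
    """Replay word addresses through a direct-mapped cache (1 word/block)."""
    index_bits = num_blocks.bit_length() - 1  # 8 blocks -> 3 index bits
    # One entry per cache block: None means invalid (V = N), else the stored tag.
    cache = [None] * num_blocks
    trace = []
    for addr in word_addrs:
        index = addr % num_blocks      # low-order bits select the block
        tag = addr >> index_bits       # remaining high-order bits form the tag
        hit = cache[index] == tag
        if not hit:
            cache[index] = tag         # fill (or replace) the block on a miss
        trace.append((addr, f"{index:0{index_bits}b}", "Hit" if hit else "Miss"))
    return trace, cache


trace, final_state = simulate_direct_mapped([22, 26, 22, 26, 16, 3, 16, 18])
for addr, index, outcome in trace:
    print(addr, index, outcome)
```

The last access (18) maps to index 010 with tag 10, so it evicts the block holding Mem[11010], matching the final table.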







































































## The Memory Hierarchy

## The BIG Picture

- Common principles apply at all levels of the memory hierarchy


- Based on notions of caching

- At each level in the hierarchy
  - Block placement
  - Finding a block
  - Replacement on a miss
  - Write policy


## Block Placement

- Determined by associativity
  - Direct mapped (1-way associative)
    - One choice for placement
  - n-way set associative
    - n choices within a set
  - Fully associative
    - Any location
- Higher associativity reduces miss rate
  - Increases complexity, cost, and access time

## Finding a Block

| Associativity         | Location method                               | Tag comparisons |
|-----------------------|-----------------------------------------------|-----------------|
| Direct mapped         | Index                                         | 1               |
| n-way set associative | Set index, then search entries within the set | n               |
| Fully associative     | Search all entries                            | #entries        |
| Fully associative     | Full lookup table                             | 0               |

- Reduce comparisons to reduce cost
- Virtual memory
  - Full table lookup makes full associativity feasible
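The placement and lookup rules above can be sketched as a single address split. This is an illustrative sketch, assuming block addresses and power-of-two sizes. The helper `locate` is a hypothetical name, and it does not model the full-lookup-table case used by virtual memory.

```python
def locate(block_addr, num_blocks, ways):
    """Return (set_index, tag, tag_comparisons) for a block address.

    ways == 1 is direct mapped; ways == num_blocks is fully associative.
    """
    num_sets = num_blocks // ways
    set_index = block_addr % num_sets   # which set the block must go in
    tag = block_addr // num_sets        # identifies the block within that set
    return set_index, tag, ways         # one tag comparison per way in the set


# Block address 22 in an 8-block cache at three associativities.
for ways in (1, 2, 8):
    print(ways, locate(22, num_blocks=8, ways=ways))
```

With one way there are 8 sets and a single comparison; with 8 ways there is one set, the whole address becomes the tag, and all 8 entries must be searched.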





| Design change          | Effect on miss rate        | Negative performance effect                                                                |
|------------------------|----------------------------|--------------------------------------------------------------------------------------------|
| Increase cache size    | Decrease capacity misses   | May increase access time                                                                   |
| Increase associativity | Decrease conflict misses   | May increase access time                                                                   |
| Increase block size    | Decrease compulsory misses | Increases miss penalty. For very large block size, may increase miss rate due to pollution |
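The "increase associativity" row can be illustrated by counting misses for the same access sequence at different associativities. This is a rough sketch, assuming LRU replacement and a 4-block cache. The block-address sequence 0, 8, 0, 6, 8 is chosen here for illustration and does not appear in the table, and `count_misses` is a hypothetical helper.

```python
from collections import OrderedDict


def count_misses(block_addrs, num_blocks, ways):
    """Count misses in a set-associative cache with LRU replacement."""
    num_sets = num_blocks // ways
    # Each set maps tag -> True, ordered from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in block_addrs:
        entries = sets[addr % num_sets]
        tag = addr // num_sets
        if tag in entries:
            entries.move_to_end(tag)          # hit: mark as most recently used
        else:
            misses += 1
            if len(entries) == ways:
                entries.popitem(last=False)   # set full: evict least recently used
            entries[tag] = True
    return misses


sequence = [0, 8, 0, 6, 8]
for ways in (1, 2, 4):
    print(f"{ways}-way: {count_misses(sequence, num_blocks=4, ways=ways)} misses")
```

In the direct-mapped case, blocks 0 and 8 keep evicting each other from the same index. With more ways they can coexist, so the conflict misses disappear.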













## Cache Coherence Problem

§5.8 Parallelism and Memory Hierarchies: Cache Coherence

- Suppose two CPU cores share a physical address space
  - Write-through caches

| Time step | Event               | CPU A's cache | CPU B's cache | Memory |
|-----------|---------------------|---------------|---------------|--------|
| 0         |                     |               |               | 0      |
| 1         | CPU A reads X       | 0             |               | 0      |
| 2         | CPU B reads X       | 0             | 0             | 0      |
| 3         | CPU A writes 1 to X | 1             | 0             | 1      |
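The time steps in the table can be reproduced with a toy model of two private write-through caches that have no coherence mechanism. This is a minimal sketch. The class `WriteThroughCache` and the dictionary standing in for memory are assumptions made for illustration, not a model of any real protocol.

```python
class WriteThroughCache:
    """Private cache with write-through and no invalidation or update of peers."""

    def __init__(self, memory):
        self.memory = memory   # shared backing store (a plain dict here)
        self.lines = {}        # address -> locally cached value

    def read(self, addr):
        if addr not in self.lines:               # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]                  # hit: return the local copy

    def write(self, addr, value):
        self.lines[addr] = value                 # update the local copy
        self.memory[addr] = value                # write through to memory


memory = {"X": 0}                 # time step 0
cache_a = WriteThroughCache(memory)
cache_b = WriteThroughCache(memory)

cache_a.read("X")                 # step 1: CPU A reads X
cache_b.read("X")                 # step 2: CPU B reads X
cache_a.write("X", 1)             # step 3: CPU A writes 1 to X

stale = cache_b.read("X")         # CPU B still hits on its old copy
print(cache_a.read("X"), stale, memory["X"])
```

After step 3, memory and CPU A's cache hold 1 while CPU B's cache still returns 0, which is the inconsistency shown in the last table row.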









## 2-Level TLB Organization

|                   | Intel Nehalem | AMD Opteron X4 |
|-------------------|---------------|----------------|
| Virtual addr      | 48 bits | 48 bits |
| Physical addr     | 44 bits | 48 bits |
| Page size         | 4KB, 2/4MB | 4KB, 2/4MB |
| L1 TLB (per core) | L1 I-TLB: 128 entries for small pages, 7 per thread (2×) for large pages<br>L1 D-TLB: 64 entries for small pages, 32 for large pages<br>Both 4-way, LRU replacement | L1 I-TLB: 48 entries<br>L1 D-TLB: 48 entries<br>Both fully associative, LRU replacement |
| L2 TLB (per core) | Single L2 TLB: 512 entries<br>4-way, LRU replacement | L2 I-TLB: 512 entries<br>L2 D-TLB: 512 entries<br>Both 4-way, round-robin LRU |
| TLB misses        | Handled in hardware | Handled in hardware |

## 3-Level Cache Organization

|                                | Intel Nehalem | AMD Opteron X4 |
|--------------------------------|---------------|----------------|
| L1 caches (per core)           | L1 I-cache: 32KB, 64-byte blocks, 4-way, approx LRU replacement, hit time n/a<br>L1 D-cache: 32KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a | L1 I-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles<br>L1 D-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles |
| L2 unified cache (per core)    | 256KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a | 512KB, 64-byte blocks, 16-way, approx LRU replacement, write-back/allocate, hit time n/a |
| L3 unified cache (shared)      | 8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a | 2MB, 64-byte blocks, 32-way, replace block shared by fewest cores, write-back/allocate, hit time 32 cycles |

n/a: data not available
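The capacity, block size, and associativity figures in the table determine how many sets each cache has. The sketch below works through that arithmetic, assuming power-of-two sizes and a simple modulo set index. Real L3 caches may be sliced or hashed differently, so the numbers are only a geometric estimate, and `cache_geometry` is a hypothetical helper.

```python
KB = 1024
MB = 1024 * KB


def cache_geometry(size_bytes, block_bytes, ways):
    """Derive set count and address field widths from basic cache parameters."""
    num_blocks = size_bytes // block_bytes
    num_sets = num_blocks // ways
    return {
        "sets": num_sets,
        "offset_bits": block_bytes.bit_length() - 1,  # log2(block size)
        "index_bits": num_sets.bit_length() - 1,      # log2(number of sets)
    }


# Parameters taken from the table above.
print("Nehalem L1 D:", cache_geometry(32 * KB, 64, 8))
print("Opteron L1 D:", cache_geometry(32 * KB, 64, 2))
print("Nehalem L3:  ", cache_geometry(8 * MB, 64, 16))
print("Opteron L3:  ", cache_geometry(2 * MB, 64, 32))
```

Both L1 data caches are 32KB with 64-byte blocks, but the 8-way Nehalem design has 64 sets while the 2-way Opteron design has 256, so they split the address into index and tag bits differently.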





