EECC550 Exam Review

4 out of 6 questions

1. Multicycle CPU performance vs. Pipelined CPU performance

2. Given MIPS code, MIPS pipeline (similar to questions 2, 3 of HW#4)
   - Performance of code as is on a given CPU
   - Schedule the code to reduce stalls + resulting performance

3. Cache Operation: Given a series of word memory address references, cache capacity and organization: (similar to chapter 7 exercises 9, 10 of HW #5)
   - Find Hits/misses, Hit rate, Final content of cache

4. Pipelined CPU performance with non-ideal memory and unified or split cache
   - Find AMAT, CPI, performance …

5. For a cache level with given characteristics find:
   - Address fields, mapping function, storage requirements etc.

6. Performance evaluation of non-ideal pipelined CPUs using non ideal memory:
   - Desired performance maybe given: Find missing parameter
MIPS CPU Design: Multi-Cycle Datapath (Textbook Version)

CPI: R-Type = 4, Load = 5, Store 4, Jump/Branch = 3
Only one instruction being processed in datapath

How to lower CPI further without increasing CPU clock cycle time, C?

\[ T = I \times CPI \times C \]

Processing an instruction starts when the previous instruction is completed
## Operations In Each Cycle

<table>
<thead>
<tr>
<th></th>
<th>R-Type</th>
<th>Load</th>
<th>Store</th>
<th>Branch</th>
<th>Jump</th>
</tr>
</thead>
</table>
| **IF** Instruction Fetch | IR $\leftarrow$ Mem[PC]  
PC $\leftarrow$ PC + 4 | IR $\leftarrow$ Mem[PC]  
PC $\leftarrow$ PC + 4 | IR $\leftarrow$ Mem[PC]  
PC $\leftarrow$ PC + 4 | IR $\leftarrow$ Mem[PC]  
PC $\leftarrow$ PC + 4 | IR $\leftarrow$ Mem[PC]  
PC $\leftarrow$ PC + 4 |
| **ID** Instruction Decode | A $\leftarrow$ R[rs]  
B $\leftarrow$ R[rt]  
ALUout $\leftarrow$ PC + (SignExt(imm16) x4) | A $\leftarrow$ R[rs]  
B $\leftarrow$ R[rt]  
ALUout $\leftarrow$ PC + (SignExt(imm16) x4) | A $\leftarrow$ R[rs]  
B $\leftarrow$ R[rt]  
ALUout $\leftarrow$ PC + (SignExt(imm16) x4) | A $\leftarrow$ R[rs]  
B $\leftarrow$ R[rt]  
ALUout $\leftarrow$ PC + (SignExt(imm16) x4) | A $\leftarrow$ R[rs]  
B $\leftarrow$ R[rt]  
ALUout $\leftarrow$ PC + (SignExt(imm16) x4) |
| **EX** Execution | ALUout $\leftarrow$ A funct B  
ALUout $\leftarrow$ A + SignEx(Imm16) | ALUout $\leftarrow$ A + SignEx(Imm16) | ALUout $\leftarrow$ A + SignEx(Imm16) | Zero $\leftarrow$ A - B  
Zero: PC $\leftarrow$ ALUout | PC $\leftarrow$ Jump Address |
| **MEM** Memory | | | | | |
| **WB** Write Back | R[rd] $\leftarrow$ ALUout  
R[rt] $\leftarrow$ M | | | | |

### Reducing the CPI by combining cycles increases CPU clock cycle

*Instruction Fetch (IF) & Instruction Decode (ID) cycles are common for all instructions*
Multi-cycle Datapath Instruction CPI

• R-Type/Immediate: Require four cycles, CPI = 4
  – IF, ID, EX, WB

• Loads: Require five cycles, CPI = 5
  – IF, ID, EX, MEM, WB

• Stores: Require four cycles, CPI = 4
  – IF, ID, EX, MEM

• Branches: Require three cycles, CPI = 3
  – IF, ID, EX

• Average program \( 3 \leq CPI \leq 5 \) depending on program profile (instruction mix).

Processing an instruction starts when the previous instruction is completed.
MIPS Multi-cycle Datapath Performance Evaluation

What is the average CPI?
- State diagram gives CPI for each instruction type.
- Workload (program) below gives frequency of each type.

<table>
<thead>
<tr>
<th>Type</th>
<th>CPI_i for type</th>
<th>Frequency</th>
<th>CPI_i x freq_i</th>
</tr>
</thead>
<tbody>
<tr>
<td>Arith/Logic</td>
<td>4</td>
<td>40%</td>
<td>1.6</td>
</tr>
<tr>
<td>Load</td>
<td>5</td>
<td>30%</td>
<td>1.5</td>
</tr>
<tr>
<td>Store</td>
<td>4</td>
<td>10%</td>
<td>0.4</td>
</tr>
<tr>
<td>branch</td>
<td>3</td>
<td>20%</td>
<td>0.6</td>
</tr>
</tbody>
</table>

Average CPI: 4.1

Better than CPI = 5 if all instructions took the same number of clock cycles (5).
Instruction Pipelining

- Instruction pipelining is a CPU implementation technique where multiple operations on a number of instructions are overlapped.
  - For Example: The next instruction is fetched in the next cycle without waiting for the current instruction to complete.

- An instruction execution pipeline involves a number of steps, where each step completes a part of an instruction. Each step is called a pipeline stage or a pipeline segment.

- The stages or steps are connected in a linear fashion: one stage to the next to form the pipeline (or pipelined CPU datapath) -- instructions enter at one end and progress through the stages and exit at the other end.

- The time to move an instruction one step down the pipeline is is equal to the machine (CPU) cycle and is determined by the stage with the longest processing delay.

- Pipelining increases the CPU instruction throughput: The number of instructions completed per cycle.
  - Instruction Pipeline Throughput: The instruction completion rate of the pipeline and is determined by how often an instruction exists the pipeline.
  - Under ideal conditions (no stall cycles), instruction throughput is one instruction per machine cycle, or \[ \text{ideal effective CPI} = 1 \]

- Pipelining does not reduce the execution time of an individual instruction: The time needed to complete all processing steps of an instruction (also called instruction completion latency).
  - Minimum instruction latency = \( n \) cycles, where \( n \) is the number of pipeline stages

(In Chapter 6.1-6.6)
Pipelining: Design Goals

• The length of the machine clock cycle is determined by the time required for the slowest pipeline stage.

• An important pipeline design consideration is to balance the length of each pipeline stage.

• If all stages are perfectly balanced, then the time per instruction on a pipelined machine (assuming ideal conditions with no stalls):

\[
\text{Time per instruction on unpipelined machine} \div \text{Number of pipeline stages}
\]

• Under these ideal conditions:
  – Speedup from pipelining = the number of pipeline stages = n
  – **Goal:** One instruction is completed every cycle: \( \text{CPI} = 1 \).
**Ideal Pipelined Instruction Processing**

**Timing Representation**

<table>
<thead>
<tr>
<th>Instruction Number</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
</tr>
</thead>
<tbody>
<tr>
<td>Instruction I</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Instruction I+1</td>
<td></td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Instruction I+2</td>
<td></td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Instruction I+3</td>
<td></td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Instruction I+4</td>
<td></td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

4 cycles = n -1 = 5 -1

Time to fill the pipeline

- **First instruction, I Completed**
- **Instruction, I+4 completed**

n= 5 Pipeline Stages:
- IF = Instruction Fetch
- ID = Instruction Decode
- EX = Execution
- MEM = Memory Access
- WB = Write Back

Any individual instruction goes through all five pipeline stages taking 5 cycles to complete

Thus instruction latency= 5 cycles

**Ideal pipeline operation without any stall cycles**

Pipeline Fill Cycles: No instructions completed yet
Number of fill cycles = Number of pipeline stages - 1
Here 5 - 1 = 4 fill cycles

Ideal pipeline operation: After fill cycles, one instruction is completed per cycle giving the ideal pipeline CPI = 1 (ignoring fill cycles)
Ideal Pipelined Instruction Processing Representation

Pipeline Fill cycles = 5 - 1 = 4

Number of pipeline fill cycles = Number of stages - 1
Here 5 - 1 = 4

After fill cycles: One instruction is completed every cycle (Effective CPI = 1)
(ideally)

Program Flow

Any individual instruction goes through all five pipeline stages taking 5 cycles to complete
Thus instruction latency = 5 cycles

Ideal pipeline operation without any stall cycles
Single Cycle, Multi-Cycle, Vs. Pipelined CPU

**Single Cycle Implementation:**
- Load: 8 ns
- Store: Waste

**Multiple Cycle Implementation:**
- Cycle 1: Load 2 ns
- Cycle 2: Store
- Cycle 3: R-type

**Pipeline Implementation:**
- Load: IF ID EX MEM WB
- Store: IF ID EX MEM WB
- R-type: IF ID EX MEM WB

Assuming the following datapath/control hardware components delays:
- Memory Units: 2 ns
- ALU and adders: 2 ns
- Register File: 1 ns
- Control Unit < 1 ns
Single Cycle, Multi-Cycle, Pipeline: Performance Comparison Example

For 1000 instructions, execution time:

\[ T = I \times CPI \times C \]

- **Single Cycle Machine:**
  - \( 8 \text{ ns/cycle} \times 1 \text{ CPI} \times 1000 \text{ inst} = 8000 \text{ ns} \)

- **Multi-cycle Machine:**
  - \( 2 \text{ ns/cycle} \times 4.6 \text{ CPI (due to inst mix)} \times 1000 \text{ inst} = 9200 \text{ ns} \)

- **Ideal pipelined machine, 5-stages (effective CPI = 1):**
  - \( 2 \text{ ns/cycle} \times (1 \text{ CPI} \times 1000 \text{ inst} + 4 \text{ cycle fill}) = 2008 \text{ ns} \)
    - Speedup = \( 8000/2008 = 3.98 \) faster than single cycle CPU
    - Speedup = \( 9200/2008 = 4.58 \) times faster than multi cycle CPU
Basic Pipelined CPU Design Steps

1. Analyze instruction set operations using independent RTN => datapath requirements.

2. Select required datapath components and connections.

3. Assemble an initial datapath meeting the ISA requirements.

4. Identify pipeline stages based on operation, balancing stage delays, and ensuring no hardware conflicts exist when common hardware is used by two or more stages simultaneously in the same cycle.

5. Divide the datapath into the stages identified above by adding buffers between the stages of sufficient width to hold:
   - Instruction fields.
   - Remaining control lines needed for remaining pipeline stages.
   - All results produced by a stage and any unused results of previous stages.

6. Analyze implementation of each instruction to determine setting of control points that effects the register transfer taking pipeline hazard conditions into account. (More on this a bit later)

7. Assemble the control logic.
A Basic Pipelined Datapath

Classic Five Stage Integer Pipeline

IF: Instruction Fetch
ID: Instruction Decode
EX: Execution
MEM: Memory
WB: Write Back

Version 1: No forwarding, Branch resolved in MEM stage
Read/Write Access To Register Bank

- Two instructions need to access the register bank in the same cycle:
  - One instruction to read operands in its instruction decode (ID) cycle.
  - The other instruction to write to a destination register in its Write Back (WB) cycle.
- This represents a potential hardware conflict over access to the register bank.
- **Solution:** Coordinate register reads and write in the same cycle as follows:

  - **Operand register reads in Instruction Decode ID cycle** occur in the second half of the cycle (indicated here by the dark shading of the second half of the cycle)
  - **Register write in Write Back WB cycle** occur in the first half of the cycle (indicated here by the dark shading of the first half of the WB cycle)

![Diagram of program execution order and time in clock cycles]

```
Program execution order (in instructions)

l $10, 20($1)
sub $11, $2, $3
```

```
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
IM Reg ALU DM Reg
IM Reg ALU DM Reg
```
Pipeline Control

- Pass needed control signals along from one stage to the next as the instruction travels through the pipeline just like the needed data.

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Execution/Address Calculation stage control lines</th>
<th>Memory access stage control lines</th>
<th>stage control lines</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Reg Dst</td>
<td>ALU Op1</td>
<td>ALU Op0</td>
</tr>
<tr>
<td>R-format</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>lw</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>sw</td>
<td>X</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>beq</td>
<td>X</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>

Diagram:

```
IF /
IF/ID  

ID /
ID/EX  

EX /
EX/MEM  

MEM /
MEM/WB  

WB /
```

EECC550 - Shaaban

#15 Final Exam Review Winter 2005 2-23-2006
Pipelined Datapath with Control Added

Classic Five Stage Integer Pipeline

Figure 6.27 page 404

Version 1: No forwarding, Branch resolved in MEM stage
Basic Performance Issues In Pipelining

• **Pipelining increases the CPU instruction throughput:** The number of instructions completed per unit time.
  
  Under ideal conditions (i.e. No stall cycles):
  
  – Pipelined CPU instruction throughput is one instruction completed per machine cycle, or CPI = 1
    
    (ignoring pipeline fill cycles) Or **Instruction throughput: Instructions Per Cycle = IPC = 1**

• **Pipelining does not reduce the execution time of an individual instruction:** The time needed to complete all processing steps of an instruction (also called instruction completion latency).
  
  – It usually slightly increases the execution time of individual instructions over unpipelined CPU implementations due to:
    
    • The increased control overhead of the pipeline and pipeline stage registers delays +
    
    • Every instruction goes though every stage in the pipeline even if the stage is not needed. (i.e MEM pipeline stage in the case of R-Type instructions)
Pipelining Performance Example

Example: For an unpipelined multicycle CPU:

- Clock cycle = 10ns, 4 cycles for ALU operations and branches and 5 cycles for memory operations with instruction frequencies of 40%, 20% and 40%, respectively.
- If pipelining adds 1ns to the CPU clock cycle then the speedup in instruction execution from pipelining is:

Non-pipelined Average execution time/instruction = Clock cycle x Average CPI

= 10 ns x ((40% + 20%) x 4 + 40% x 5) = 10 ns x 4.4 = 44 ns

In the pipelined CPU implementation, ideal CPI = 1

Pipelined execution time/instruction = Clock cycle x CPI

= (10 ns + 1 ns) x 1 = 11 ns x 1 = 11 ns

Speedup from pipelining = Time Per Instruction time unpipelined

= 44 ns / 11 ns = 4 times faster

T = I x CPI x C  here I did not change
Pipeline Hazards

- Hazards are situations in pipelined CPUs which prevent the next instruction in the instruction stream from executing during the designated clock cycle possibly resulting in one or more stall (or wait) cycles.

- Hazards reduce the ideal speedup (increase CPI > 1) gained from pipelining and are classified into three classes:
  - **Structural hazards:** Arise from hardware resource conflicts when the available hardware cannot support all possible combinations of instructions.
  - **Data hazards:** Arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
  - **Control hazards:** Arise from the pipelining of conditional branches and other instructions that change the PC.
Structural (or Hardware) Hazards

- In pipelined machines overlapped instruction execution requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline.

- If a resource conflict arises due to a hardware resource being required by more than one instruction in a single cycle, and one or more such instructions cannot be accommodated, then a structural hazard has occurred, for example:
  - When a pipelined machine has a shared single-memory for both data and instructions.
  → stall the pipeline for one cycle for memory data access

i.e A hardware component the instruction requires for correct execution is not available in the cycle needed
MIPS with Memory Unit Structural Hazards

A machine with only one memory port will generate a conflict whenever a memory reference occurs.
Data Hazards

- Data hazards occur when the pipeline changes the order of read/write accesses to instruction operands in such a way that the resulting access order differs from the original sequential instruction operand access order of the unpipelined CPU resulting in incorrect execution.

- Data hazards may require one or more instructions to be stalled in the pipeline to ensure correct execution.

- Example:

  1. sub $2, $1, $3
  2. and $12, $2, $5
  3. or $13, $6, $2
  4. add $14, $2, $2
  5. sw $15, 100($2)

  - All the instructions after the sub instruction use its result data in register $2
  - As part of pipelining, these instruction are started before sub is completed:
    - Due to this data hazard instructions need to be stalled for correct execution.

Arrows represent data dependencies between instructions

Instructions that have no dependencies among them are said to be parallel or independent

A high degree of Instruction-Level Parallelism (ILP) is present in a given code sequence if it has a large number of parallel instructions

i.e Correct operand data not ready yet when needed in EX cycle
Data Hazards Example

• Problem with starting next instruction before first is finished
  - Data dependencies here that “go backward in time” create data hazards.

<table>
<thead>
<tr>
<th>Time (in clock cycles)</th>
<th>CC 1</th>
<th>CC 2</th>
<th>CC 3</th>
<th>CC 4</th>
<th>CC 5</th>
<th>CC 6</th>
<th>CC 7</th>
<th>CC 8</th>
<th>CC 9</th>
</tr>
</thead>
<tbody>
<tr>
<td>Value of register $2$:</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>10</td>
<td>10</td>
<td>10</td>
<td>10</td>
<td>10</td>
<td>10/–</td>
<td>20</td>
<td>20</td>
<td>20</td>
<td>20</td>
</tr>
</tbody>
</table>

1. sub $2, $1, $3
2. and $12, $2, $5
3. or $13, $6, $2
4. add $14, $2, $2
5. sw $15, 100($2)

Program execution order (in instructions)

1. sub $2, $1, $3
2. and $12, $2, $5
3. or $13, $6, $2
4. add $14, $2, $2
5. sw $15, 100($2)
Data Hazard Resolution: Stall Cycles

Stall the pipeline by a number of cycles.
The control unit must detect the need to insert stall cycles.

In this case two stall cycles are needed.

Without forwarding
(Pipelined CPU Version 1)

2 Stall cycles inserted here to resolve data hazard and ensure correct execution
Performance of Pipelines with Stalls

- Hazard conditions in pipelines may make it necessary to stall the pipeline by a number of cycles degrading performance from the ideal pipelined CPU CPI of 1.

\[
CPI_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction} \\
= 1 + \text{Pipeline stall clock cycles per instruction}
\]

- If pipelining overhead is ignored and we assume that the stages are perfectly balanced then speedup from pipelining is given by:

\[
\text{Speedup} = \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \\
= \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}}
\]

- When all instructions in the multicycle CPU take the same number of cycles equal to the number of pipeline stages then:

\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}
\]
Data Hazard Resolution/Stall Reduction: Data Forwarding

- **Observation:**
  Why not use temporary results produced by memory/ALU and not wait for them to be written back in the register bank.

- **Data Forwarding** is a hardware-based technique (also called register bypassing or register short-circuiting) used to eliminate or minimize data hazard stalls that makes use of this observation.

- Using forwarding hardware, the result of an instruction is copied directly (i.e. forwarded) from where it is produced (ALU, memory read port etc.), to where subsequent instructions need it (ALU input register, memory write port etc.)
Pipelined Datapath With Forwarding

(Pipelined CPU Version 2: With forwarding, Branches still resolved in MEM Stage)

- The forwarding unit compares operand registers of the instruction in EX stage with destination registers of the previous two instructions in MEM and WB
- If there is a match one or both operands will be obtained from forwarding paths bypassing the registers
Data Hazard Example With Forwarding

Value of register $2$: 10 10 10 10 10/–20 –20 –20 –20 –20
Value of EX/MEM: X X X –20 X X X X X
Value of MEM/WB: X X X X X –20 X X X X

Program execution order (in instructions)

1. sub $2, $1, $3
2. and $12, $2, $5
3. or $13, $6, $2
4. add $14, $2, $2
5. sw $15, 100($2)

What registers numbers are being compared by the forwarding unit during cycle 5? What about in Cycle 6?
A Data Hazard Requiring A Stall

A load followed by an R-type instruction that uses the loaded value (or any other type of instruction that needs loaded value in ex stage)

Even with forwarding in place a stall cycle is needed
This condition must be detected by hardware
A Data Hazard Requiring A Stall

A load followed by an R-type instruction that uses the loaded value results in a single stall cycle even with forwarding as shown:

<table>
<thead>
<tr>
<th>Program execution order (in instructions)</th>
<th>Time (in clock cycles)</th>
</tr>
</thead>
<tbody>
<tr>
<td>lw $2, 20($1)</td>
<td>CC 1</td>
</tr>
<tr>
<td>and $4, $2, $5</td>
<td>CC 2</td>
</tr>
<tr>
<td>or $8, $2, $6</td>
<td>CC 3</td>
</tr>
<tr>
<td>add $9, $4, $2</td>
<td>CC 4</td>
</tr>
<tr>
<td>slt $1, $6, $7</td>
<td>CC 5</td>
</tr>
<tr>
<td></td>
<td>CC 6</td>
</tr>
<tr>
<td></td>
<td>CC 7</td>
</tr>
<tr>
<td></td>
<td>CC 8</td>
</tr>
<tr>
<td></td>
<td>CC 9</td>
</tr>
<tr>
<td></td>
<td>CC 10</td>
</tr>
</tbody>
</table>

- We can stall the pipeline by keeping all instructions following the “lw” instruction in the same pipeline stage for one cycle.

What is the hazard detection unit (shown next slide) doing during cycle 3?
Datapath With Hazard Detection Unit

A load followed by an instruction that uses the loaded value is detected by the hazard detection unit and a stall cycle is inserted. The hazard detection unit checks if the instruction in the EX stage is a load by checking its MemRead control line value. If that instruction is a load it also checks if any of the operand registers of the instruction in the decode stage (ID) match the destination register of the load. In case of a match it inserts a stall cycle (delays decode and fetch by one cycle).

A stall if needed is created by disabling instruction write (keep last instruction) in IF/ID and by inserting a set of control values with zero values in ID/EX.

(Still Pipelined CPU Version 2: With forwarding, Branches still resolved in MEM Stage)

Figure 6.36 page 416
Compiler Scheduling Example

- Reorder the instructions to avoid as many pipeline stalls as possible:

\[
\begin{align*}
&\text{lw} \quad $15, 0($2) \\
&\text{lw} \quad $16, 4($2) \\
&\text{add} \quad $14, $5, $16 \\
&\text{sw} \quad $16, 4($2)
\end{align*}
\]

- The data hazard occurs on register $16 between the second lw and the add instruction resulting in a stall cycle even with forwarding.
- With forwarding we (or the compiler) need to find only one independent instruction to place between them, swapping the lw instructions works:

\[
\begin{align*}
&\text{lw} \quad $16, 4($2) \\
&\text{lw} \quad $15, 0($2) \\
&\text{nop} \\
&\text{add} \quad $14, $5, $16 \\
&\text{sw} \quad $16, 4($2)
\end{align*}
\]

- Without forwarding we need two independent instructions to place between them, so in addition a nop is added (or the hardware will insert a stall).

\[
\begin{align*}
&\text{lw} \quad $16, 4($2) \\
&\text{lw} \quad $15, 0($2) \\
&\text{nop} \\
&\text{add} \quad $14, $5, $16 \\
&\text{sw} \quad $16, 4($2)
\end{align*}
\]

Or stall cycle → nop → add → sw → $16, 4($2)
Control Hazards

When a conditional branch is executed it may change the PC (when taken) and, without any special measures, leads to stalling the pipeline for a number of cycles until the branch condition is known and PC is updated (branch is resolved). Otherwise the PC may not be correct when needed in IF.

In current MIPS pipeline, the conditional branch is resolved in stage 4 (MEM stage) resulting in three stall cycles as shown below:

<table>
<thead>
<tr>
<th>Branch instruction</th>
<th>IF</th>
<th>ID</th>
<th>EX</th>
<th>MEM</th>
<th>WB</th>
</tr>
</thead>
<tbody>
<tr>
<td>Branch instruction</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Branch successor</td>
<td>stall</td>
<td>stall</td>
<td>stall</td>
<td>IF</td>
<td>ID</td>
</tr>
<tr>
<td>Branch successor + 1</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td>WB</td>
</tr>
<tr>
<td>Branch successor + 2</td>
<td>3 stall cycles</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
</tr>
<tr>
<td>Branch successor + 3</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td></td>
</tr>
<tr>
<td>Branch successor + 4</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td></td>
</tr>
<tr>
<td>Branch successor + 5</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td></td>
</tr>
</tbody>
</table>

Assuming we stall or flush the pipeline on a branch instruction:

- Three clock cycles are wasted for every branch for current MIPS pipeline
  - Branch Penalty = stage number where branch is resolved - 1
    - Here Branch Penalty = 4 - 1 = 3 Cycles

i.e Correct PC is not available when needed in IF
Basic Branch Handling in Pipelines

1 One scheme discussed earlier is to **always stall** \((flush \ or \ freeze)\) the pipeline whenever a conditional branch is decoded by holding or deleting any instructions in the pipeline until the branch destination is known (zero pipeline registers, control lines).

   \[
   \text{Pipeline stall cycles from branches} = \text{frequency of branches} \times \text{branch penalty}
   \]

   - Ex: Branch frequency = 20% branch penalty = 3 cycles
     \[
     \text{CPI} = 1 + 0.2 \times 3 = 1.6
     \]

2 Another method is to **assume or predict that the branch is not taken** where the state of the machine is not changed until the branch outcome is definitely known. Execution here continues with the next instruction; **stall occurs here when the branch is taken.**

   \[
   \text{Pipeline stall cycles from branches} = \text{frequency of taken branches} \times \text{branch penalty}
   \]

   - Ex: Branch frequency = 20% of which 45% are taken branch penalty = 3 cycles
     \[
     \text{CPI} = 1 + 0.2 \times 0.45 \times 3 = 1.27
     \]
Control Hazards: Example

- Three other instructions are in the pipeline before branch instruction target decision is made when BEQ is in MEM stage.

In the above diagram, we are predicting “branch not taken”
  - Need to add hardware for flushing the three following instructions if we are wrong losing three cycles when the branch is taken.
Reducing Delay (Penalty) of Taken Branches

- So far: Next PC of a branch known or resolved in MEM stage: Costs three lost cycles if the branch is taken.
- If next PC of a branch is known or resolved in EX stage, one cycle is saved.
- Branch address calculation can be moved to ID stage (stage 2) using a register comparator, costing only one cycle if branch is taken as shown below. Branch Penalty = stage 2 - 1 = 1 cycle

Pipelined CPU
Version 3:
With forwarding, Branches resolved in ID stage

Here the branch is resolved in ID stage (stage 2)
Thus branch penalty if taken = 2 - 1 = 1 cycle

Figure 6.41 page 427
ISA Reduction of Branch Penalties: Delayed Branch

• When delayed branch is used in an ISA, the branch is delayed by \( n \) cycles (or instructions), following this execution pattern:

\[
\text{conditional branch instruction} \\
\text{sequential successor}_1 \\
\text{sequential successor}_2 \\
\ldots \\
\text{sequential successor}_n \\
\text{branch target if taken}
\]

• The sequential successor instructions are said to be in the branch delay slots. These instructions are executed whether or not the branch is taken.

• In Practice, all ISAs that utilize delayed branching including MIPS utilize a single instruction branch delay slot. (All RISC ISAs)
  – The job of the compiler is to make the successor instruction in the delay slot a valid and useful instruction.
Compiler Instruction Scheduling Example
With Branch Delay Slot

• Schedule the following MIPS code for the pipelined MIPS CPU with forwarding and reduced branch delay using a single branch delay slot to minimize stall cycles:

```
loop:  lw $1,0($2)    # $1 array element
       add $1, $1, $3  # add constant in $3
       sw $1,0($2)    # store result array element
       addi $2, $2, -4 # decrement address by 4
       bne $2, $4, loop # branch if $2 != $4
```

• Assuming the initial value of $2 = $4 + 40
  (i.e it loops 10 times)
  – What is the CPI and total number of cycles needed to run the code with and without scheduling?
Compiler Instruction Scheduling Example (With Branch Delay Slot)

• Without compiler scheduling
  
  loop:  lw $1,0($2)  
  Stall  
  add $1, $1, $3  
  sw $1,0($2)  
  addi $2, $2, -4  
  Stall  
  bne $2, $4, loop  
  Stall (or NOP)

  Ignoring the initial 4 cycles to fill the pipeline:
  Each iteration takes = 8 cycles
  CPI = 8/5 = 1.6
  Total cycles = 8 x 10 = 80 cycles

• With compiler scheduling
  
  loop:  lw $1,0($2)  
  addi $2, $2, -4  
  add $1, $1, $3  
  bne $2, $4, loop  
  sw $1, 4($2)

  Ignoring the initial 4 cycles to fill the pipeline:
  Each iteration takes = 5 cycles
  CPI = 5/5 = 1
  Total cycles = 5 x 10 = 50 cycles
  Speedup = 80/50 = 1.6

Target CPU: Pipelined CPU  Version 3: With forwarding, Branches resolved in ID stage  (Figure 6.41 page 427)
Pipeline Performance Example

- Assume the following MIPS instruction mix:

<table>
<thead>
<tr>
<th>Type</th>
<th>Frequency</th>
<th>of which 25% are followed immediately by an instruction using the loaded value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Arith/Logic</td>
<td>40%</td>
<td></td>
</tr>
<tr>
<td>Load</td>
<td>30%</td>
<td></td>
</tr>
<tr>
<td>Store</td>
<td>10%</td>
<td></td>
</tr>
<tr>
<td>branch</td>
<td>20%</td>
<td></td>
</tr>
</tbody>
</table>

- What is the resulting CPI for the pipelined MIPS with forwarding and branch address calculation in ID stage when using the branch not-taken scheme?

\[
\text{CPI} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction} = 1 + \frac{.3 \times .25 \times 1}{1} + \frac{.2 \times .45 \times 1}{1} = 1 + .075 + .09 = 1.165
\]

When the ideal memory assumption is removed, this CPI becomes the base CPI with ideal memory or CPI_{execution}.

Branch Penalty = 1 cycle
Levels of The Memory Hierarchy

In this course, we concentrate on the design, operation and performance of a single level of cache L1 (either unified or separate) when using non-ideal main memory.
Memory Hierarchy Operation

- If an instruction or operand is required by the CPU, the levels of the memory hierarchy are searched for the item starting with the level closest to the CPU (Level 1 cache):
  - If the item is found, it’s delivered to the CPU resulting in a **cache hit** without searching lower levels.
  - If the item is missing from an upper level, resulting in a **cache miss**, the level just below is searched.
  - For systems with several levels of cache, the search continues with cache level 2, 3 etc.
  - If all levels of cache report a miss then main memory is accessed for the item.
    - CPU ↔ cache ↔ memory: Managed by hardware.
    - If the item is not found in main memory resulting in a page fault, then disk (virtual memory), is accessed for the item.
      - Memory ↔ disk: Managed by the operating system with hardware support

Hit rate for level one cache = \( H_1 \)
Miss rate for level one cache = \( 1 - \text{Hit rate} = 1 - H_1 \)
Memory Hierarchy: Terminology

- **A Block**: The smallest unit of information transferred between two levels.
- **Hit**: Item is found in some block in the upper level (example: Block X)
  - **Hit Rate**: The fraction of memory access found in the upper level.
  - **Hit Time**: Time to access the upper level which consists of RAM access time + Time to determine hit/miss

- **Miss**: Item needs to be retrieved from a block in the lower level (Block Y)
  - **Miss Rate** = 1 - (Hit Rate)
  - **Miss Penalty**: Time to replace a block in the upper level + Time to deliver the missed block to the processor

- **Hit Time << Miss Penalty**

<table>
<thead>
<tr>
<th>To Processor</th>
<th>Upper Level Memory</th>
<th>From Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>A block</td>
<td>e.g. cache</td>
<td>Blk X</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Lower Level Memory</th>
<th>e.g. main memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>Blk Y</td>
<td></td>
</tr>
</tbody>
</table>
Basic Cache Concepts

- Cache is the first level of the memory hierarchy once the address leaves the CPU and is searched first for the requested data.

- If the data requested by the CPU is present in the cache, it is retrieved from cache and the data access is a cache hit otherwise a cache miss and data must be read from main memory.

- On a cache miss a block of data must be brought in from main memory to cache to possibly replace an existing cache block.

- The allowed block addresses where blocks can be mapped (placed) into cache from main memory is determined by cache placement strategy.

- Locating a block of data in cache is handled by cache block identification mechanism (tag checking).

- On a cache miss choosing the cache block being removed (replaced) is handled by the block replacement strategy in place.

- When a write to cache is requested, a number of main memory update strategies exist as part of the cache write policy.
Cache Block Frame

Cache is comprised of a number of cache block frames

Data Storage: Number of bytes is the size of a cache block or cache line size (Cached instructions or data go here)

Tag: Used to identify if the address supplied matches the address of the data stored

Valid Bit: Indicates whether the cache block frame contains valid data

The tag and valid bit are used to determine whether we have a cache hit or miss

Cache utilizes faster memory (SRAM)

Nominal or stated Size of cache refers to the number of bytes of data in all cache block frames and does not include tags, valid bits
Cache Organization & Placement Strategies

Placement strategies or mapping of a main memory data block onto cache block frame addresses divide cache into three organizations:

1. **Direct mapped cache:** A block can be placed in only one location (cache block frame), given by the mapping function:
   \[
   \text{index} = \frac{\text{Block address}}{\text{Number of blocks in cache}}
   \]
   Least complex to implement

2. **Fully associative cache:** A block can be placed anywhere in cache. (no mapping function).
   Most complex cache organization to implement

3. **Set associative cache:** A block can be placed in a restricted set of places, or cache block frames. A set is a group of block frames in the cache. A block is first mapped onto the set and then it can be placed anywhere within the set. The set in this case is chosen by:
   \[
   \text{index} = \frac{\text{Block address}}{\text{Number of sets in cache}}
   \]
   If there are \( n \) blocks in a set the cache placement is called \( n \)-way set-associative.
   Most common cache organization
Cache Organization: Direct Mapped Cache

A block in memory can be placed in one location (cache block frame) only, given by: \((\text{Block address}) \mod (\text{Number of blocks in cache})\)

In this case, mapping function: \((\text{Block address}) \mod (8)\)

8 cache block frames

Here four blocks in memory map to the same cache block frame

32 memory blocks cacheable

Limitation of Direct Mapped Cache: Conflicts between memory blocks that map to the same cache block frame
4KB Direct Mapped Cache Example

1K = $2^{10} = 1024$ Blocks
Each block = one word (4 bytes)

Can cache up to $2^{32}$ bytes = 4 GB of memory

Mapping function:
Cache Block frame number = (Block address) MOD (1024)

i.e. Index field or 10 low bits of block address

Block Address = 30 bits
Tag = 20 bits
Index = 10 bits
Offset = 2 bits

Address from CPU
Tag field (20 bits)
Index field (10 bits)
Block offset (2 bits)

Hit or Miss Logic
(Hit or Miss?)

SRAM

Hit
Data

Tag
Index
Offset

=
### Direct Mapped Cache Operation Example

Given a series of 16 memory address references given as word addresses:

1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17.

Assume a direct mapped cache with 16 one-word blocks that is initially empty, label each reference as a hit or miss and show the final content of cache.

Here: Block Address = Word Address  
Mapping Function = (Block Address) MOD 16

<table>
<thead>
<tr>
<th>Cache Block Frame#</th>
<th>1</th>
<th>4</th>
<th>8</th>
<th>5</th>
<th>20</th>
<th>17</th>
<th>19</th>
<th>56</th>
<th>9</th>
<th>11</th>
<th>4</th>
<th>43</th>
<th>5</th>
<th>6</th>
<th>9</th>
<th>17</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Hit</td>
<td>Miss</td>
<td>Hit</td>
<td>Hit</td>
<td>Hit</td>
</tr>
<tr>
<td>0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>17</td>
<td>17</td>
<td>17</td>
<td>17</td>
<td>17</td>
<td>17</td>
<td>17</td>
<td>17</td>
<td>17</td>
<td>17</td>
<td>17</td>
</tr>
<tr>
<td>2</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>5</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>6</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>7</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>8</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>9</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>10</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>11</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>12</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>13</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>14</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>15</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Initial Cache Content (empty)

Cache Content After Each Reference

Hit Rate = # of hits / # memory references = 3/16 = 18.75%

Final Cache Content

Hit Rate = # of hits / # memory references = 3/16 = 18.75%
64KB Direct Mapped Cache Example

4K = $2^{12} = 4096$ blocks
Each block = four words = 16 bytes

Can cache up to $2^{32}$ bytes = 4 GB of memory

Larger cache blocks take better advantage of spatial locality and thus may result in a lower miss rate

Mapping Function: Cache Block frame number = (Block address) MOD (4096)
i.e. Index field or 12 low bits of block address
Given the same series of 16 memory address references given as word addresses:
1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17.

Assume a direct mapped cache with four word blocks and a total of 16 words that is initially empty, label each reference as a hit or miss and show the final content of cache.

Cache has 16/4 = 4 cache block frames (each has four words)

Here: Block Address = Integer (Word Address/4)

### Mapping Function

*(Block Address) MOD 4*

<table>
<thead>
<tr>
<th>Cache Block Frame#</th>
<th>1</th>
<th>4</th>
<th>8</th>
<th>5</th>
<th>20</th>
<th>17</th>
<th>19</th>
<th>56</th>
<th>9</th>
<th>11</th>
<th>4</th>
<th>43</th>
<th>5</th>
<th>6</th>
<th>9</th>
<th>17</th>
</tr>
</thead>
<tbody>
<tr>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Hit</td>
<td>Miss</td>
<td>Miss</td>
<td>Hit</td>
<td>Miss</td>
<td>Miss</td>
<td>Hit</td>
<td>Miss</td>
<td>Hit</td>
<td>Hit</td>
<td>Miss</td>
<td>Hit</td>
<td>Hit</td>
<td>Hit/Miss</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>16</td>
<td>16</td>
<td>16</td>
<td>16</td>
<td>16</td>
<td>16</td>
<td>16</td>
<td>16</td>
<td>16</td>
<td>16</td>
<td>16</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>20</td>
<td>20</td>
<td>20</td>
<td>20</td>
<td>20</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>2</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>56</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>40</td>
<td>40</td>
<td>40</td>
<td>8</td>
</tr>
<tr>
<td>3</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>8</td>
</tr>
</tbody>
</table>

**Initial Cache Content (empty)**

**Starting word address of Cache Frames**

**Content After Each Reference**

Hit Rate = # of hits / # memory references = 6/16 = 37.5%
### Block size = 4 words

**Mapping**
- i.e. low two bits of block address

<table>
<thead>
<tr>
<th>Given Word address</th>
<th>Block address</th>
<th>Cache Block Frame # ((\text{Block address}) \mod 4)</th>
<th>word address range in frame (4 words)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0-3</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>1</td>
<td>4-7</td>
</tr>
<tr>
<td>8</td>
<td>2</td>
<td>2</td>
<td>8-11</td>
</tr>
<tr>
<td>5</td>
<td>1</td>
<td>1</td>
<td>4-7</td>
</tr>
<tr>
<td>20</td>
<td>5</td>
<td>1</td>
<td>20-23</td>
</tr>
<tr>
<td>17</td>
<td>4</td>
<td>0</td>
<td>16-19</td>
</tr>
<tr>
<td>19</td>
<td>4</td>
<td>0</td>
<td>16-19</td>
</tr>
<tr>
<td>56</td>
<td>14</td>
<td>2</td>
<td>56-59</td>
</tr>
<tr>
<td>9</td>
<td>2</td>
<td>2</td>
<td>8-11</td>
</tr>
<tr>
<td>11</td>
<td>2</td>
<td>2</td>
<td>8-11</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>1</td>
<td>4-7</td>
</tr>
<tr>
<td>43</td>
<td>10</td>
<td>2</td>
<td>40-43</td>
</tr>
<tr>
<td>5</td>
<td>1</td>
<td>1</td>
<td>4-7</td>
</tr>
<tr>
<td>6</td>
<td>1</td>
<td>1</td>
<td>4-7</td>
</tr>
<tr>
<td>9</td>
<td>2</td>
<td>2</td>
<td>8-11</td>
</tr>
<tr>
<td>17</td>
<td>4</td>
<td>0</td>
<td>16-19</td>
</tr>
</tbody>
</table>

**Block Address** = Integer (Word Address/4)

---

**Word Addresses vs. Block Addresses and Frame Content for Previous Example**

- 2 bits
- i.e. low two bits of block address

---

EECC550 - Shaaban

#52 Final Exam Review Winter 2005 2-23-2006
Cache Organization:

Set Associative Cache

Set associative cache reduces cache misses by **reducing conflicts** between blocks that would have been mapped to the same cache block frame in the case of direct mapped cache.

1-way set associative: (direct mapped)
1 block frame per set

2-way set associative:
2 blocks frames per set

4-way set associative:
4 blocks frames per set

8-way set associative:
8 blocks frames per set
In this case it becomes fully associative since total number of block frames = 8

EECC550 - Shaaban
Cache Organization/Mapping Example

- **Fully associative:**
  - block 12 can go anywhere

- **Direct mapped:**
  - block 12 can go only into block 4
  - (12 mod 8)

- **2-way:**
  - Set associative:
    - block 12 can go anywhere in set 0
    - (12 mod 4)

---

**Cache**
8 Block Frames

**Memory**
32 Block Frames

This example cache has eight block frames and memory has 32 blocks.

12 = 1100
4K Four-Way Set Associative Cache: MIPS Implementation Example

1024 block frames
Each block = one word
4-way set associative
1024 / 4 = 2^8 = 256 sets

Can cache up to 2^{32} bytes = 4 GB of memory

Set associative cache requires parallel tag matching and more complex hit logic which may increase hit time

Mapping Function: Cache Set Number = index = (Block address) MOD (256)
Cache Replacement Policy

• When a cache miss occurs the cache controller may have to select a block of cache data to be removed from a cache block frame and replaced with the requested data, such a block is selected by one of three methods:

(No cache replacement policy in direct mapped cache)

– **Random:**
  • Any block is randomly selected for replacement providing uniform allocation.
  • Simple to build in hardware. Most widely used cache replacement strategy.

– **Least-recently used (LRU):**
  • Accesses to blocks are recorded and and the block replaced is the one that was not used for the longest period of time.
  • Full LRU is *expensive* to implement, as the number of blocks to be tracked increases, and is usually approximated by block usage bits that are cleared at regular time intervals.

– **First In, First Out (FIFO):**
  • Because LRU can be complicated to implement, this approximates LRU by determining the oldest block rather than LRU
## Miss Rates for Caches with Different Size, Associativity & Replacement Algorithm

### Sample Data

<table>
<thead>
<tr>
<th>Associativity:</th>
<th>2-way</th>
<th>4-way</th>
<th>8-way</th>
</tr>
</thead>
<tbody>
<tr>
<td>Size</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>16 KB</td>
<td>LRU</td>
<td>Random</td>
<td>LRU</td>
</tr>
<tr>
<td></td>
<td>5.18%</td>
<td>5.69%</td>
<td>4.67%</td>
</tr>
<tr>
<td>64 KB</td>
<td>LRU</td>
<td>Random</td>
<td>LRU</td>
</tr>
<tr>
<td></td>
<td>1.88%</td>
<td>2.01%</td>
<td>1.54%</td>
</tr>
<tr>
<td>256 KB</td>
<td>LRU</td>
<td>Random</td>
<td>LRU</td>
</tr>
<tr>
<td></td>
<td>1.15%</td>
<td>1.17%</td>
<td>1.13%</td>
</tr>
</tbody>
</table>

Program steady state cache miss rates are given
Initially cache is empty and miss rates ~ 100%

FIFO replacement miss rates (not shown here) is better than random but worse than LRU

For SPEC92
### 2-Way Set Associative Cache Operation Example

- **Given the same series of 16 memory address references given as word addresses:**

  1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17.  (LRU Replacement)

- **Assume a two-way set associative cache with one word blocks and a total size of 16 words that is initially empty, label each reference as a hit or miss and show the final content of cache**

- **Here:** Block Address = Word Address  
  Mapping Function = Set # = (Block Address) MOD 8

<table>
<thead>
<tr>
<th>Cache Set #</th>
<th>1</th>
<th>4</th>
<th>8</th>
<th>5</th>
<th>20</th>
<th>17</th>
<th>19</th>
<th>56</th>
<th>9</th>
<th>11</th>
<th>4</th>
<th>43</th>
<th>5</th>
<th>6</th>
<th>9</th>
<th>17</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Miss</td>
<td>Hit</td>
<td>Miss</td>
<td>Hit</td>
<td>Miss</td>
<td>Hit</td>
<td>Hit</td>
<td>Hit</td>
<td>Hit/Miss</td>
</tr>
<tr>
<td>0</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>LRU</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>9</td>
<td>9</td>
<td>9</td>
<td>9</td>
<td>9</td>
<td>9</td>
<td>9</td>
<td>9</td>
<td>LRU</td>
</tr>
<tr>
<td>2</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>19</td>
<td>19</td>
<td>19</td>
<td>19</td>
<td>19</td>
<td>19</td>
<td>19</td>
<td>43</td>
<td>43</td>
<td>43</td>
<td>43</td>
<td>43</td>
<td>LRU</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>LRU</td>
</tr>
<tr>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>Final Cache Content</td>
</tr>
<tr>
<td>6</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>7</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

**Cache Content After Each Reference**

Hit Rate = # of hits / # memory references = 4/16 = 25%

Replacement policy: LRU = Least Recently Used
Locating A Data Block in Cache

- Each block frame in cache has an address tag.
- The tags of every cache block that might contain the required data are checked or searched in parallel.
- A valid bit is added to the tag to indicate whether this entry contains a valid address.
- The address from the CPU to cache is divided into:
  - A block address, further divided into:
    - An index field to choose a block set or frame in cache.
      (no index field when fully associative).
    - A tag field to search and match addresses in the selected set.
  - A block offset to select the data from the block.
Address Field Sizes/Mapping

Physical Address Generated by CPU
(The size of this address depends on amount of cacheable physical main memory)

Block Address

Tag

Index

Offset

Block offset size = \( \log_2(\text{block size}) \)

Index size = \( \log_2(\text{Total number of blocks/associativity}) \)

Tag size = address size - index size - offset size

Mapping function: (From memory block to cache)

Cache set or block frame number = Index =

= (Block Address) MOD (Number of Sets)

Fully associative cache has no index field or mapping function

e.g. no index field
Calculating Number of Cache Bits Needed

How many total bits are needed for a direct-mapped cache with 64 KBytes of data and one word blocks, assuming a 32-bit address?

- 64 Kbytes = 16 K words = $2^{14}$ words = $2^{14}$ blocks
- Block size = 4 bytes => offset size = $\log_2(4) = 2$ bits,
- $#\text{sets} = #\text{blocks} = 2^{14} =>$ index size = 14 bits
- Tag size = address size - index size - offset size = 32 - 14 - 2 = 16 bits
- Bits/block = data bits + tag bits + valid bit = 32 + 16 + 1 = 49
- Bits in cache = #blocks x bits/block = $2^{14} x 49 = 98$ Kbytes

How many total bits would be needed for a 4-way set associative cache to store the same amount of data?

- Block size and #blocks does not change.
- $#\text{sets} = #\text{blocks}/4 = (2^{14})/4 = 2^{12} =>$ index size = 12 bits
- Tag size = address size - index size - offset = 32 - 12 - 2 = 18 bits
- Bits/block = data bits + tag bits + valid bit = 32 + 18 + 1 = 51
- Bits in cache = #blocks x bits/block = $2^{14} x 51 = 102$ Kbytes

Increase associativity => increase bits in cache

More bits in tag
Calculating Cache Bits Needed

- How many total bits are needed for a direct-mapped cache with 64 KBytes of data and 8 word (32 byte) blocks, assuming a 32-bit address (it can cache $2^{32}$ bytes in memory)?
  - $64 \text{ Kbytes} = 2^{14} \text{ words} = (2^{14})/8 = 2^{11} \text{ blocks}$
  - block size = 32 bytes
    => offset size = block offset + byte offset = $\log_2(32) = 5$ bits,
  - #sets = #blocks = $2^{11}$ => index size = 11 bits
  - tag size = address size - index size - offset size = $32 - 11 - 5 = 16$ bits
  - bits/block = data bits + tag bits + valid bit = $8 \times 32 + 16 + 1 = 273$ bits
  - bits in cache = #blocks x bits/block = $2^{11} \times 273 = 68.25 \text{ Kbytes}$

- Increase block size => decrease bits in cache.

Fewer cache block frames thus fewer tags/valid bits
Unified vs. Separate Level 1 Cache

- **Unified Level 1 Cache (Princeton Memory Architecture).**
  A single level 1 ($L_1$) cache is used for both instructions and data.

- **Separate instruction/data Level 1 caches (Harvard Memory Architecture):**
  The level 1 ($L_1$) cache is split into two caches, one for instructions (instruction cache, $L_1$ I-cache) and the other for data (data cache, $L_1$ D-cache).

---

Unified Level 1 Cache (Princeton Memory Architecture)

Separate (Split) Level 1 Caches (Harvard Memory Architecture)
Memory Hierarchy/Cache Performance:
Average Memory Access Time (AMAT), Memory Stall cycles

- **The Average Memory Access Time (AMAT):** The number of cycles required to complete an average memory access request by the CPU.
- **Memory stall cycles per memory access:** The number of stall cycles added to CPU execution cycles for one memory access.
- **Memory stall cycles per average memory access = (AMAT -1)**
- **For ideal memory:** AMAT = 1 cycle, this results in zero memory stall cycles.
- **Memory stall cycles per average instruction =**

  Number of memory accesses per instruction

  \[
  \text{Instruction Fetch} \quad \times \text{Memory stall cycles per average memory access} \\
  = (1 + \text{fraction of loads/stores}) \times (\text{AMAT} - 1) \\
  \]

  **Base CPI = CPI}_{\text{execution}} = CPI with ideal memory**

  **CPI = CPI}_{\text{execution}} + Mem Stall cycles per instruction**
**Cache Performance:**

**Single Level L1 Princeton (Unified) Memory Architecture**

CPU time = Instruction count \( \times \) CPI \( \times \) Clock cycle time

\( \text{CPI}_{\text{execution}} = \text{CPI with ideal memory} \)

\[
\begin{align*}
\text{CPI} &= \text{CPI}_{\text{execution}} + \text{Mem Stall cycles per instruction} \\
\text{Mem Stall cycles per instruction} &= \text{Memory accesses per instruction} \times \text{Memory stall cycles per access}
\end{align*}
\]

Assuming no stall cycles on a cache hit (cache access time = 1 cycle, stall = 0)

Cache Hit Rate = \(H_1\) \quad Miss Rate = 1- \(H_1\) \quad Miss Penalty = \(M\)

Memory stall cycles per memory access = Miss rate \times Miss penalty = (1- \(H_1\)) \times M

\(\text{AMAT} = 1 + \text{Miss rate} \times \text{Miss penalty}\)

Memory accesses per instruction = (1 + fraction of loads/stores)

Miss Penalty = \(M = \text{the number of stall cycles resulting from missing in cache}

= \text{Main memory access time} - 1

Thus for a unified L1 cache with no stalls on a cache hit:

\[
\begin{align*}
\text{CPI} &= \text{CPI}_{\text{execution}} + (1 + \text{fraction of loads/stores}) \times (1 - H_1) \times M \\
\text{AMAT} &= 1 + (1 - H_1) \times M
\end{align*}
\]
Memory Access Tree:
For Unified Level 1 Cache

L1
L1 Hit:
% = Hit Rate = H1
Hit Access Time = 1
Stall cycles per access = 0
Stall = H1 x 0 = 0
(No Stall)

L1 Miss:
% = (1 - Hit rate) = (1 - H1)
Access time = M + 1
Stall cycles per access = M
Stall = M x (1 - H1)

AMAT = H1 x 1 + (1 - H1) x (M + 1) = 1 + M x (1 - H1)

Stall Cycles Per Access = AMAT - 1 = M x (1 - H1)
CPI = CPI_{execution} + (1 + fraction of loads/stores) x M x (1 - H1)

M = Miss Penalty = stall cycles per access resulting from missing in cache
M + 1 = Miss Time = Main memory access time
H1 = Level 1 Hit Rate
1 - H1 = Level 1 Miss Rate
Cache Performance Example

- Suppose a CPU executes at Clock Rate = 200 MHz (5 ns per cycle) with a single level of cache.
- \( \text{CPI}_{\text{execution}} = 1.1 \) (i.e base CPI with ideal memory)
- Instruction mix: 50% arith/logic, 30% load/store, 20% control
- Assume a cache miss rate of 1.5% and a miss penalty of \( M = 50 \) cycles.

\[
\text{CPI} = \text{CPI}_{\text{execution}} + \text{mem stalls per instruction}
\]

Mem Stalls per instruction =

- Mem accesses per instruction \( \times \) Miss rate \( \times \) Miss penalty
- Mem accesses per instruction = 1 + .3 = 1.3

Instruction fetch Load/store

Mem Stalls per memory access = (1 - H1) \( \times \) M = .015 \( \times \) 50 = .75 cycles

AMAT = 1 + .75 = 1.75 cycles

Mem Stalls per instruction = 1.3 \( \times \) .015 \( \times \) 50 = 0.975

CPI = 1.1 + .975 = 2.075

The ideal memory CPU with no misses is 2.075/1.1 = 1.88 times faster
Cache Performance Example

• Suppose for the previous example we double the clock rate to 400 MHz, how much faster is this machine, assuming similar miss rate, instruction mix?

• Since memory speed is not changed, the miss penalty takes more CPU cycles:

  Miss penalty = M = 50 x 2 = 100 cycles.

  CPI = 1.1 + 1.3 x .015 x 100 = 1.1 + 1.95 = 3.05

  Speedup = (CPI\text{old} x C_{\text{old}})/ (CPI_{\text{new}} x C_{\text{new}})

  = 2.075 x 2 / 3.05 = 1.36

The new machine is only 1.36 times faster rather than 2 times faster due to the increased effect of cache misses.

→ CPUs with higher clock rate, have more cycles per cache miss and more memory impact on CPI.
Cache Performance:

Single Level L1 Harvard (Split) Memory Architecture

For a CPU with separate or split level one (L1) caches for instructions and data (Harvard memory architecture) and no stalls for cache hits:

\[
\text{CPUtime} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}
\]

\[
\text{CPI} = \text{CPI}_{\text{execution}} + \text{Mem Stall cycles per instruction}
\]

\[
\text{Mem Stall cycles per instruction} = \text{Instruction Fetch Miss rate} \times M + \text{Data Memory Accesses Per Instruction} \times \text{Data Miss Rate} \times M
\]

\[
M = \text{Miss Penalty} = \text{stall cycles per access to main memory resulting from missing in cache}
\]
Memory Access Tree
For Separate Level 1 Caches

CPU Memory Access

L1

% Instructions

%data

Instruction

Data

%instructions x Instruction H1

%instructions x (1 - Instruction H1)

%data x Data H1

% data x (1 - Data H1)

Instruction L1 Hit:
Hit Access Time = 1
Stalls = 0

Instruction L1 Miss:
Access Time = M + 1
Stalls Per access = M
Stalls = %instructions x (1 - Instruction H1) x M

Data L1 Hit:
Hit Access Time = 1
Stalls = 0

Data L1 Miss:
Access Time = M + 1
Stalls per access = M
Stalls = %data x (1 - Data H1) x M

Ideal access on a hit

Stall Cycles Per Access = % Instructions x (1 - Instruction H1) x M + % data x (1 - Data H1) x M

AMAT = 1 + Stall Cycles per access

Stall cycles per instruction = (1 + fraction of loads/stores) x Stall Cycles per access

CPI = CPIexecution + Stall cycles per instruction
    = CPIexecution + (1 + fraction of loads/stores) x Stall Cycles per access

M = Miss Penalty = stall cycles per access resulting from missing in cache
M + 1 = Miss Time = Main memory access time
Data H1 = Level 1 Data Hit Rate
1- Data H1 = Level 1 Data Miss Rate
Instruction H1 = Level 1 Instruction Hit Rate
1- Instruction H1 = Level 1 Instruction Miss Rate
% Instructions = Percentage or fraction of instruction fetches out of all memory accesses
% Data = Percentage or fraction of data accesses out of all memory accesses
Cache Performance Example

- Suppose a CPU uses separate level one (L1) caches for instructions and data (Harvard memory architecture) with different miss rates for instruction and data access:
  - A cache hit incurs no stall cycles while a cache miss incurs 200 stall cycles for both memory reads and writes.
  - \( \text{CPI}_{\text{execution}} = 1.1 \) (i.e. base CPI with ideal memory)
  - Instruction mix: 50% arith/logic, 30% load/store, 20% control
  - Assume a cache miss rate of 0.5% for instruction fetch and a cache data miss rate of 6%.
  - A cache hit incurs no stall cycles while a cache miss incurs 200 stall cycles for both memory reads and writes.

- Find the resulting CPI using this cache?

\[
\text{CPI} = \text{CPI}_{\text{execution}} + \text{mem stalls per instruction}
\]

\[
\text{Mem Stall cycles per instruction} = \text{Instruction Fetch Miss rate x Miss Penalty} + \text{Data Memory Accesses Per Instruction x Data Miss Rate x Miss Penalty}
\]

\[
\text{Mem Stall cycles per instruction} = 0.5/100 \times 200 + 0.3 \times 6/100 \times 200 = 1 + 3.6 = 4.6
\]

\[
\text{CPI} = \text{CPI}_{\text{execution}} + \text{mem stalls per instruction} = 1.1 + 4.6 = 5.7
\]

- What is the miss rate of a single level unified cache that has the same performance?

\[
4.6 = 1.3 \times \text{Miss rate} \times 200 \quad \text{which gives a miss rate of 1.8% for an equivalent unified cache}
\]

- How much faster is the CPU with ideal memory?

The CPU with ideal cache (no misses) is \( 5.7/1.1 = 5.18 \) times faster

With no cache the CPI would have been \( 1.1 + 1.3 \times 200 = 261.1 \)!!
Memory Access Tree For Separate Level 1 Caches Example

30% of all instructions executed are loads/stores, thus:
Fraction of instruction fetches out of all memory accesses = 1/ (1+0.3) = 1/1.3 = 0.769 or 76.9 %
Fraction of data accesses out of all memory accesses = 0.3/ (1+0.3) = 0.3/1.3 = 0.231 or 23.1 %

CPU Memory Access

L1

%instructions x Instruction H1
= .765 or 76.5 %

Instruction L1 Hit:
Hit Access Time = 1
Stalls = 0

Ideal access on a hit

%Instructions = 0.769 or 76.9 %

Instruction L1 Miss:
Access Time = M + 1 = 201
Stalls Per access = M = 200
Stalls = %instructions x (1 - Instruction H1 ) x M
= 0.003846 x 200 = 0.7692 cycles

Stall Cycles Per Access = % Instructions x (1 - Instruction H1 ) x M + % data x (1 - Data H1 ) x M
= 0.7692 + 2.769 = 3.54 cycles

AMAT = 1 + Stall Cycles per access = 1 + 3.5 = 4.54 cycles

Stall cycles per instruction = (1 + fraction of loads/stores) x Stall Cycles per access = 1.3 x 3.54 = 4.6 cycles

CPI = CPI_{execution} + Stall cycles per instruction = 1.1 + 4.6 = 5.7

Data

%data x (1 - Data H1 )
= 0.01385 or 1.385 %

Data L1 Hit:
Hit Access Time: = 1
Stalls = 0

Ideal access on a hit

%data = Percentage or fraction of data accesses out of all memory accesses = 23.1 %

100%

% data = 0.231 or 23.1 %

% instructions x (1 - Instruction H1)
= 0.2169 or 21.69 %

Data L1 Miss:
Access Time = M + 1 = 201
Stalls per access: M = 200
Stalls = %data x (1 - Data H1 ) x M
= 0.01385 x 200 = 2.769 cycles

M = Miss Penalty = stall cycles per access resulting from missing in cache = 200 cycles
M + 1 = Miss Time = Main memory access time = 200+1 =201 cycles
L1 access Time = 1 cycle
Data H1 = 0.94 or 94%
1- Data H1 = 0.06 or 6%
Instruction H1 = 0.995 or 99.5%
1- Instruction H1 = 0.005 or 0.5%
% Instructions = Percentage or fraction of instruction fetches out of all memory accesses = 76.9 %
% Data = Percentage or fraction of data accesses out of all memory accesses = 23.1 %