(REVISED 21-10-2024)

| SURNAME |  |
|---------|--|
|         |  |

1) (POINTS 35/40) Consider a **triple-dispatch** (3 instructions per cycle) processor using Tomasulo's algorithm to perform the dynamic scheduling of instructions on the pipeline shown in the following figure. This pipeline is executing the following program, which performs a search within a vector (initially, R1=0).

```
etic: LW R2, 0(R1) ; read Xi
ADDI R2, R2, 3 ; adds 3 to Xi
SW R2, 0(R1) ; write Xi
ADDI R1, R1, 4 ; update R1
BNE R2, R0, etic ; continue to loop if false
```



FIRST NAME

## Working hypothesis:

- the loop executes speculatively in terms of direction (always taken) and regarding the branch condition; high-performance fetch breaks after fetching a branch
- the issue stage (I) calculates the address of the actual read/write and push it into load/store queues; only 1 instruction is issued per cycle
- reads require 5 clock cycles; writes take 1 cycle
- when accessing memory (M), writes have precedence over reads and must be executed in-order
- there is a single CDB
- dispatch stage (P) and complete stage (W) require 1 clock cycle
- ASSUME that the reservation stations could be freed right before the start of issue phase (therefore extending the duration of P stage)
- only 1 instruction is committed (C stage) per cycle
- there are separated integer units: one for the calculation of the actual address, one for arithmetic and logical operations, one of the integer multiplication and one for the evaluation of the branch condition, as illustrated in this table:

| Type of Functional Unit       | No. of Functional Units | Cycles for stage I+X | No. of reservation stations |
|-------------------------------|-------------------------|----------------------|-----------------------------|
| LS: Integer (effective addr.) | 1                       | 1                    | 2                           |
| A: Integer (op. A-L)          | 1                       | 1                    | 2                           |
| B: Integer (branch calc.)     | 1                       | 1                    | 2                           |

- the functional units do not take advantage of pipelining techniques internally
- the load queue has 3 slots; the store queue has 3 slots (writes wait for the operand in the store queue, i.e., in the execution stage)

Complete the following chart until the end of the FOURTH iteration of the above code fragment in the case of dynamic scheduling with speculation. Also add the instruction that occupies a certain reservation station (one of the 6) as indicated:

| Instr.<br>No | Instruction name | ALU<br>RS1 | ALU<br>RS2 | LS<br>RS1  | LS<br>RS2 | BU<br>RS1 | BU<br>RS2 | P: disPatch<br>(clock) | I+X:Issue+Exec<br>(clock) | M: MEM.ACCESS<br>(start-stop) | W: CDB-write<br>(clock) | C: Commit<br>(clock) | Comments |
|--------------|------------------|------------|------------|------------|-----------|-----------|-----------|------------------------|---------------------------|-------------------------------|-------------------------|----------------------|----------|
| 101          | LW R2,0(R1)      |            |            | I01<br>1-1 |           |           |           | 1                      | 2                         | 3-7                           | 8                       | 9                    |          |
|              |                  |            |            |            |           |           |           |                        |                           |                               |                         |                      |          |
|              |                  |            |            |            |           |           |           |                        |                           |                               |                         |                      |          |

1) (5/40) Explain what is (i) an anti-dependency and (ii) an output-dependency and give an assembly example for each one of the two cases in the code of the question (1).

1)



- 2) Given two subsequent instructions I and J:
  - Antidependency: I reads a register that is written by J
  - Output-dependency: both I and J write in the same register

## In the case of our code:

• Antidependency:

I05: bne R2 R0, etic I06: lw R2, 0(R1)

I05: bne R2 R0, etic I07: addi R2, R2, 1

• Output-dependency:

I01: lv R2, 0(R1) I02: addi R2, R2, 1

I04: addi R1 R1, 4 I09: addi R1 R1, 4

I02: addi R2, R2, 1 I07: addi R2 R2, 1