## HIGH PERFORMANCE COMPUTER ARCHITECTURE 20-01-2010 (FORMER "CALCOLATORI ELETTRONICI 2") **REVISED 16-10-202**4

## MATRICULATION NO.

SURNAME FIRST NAME



- reads require 1 clock cycle; writes require 1 clock cycles
- when accessing memory (M), writes have precedence over reads and must be executed in-order
- · there is a single CDB
- dispatch stage (D) and complete stage (C) require 1 clock cycle
- there are separated integer units for the calculation of the actual address, for arithmetic and logical operations, for the evaluation of the branch condition
- the functional units do not take advantage of pipelining techniques internally (reservation stations are busy until the end of CDB-write, except for Stores)
- the load buffer has 5 slots
- the store queue has 5 slots (writes wait for the operand in the store queue, i.e. in the execution stage)
- · the rest of the processor and has the following characteristics

| Type of Functional Unit   | No. of Functional Units | Cycles for stage I+X | No. of reservation stations |
|---------------------------|-------------------------|----------------------|-----------------------------|
| Integer (effective addr.) | 1                       | 1                    | 2                           |
| Integer (op. A-L)         | 1                       | 1                    | 2                           |
| Integer (branch calc.)    | 1                       | 1                    | 2                           |
| FP Adder                  | 1                       | 4                    | 3                           |
| FP Multiplier             | 1                       | 8                    | 3                           |
| FP Divider                | 1                       | 15                   | 2                           |

Complete the following chart until the end of the third iteration of the code fragment above in the case of simple dynamic scheduling.

| Iter. | Instruction    | P disPatch<br>(start-stop) | I+X Issue<br>(start-stop) | M MEM<br>ACCESS (clock) | W CDB-Write<br>(Complete) (clock) | C Commit<br>(clock) | Comments |
|-------|----------------|----------------------------|---------------------------|-------------------------|-----------------------------------|---------------------|----------|
| 1     | L.D F2,0(R1)   |                            | (start stop)              | 2                       | (complete) (clock)                | (elock)             |          |
|       | L.D F2,0(RI)   | 1-4                        | 2                         | 3                       | 4                                 | 3                   |          |
| 1     | MUL.D F4,F2,F0 | 1-13                       | 5-12                      |                         | 13                                | 14                  |          |
| 1     | L.D F6,400(R1) |                            |                           |                         |                                   |                     |          |

| 2) (POINTS 17/40) The test-and-set method is the simplest synchronization mechanism and it        | P0 Jale                          | P1 date                          | P15                              |
|---------------------------------------------------------------------------------------------------|----------------------------------|----------------------------------|----------------------------------|
| is available in the large majority of commercial shared-memory machines. Such mechanism           | tence st tab                     | 18/10 25 189                     | terce state                      |
| is based on the atomic exchange operation <b>EXCH</b> that consists in loading the old value at a | Cone Addre Data                  | Cone Pigne Days                  | Come Addre Data                  |
| given address and store into the same address a new value. The "lock" mechanism is in turn        | B0 I 100 00 10                   | B0 I 100 00 10                   | ••• B0 S 120 00 20               |
| implemented upon such atomic operation by spinning on a specific memory address until the         | B1 S 108 00 08                   | B1 M 128 00 68                   | B1 S 108 00 08                   |
| lock is open (the returned value is a zero, meaning "unlocked", instead of a one meaning          | B2 M 110 00 30<br>B3 I 118 00 10 | B2 I 110 00 10<br>B3 S 118 00 18 | B2 I 110 00 10<br>B3 I 118 00 10 |
| "locked"). The following code represent a possible implementation:                                |                                  |                                  |                                  |
| LOCK CODE:                                                                                        | Ţ                                | Ţ                                |                                  |
| tas: ADDI R2, R0, 1                                                                               |                                  | 1                                |                                  |
| lockit: EXCH R2, 0(R1)                                                                            |                                  | Memory Address Data              |                                  |
| BNE R2, R0, lockit                                                                                |                                  |                                  |                                  |
| UNLOCK CODE:                                                                                      |                                  | 100 00 00                        |                                  |
| unlock: SW R0,0(R1)                                                                               |                                  | 108 00 08                        |                                  |
| Let's consider a situation in which three processors (P0, P1, P15) that try to lock the address   |                                  | 110 00 10<br>118 00 18           |                                  |
| 0x100 in a machine having 16 processors. Assume an MSI coherence protocol and the cache           |                                  | 120 00 20                        |                                  |
| contents represented in figure. The bus-transaction costs are:                                    |                                  | 128 00 28                        |                                  |
| • Creadblk=100, Ccache-to-cache=70, Cinvalidate=15, Cwrite-back=10.                               |                                  | 130 00 30                        |                                  |
| For the sake of simplicity, assume also that the critical sections last 1000 cycles.              |                                  |                                  |                                  |

Assuming that the processors acquire the lock in the order  $P0 \rightarrow P1 \rightarrow P15$  and given the initial situation of caches and memory represented above, calculate: a) how many bus transactions are there; b) how many memory stall cycles for each of the processors are necessary before acquiring the lock.

3) (POINTS 6/40) Calculate the PARALLELISM, by using WORK e SPAN, for the following Cilk implementation of the recursive Fibonacci code in case of n=5.

int fib(int n) { if (n < 2) return; int x, y; x = cilk\_spawn fib(n-1); y = fib(n-2);cilk\_sync; return x+y;

## HIGH PERFORMANCE COMPUTER ARCHITECTURE 20-01-2010 SOLUTION REVISED 16-10-2024

## EXERCIZE 1:

| Iter. | Instruction    | P: Dispatch<br>(start-stop) | I+X:<br>Issue+Exec  | M: MEM<br>ACCESS | W: CDB-write<br>(Complete)<br>(clock) | C: Commit<br>(clock) | Comments                                                      |
|-------|----------------|-----------------------------|---------------------|------------------|---------------------------------------|----------------------|---------------------------------------------------------------|
| 1     | L.D F2,0(R1)   | 1-4                         | (start-stop)<br>2-2 | (clock)<br>3     | 4                                     | 5                    |                                                               |
| 1     | MUL.D F4,F2,F0 | 1-13                        | 5-12                | 1                |                                       | 14                   | I waits F2 from 1/L.D                                         |
| 1     | L.D F6,400(R   | L) 2-5                      | 3-3                 | .4               | $\mathbf{5}$                          | 15                   |                                                               |
| 1     | DIV.D F6,F4,F6 | 2-29                        | 14-28               | D .              | (29)                                  | 30                   | I waits F4 from 1/MUL.D                                       |
| 1     | S.D F6,400(R   | L) <mark>5-6</mark>         | 6-6                 | 30               |                                       | 31                   | P waits LS-RS,M waits F6 from 1/DIV.D                         |
| 1     | ADDI R1,R1,8   | <mark>5-7</mark>            | <mark>6-6</mark>    |                  |                                       | 32                   |                                                               |
| 1     | SGTI R3,R1,80  | o <mark>6-9</mark>          | <mark>8-8</mark> <  | 4                | (9)                                   | 33                   | I waits R1 from 1/ADDI                                        |
| 1     | BEQ R3,R0,et:  | ic <mark>6-10</mark>        | <mark>10-10</mark>  |                  |                                       | 34                   | I waits R3 from 1/SGTI                                        |
| 2     | L.D F2,0(R1)   | 7-10                        | ) <mark>8-8</mark>  | 9                | 10                                    | 35                   |                                                               |
| 2     | MUL.D F4,F2,F0 | <mark>7-21</mark>           | 13-20               | 5                | 21                                    | 36                   | I waits F2 from 2/L.D & MUL-FU avail.                         |
| 2     | L.D F6,400(R   | L) <mark>8-11</mark>        | <mark>9-9</mark>    | 10               | 11                                    | 37                   |                                                               |
| 2     | DIV.D F6,F4,F6 | ( <mark>8-44</mark>         | 29-43               | 7                | (44)                                  | 45                   | I waits F4 from 1/MUL.D                                       |
|       |                |                             |                     |                  |                                       |                      | & DIV-FU avail.                                               |
| 2     | S.D F6,400(R   | L) <mark>11-12</mark>       | 12-1 <mark>2</mark> |                  |                                       | 46                   | P waits LS-RS,M waits F6 from 2/DIV.D                         |
| 2     | ADDI R1,R1,8   | <mark>11-14</mark>          | <mark>12-1</mark> 2 | 1                |                                       | 47                   | CDB waits bus avail.                                          |
| 2     | SGTI R3,R1,80  | ) <u>12-16</u>              | 15-15               | Ϊ                |                                       | 48                   | I waits R1 from 2/ADDI                                        |
| 2     | BEQ R3,R0,et:  | ic <mark>12-17</mark>       | 17-17               |                  |                                       | 49                   | I waits R3 from 2/SGTI                                        |
| 3     | L.D F2,0(R1)   | <mark>13-17</mark>          | 15-15               | 16               | <u>.17</u>                            | 50                   | CDB waits bus avail.                                          |
| 3     | MUL.D F4,F2,F0 | <mark>13-30</mark>          | 21-28               |                  | 30                                    | 51                   | I waits F2 from 2/L.D & MUL-FU avail.<br>CDB waits bus avail. |
| 3     | L.D F6,400(R   | L) <mark>14-1</mark> 8      | <mark>16-1</mark> 5 | <mark>17</mark>  | <mark>18</mark>                       | 52                   | LS-FU avail.                                                  |
| 3     | DIV.D F6,F4,F6 | 30-59                       | 44-58               |                  | 59                                    | 60                   | P waits DIV-RS available,                                     |
|       |                |                             |                     | K                |                                       |                      | I waits DIV-FU avail.                                         |
| 3     | S.D F6,400(R   | L) <mark>30-</mark> 31      | <mark>31-31</mark>  | <mark>60</mark>  |                                       | 61                   | M waits F6 from 3/DIV.D                                       |
| 3     | ADDI R1,R1,8   | 31-33                       | 32-32               |                  | 33                                    | 62                   |                                                               |
| 3     | SGTI R3,R1,80  | 51-55                       | 34-34               |                  | 35                                    | 63                   | I waits R1 from 3/ADDI                                        |
| 3     | BEQ R3,R0,et:  | ie 32-36                    | 36-36               |                  | $\sim$                                | 64                   | I waits R3 from 3/SGTI                                        |