# Lista 8, 14 grudnia 2021
###### tags: `ak21` `ćwiczenia`
## Deklaracje
Gotowość rozwiązania zadania należy wyrazić poprzez postawienie X w odpowiedniej kolumnie! Jeśli pożądasz zreferować dane zadanie w trakcie dyskusji (co najwyżej jedno!) oznacz je znakiem ==X== na żółtym tle.
**UWAGA: Tabelkę wolno edytować tylko wtedy, gdy jest na zielonym tle!**
:::danger
| Imię i nazwisko | 5.9 |5.11 | 6.1 | 6.2 | 6.3 | 6.4 |
| ---------------------:| --- | --- | --- | --- | --- | --- |
| Patrycja Balik | X | X | X | X | X | X | 6
| Michał Błaszczyk | X | X | X | . | | . | 4
| Zbigniew Drozd | | | | | | | 0
| Maciej Dudek | X | X | X | X | X | X | 6
| Bogusz Kaszowski | | | | | | | 0
| Aleksandra Kosińska | | | X | X | X | X | 4
| Jakub Marcinkowski | X | X | | | | | 2
| Kuba Nowak | X | X | X | X | X | X | 6
| Antoni Nowak | X | X | | | | | 2
| Krzysztof Obłonczek | X |==X==| X | X | X | X | 6
| Michał Opanowicz | X | X | X | X | | | 4
| Wiktor Pilarczyk | X | X | . | . | . | . | 4
| Cezary Stajszczyk | | | X | X | X | X | 4
| Bartosz Szpila | X | X | X | X | | | 4
| Gabriel Sztorc | | | | | | |
| Jakub Urbańczyk | X | X | . | . | . | . | 4
| Konrad Werbliński | X | X | X | X | | | 4
| Franciszek Zdobylak | X | X | . | . | . | | 3.5
:::
## Zadanie 5.9
:::info
Autor: Antoni Nowak
:::

| Cycle | Memory | Integer | Control | Float |
| :--------: | :--------: | :--------: | :---: | :--: |
|1| 1 LD R0,[R0] | | | |
|2| 2 LD R1,[R1] | | | |
|3| 5 LD R6,[R2] | | | |
|4| 6 Ld R2,[R0] | | | |
|5| | 3 IADD R4,R0,R1 |||
|6| ||| 7 FADD R3,R1,R6|
|7| | 8 IADD R4,R2,R4 || 4 FADD R5,R0,R4 |
|8| | 10 IADD R0,R6,R2 |||
|9| | | | |
|10|| 11 IADD R0,R0,R3 |||
|11|| 9 IADD R5,R5,R4 |||
|12|||||
|13||| 12 BEQ LOOP,R0,R5||
$$
\frac{12}{13*4}=\frac{3}{13}=23%
$$
## Zadanie 5.11
:::info
Autor: Krzysztof Obłonczek
:::
https://webwhiteboard.com/board/3SE2MjS8iia50cL2ZKelgaXHrh0WbXs8/


a) c0 - 1 cykl
c1 - 4 cykl
c2 - 5 cykl
c3 - 6 cykl
c4 - 7 cykl
c5 - 8 cykl
c6 - 9 cykl
c7 - 10 cykl
b) 3 + 3N = 3(N+1)

## Zadanie 6.1
:::info
Autor: Aleksandra Kosińska
:::
```c=
for (i = 0; i < 100; i ++)
A[i] = ((B[i] * C[i]) + D[i])/2;
```
(a) Translate this code into assembly language using the following instructions in the ISA (note the number of cycles each instruction takes is shown next to each instruction):

Assume one memory location is required to store each element of the array. Also assume that there are eight register, R0 to R7.
Condition codes are set after the execution of an arithmetic instruction. You can assume typically available condition codes such as zero (EQZ), positive (GTZ), negative (LTZ), non-negative (GEZ), and non-positive (LEZ).
```
MOVI R1 , 99
LEA R0 , A
LEA R2 , B
LEA R3 , C
LEA R4 , D
LOOP :
LD R5 , R2 , R1
LD R6 , R3 , R1
MUL R7 , R5 , R6
LD R5 , R4 , R1
ADD R6 , R7 , R5
RSHFA R7 , R6 , 1
ST R7 , R0 , R1
ADD R1 , R1 , -1
BRGEZ R1 LOOP
```
How many cycles does it take to execute the program?
```
przed LOOP: 1 + 1 + 1 + 1 + 1 = 5
LOOP : 11 + 11 + 6 + 11 + 4 + 1 + 11 + 4 + 1 = 60
5 + 100 * 60 = 6005
```
(b) Now write Cray-like vector assembly code with the minimum number of instructions. Assume that there are eight vector registers and the length of each vector register is 64. Use the following instructions in the vector ISA:

```
SET Vln , 50
SET Vst , 1
VLD V1 , B
VLD V2 , C
VMUL V4 , V1 , V2
VLD V3 , D
VADD V6 , V4 , V3
VRSHFA V7 , V6 , 1
VST V7 , A
VLD V1 , B + 50
VLD V2 , C + 50
VMUL V4 , V1 , V2
VLD V3 , D + 50
VADD V6 , V4 , V3
VRSHFA V7 , V6 , 1
VST V7 , A + 50
```
How many cycles does it take to execute the program on the following processors? Assume that memory is 16-way interleaved.
**Vector processor without chaining, 1 port to memory (1 load or store per cycle):**
```
SET | 1 |
SET | 1 |
VLD | 11 | 49 |
VLD | 11 | 49 |
VMUL | 6 | 49 |
VLD | 11 | 49 |
VADD | 4 | 49 |
VRSHFA | 1 | 49 |
VST | 11 | 49 |
1 + 1 + 11 + 49 + 11 + 49 + 1 + 11 + 49 + 4 + 49 + 1 + 49 + 11 + 49 = 346
11 + 49 + 11 + 49 + 1 + 11 + 49 + 4 + 49 + 1 + 49 + 11 + 49 = 344
346 + 344 = 690
```
**Vector processor with chaining, 1 port to memory:**
```
SET | 1 |
SET | 1 |
VLD | 11 | 49 |
VLD | 11 | 49 |
VMUL | 6 | 49 |
VLD | 11 | 49 |
VADD | 4 | 49 |
VRSHFA | 1 | 49 |
VST | 11 | 49 |
240 + 242 = 482
```
**Vector processor with chaining, 2 read ports and 1 write port to memory:**
```
SET |1|
SET |1|
VLD |11| 49 |
VLD |1|11| 49 |
VMUL |6| 49 |
VLD |11| 49 |
VADD |4| 49 |
VRSHFA |1| 49 |
VST |11| 49 |
VLD |1|11| 49 |
VLD |11| 49 |
VMUL |6| 49 |
VLD |11| 49 |
VADD |4| 49 |
VRSHFA |1| 49 |
VST |11| 49 |
1 + 1 + 11 + 49 + 11 + 4 + 1 + 1 + 11 + 49 + 11 + 4 + 1 + 11 + 49 = 215
```
## Zadanie 6.2
:::info
Autor: Cezary Stajszczyk
:::
**a)** What is the minimum power-of-two number of banks required in order for memory accesses to never stall? (Assume a vector stride of 1.)
```
64
```
**b)** The machine (with as many banks as you found in part a) executes the following program (assume that the vector stride is set to 1):
```
VLD V1, A
VLD V2, B
VADD V3, V1, V2
VMUL V4, V3, V1
VRSHFA V5, V4, 2
```
It takes 111 cycles to execute this program. What is the vector length L (i.e., the number of elements in a vector)?
```
|---- 50 ------|--- (L-1) ----|
|1|---- 50 ------|--- (L-1) ----|
|- 4 -|--- (L-1) ----|
|- 16 -|--- (L-1) ----|
|1|--- (L-1) ----|
1 + 50 + 4 + 16 + 1 + (L − 1) = 71 + L = 111
L = 40
```
If the machine did not support chaining (but could still pipeline independent operations), how many cycles would be required to execute the same program?
```
|---- 50 ------|--- (L-1) ----|
|1|---- 50 ------|--- (L-1) ----|
|- 4 -|--- (L-1) ----|
|- 16 -|--- (L-1) ----|
|1|--- (L-1) ----|
(L-1)*4 + 1 + 50 + 4 + 16 + 1 = 228 cycles
```
**c)** The architect of this machine decides that she needs to cut costs in the machine’s memory system. She reduces the number of banks by a factor of 2 from the number of banks you found in part (a) above. Because loads and stores might stall due to bank contention, an arbiter is added to each bank so that pending loads from the oldest instruction are serviced first. How many cycles does the program take to execute on the machine with this reduced-cost memory system (but with chaining)?

```
VLD:
|---- 50 ------|
|1|---- 50 ------|
|1|1|---- 50 ------|
...
|---- 31 ----|---- 50 ------|
|---- 50 ------|
|1|---- 50 ------|
|1|1|---- 50 ------|
...
|- 7 -|---- 50 ------|
VLD:
|1|---- 50 ------|
|1|1|---- 50 ------|
|1|1|1|---- 50 ------|
...
|---- 32 ----|---- 50 ------|
|---- 50 ------|
|1|---- 50 ------|
|1|1|---- 50 ------|
...
|- 7 -|---- 50 ------|
VADD |- 4 -|--- (L-1) ----|
VMUL |-- 16 --|--- (L-1) ----|
VRSHFA |1|--- (L-1) ----|
(1 + 2*50 + 7) + 4 + 16 + 1 + (L-1) = 168 cycles
{ load }
```
Now, the architect reduces cost further by reducing the number of memory banks (to a lower power of 2). The program executes in 279 cycles. How many banks are in the system?

16 banks:
```
VLD:
|---- 50 ------|
|1|---- 50 ------|
|1|1|---- 50 ------|
...
|--- 15 ---|---- 50 ------|
|---- 50 ------|
|1|---- 50 ------|
|1|1|---- 50 ------|
...
|--- 15 ---|---- 50 ------|
|---- 50 ----|
|1|---- 50 ----|
|1|1|---- 50 ----|
...
|- 7 -|---- 50 ----|
VLD:
|1|---- 50 ------|
|1|1|---- 50 ------|
|1|1|1|---- 50 ------|
...
|--- 16 ---|---- 50 ------|
|---- 50 ------|
|1|---- 50 ------|
|1|1|---- 50 ------|
...
|--- 15 ---|---- 50 ------|
|---- 50 ----|
|1|---- 50 ----|
|1|1|---- 50 ----|
...
|- 7 -|---- 50 ----|
VADD: |- 4 -|--- (L-1) ----|
VMUL: |-- 16 --|--- (L-1) ----|
VRSHFA: |1|--- (L-1) ----|
(1 + 3*50 + 7) + 4 + 16 + 1 + (L-1) = 218 cycles
```
**d)** Another architect is now designing the second generation of this vector computer. He wants to build a multicore machine in which 4 vector processors share the same memory system. He scales up the number of banks by 4 in order to match the memory system bandwidth to the new demand. However, when he simulates this new machine design with a separate vector program running on every core, he finds that the average execution time is longer than if each individual program ran on the original single-core system with 1/4 the banks. Why could this be? Provide concrete reason(s).
```
Conflicts, ports number
```
## Zadanie 6.3
:::info
Autor: Maciej Dudek
:::
a)
VLD ma poóźnienie 100 instrukcji. Przy stride-dzie 1, minimalną wartością zmiennej N jest 100, ponieważ pierwszy doczyt wląduje w banku A, następny w (A+1)%N. Odczyty od 1 do 100 trafią do różnych banków, natomiast odczyt 101 trafi spowrotem do banku A, on już skończył obsługiwanie odczytu 1, więc jest gotowy na 101.
b)
Nie musimy brać 2 razy więcej banków, wystarczy, że weźmiemy ich conajmniej 100 i jednocześnie względnie pierwsze z 2. Dla takiego warunku N = 101 jest dobrym wyborem.
c)
Z założenia zadania mamy, że operacje odczytu i zapisu się nie blokują, oraz jednostki nie posiadają hcain-owania, tak więc całkowity czas wykonania jest sumą czasu spędzonego na każdej z instrukji.
```
[ 100 | M - 1 ]
[ M ]
[ 5 | M - 1 ]
[ 10 | M - 1 ]
[ 100 | M - 1]
```
$$
4306 = 100 + 2M - 1 + 5 + M - 1 + 10 + M - 1 + 100 + M - 1, \\
4306 = 5M + 211, \\
819 = M
$$
d)
```
[ 100 | M - 1 ]
[ M ]
[ 5 | M - 1 ]
[ 10 | M - 1 ]
[ 100 | M - 1]
```
$$
T = 3M + 99 + 5 + 10 + 99 \\
T = 3M + 213 \\
T = 2883
$$
## Zadanie 6.4
:::info
Autor: Patrycja Balik
:::
> A vector processor includes the following instructions in its ISA:
> 
> Assume that:
> * The machine has an in-order pipeline.
> * Each vector element is 64 bits in size.
> * In order to support 1-element per cycle memory throughput for vector elements, after the first element in a vector, the machine interleaves vector elements across memory banks. All vectors are stored in memory with the first element mapped to bank 0, the second element mapped to bank 1, etc.
> * Memory accesses within a vectorized memory request must be issued in order.
> * Each memory bank has two ports (so that two loads/stores can be active simultaneously), and there are two load/store functional units available.
> a) The number of memory banks in this vector processor is a power of two. What should the minimum
number of banks be to avoid stalls while executing a VLD or VST instruction, assuming a vector stride
of 1? Explain.
Chcemy $2^k \geq 60$. Zatem $2^k = 64$.
> b) Translate the following piece of code, called SAXPY, into vector code that executes in the minimum possible number of cycles on this vector processor. Constant alpha is stored in vector register V1. (Note: Assume vector length register and vector stride register are already introduced to appropriate values.)
```
for (i = 0; i < VLEN; i++) {
temp = alpha * X[i];
Z[i] = temp + Y[i];
}
```
```
vld v2, X
vld v3, Y
vmul v4, v1, v2
vadd v5, v4, v3
vst v5, Z
```
> c) Consider a machine with as many banks as you found in part (a) executes the program you wrote in part (b). Assume that the vector stride is set to 1. The machine does not support chaining between vector functional units. If the machine takes 340 cycles to finish the program, what is the vector length? Show your work.
```
[ 60 | VLEN - 1 ]
[ 60 | VLEN - 1 ]
[ 16 | VLEN - 1 ]
[ 8 | VLEN - 1 ]
[ 60 | VLEN - 1]
```
$$
340 = 60 + \texttt{VLEN} - 1 + 16 + \texttt{VLEN} - 1 + 8 + \texttt{VLEN} - 1 + 60 + \texttt{VLEN} - 1, \\
340 = 4\texttt{VLEN} + 140, \\
\texttt{VLEN} = 50.
$$
> d) The architect of the machine improves the design to support chaining. For the VLEN you found in the previous part, what is the total number of cycles for the same program? Show your work.
```
[ 60 | VLEN - 1 ]
[ 60 | VLEN - 1 ]
[ 16 | VLEN - 1 ]
[ 8 | VLEN - 1 ]
[ 60 | VLEN - 1]
```
$$
60 + \max(16 + 8 + 60 + \texttt{VLEN} - 1, 49 + 60 + \texttt{VLEN} - 1) = 218.
$$
> e) Now the architect decides that she needs to cut costs in the machine’s memory system by reducing the number of banks. Because loads and stores might stall due to bank contention, an arbiter is added to each bank so that pending loads from the oldest instruction are serviced first. She observes the program
now takes 481 cycles (with chaining). What is the new number of banks? Show your work. (Note: Recall the number of banks is a power of two.)
Możliwości: 1, 2, 4, 8, 16, 32.
Elementy wektorów numerujemy od 0: X0, X1, ...
Dla 32 będzie nam przykro gdy operacje `load`/`store` będą przy 32 elemencie zabierać się za bank #0. 32 daje się wykluczyć, bo wychodzi zdecydowanie za mało cykli.
Dla 16 jest smutno w elementach 16, 32, 48, i mamy coś takiego:
```
X0-15, Y0-15 żyją normalnie.
[ 60 | 15 ]
[ 60 | 15 ]
X16 nie może się władować do banku 0 tak łatwo.
[ 60 | 15 ]
|[ 60 | 15 ]
X32 |
| [ 60 | 15 ]
| |[ 60 | 15 ]
X48 | | |
| | [ 60 | 1 ]
| | |[ 60 | 1 ]
MUL [ 16 | 15 ] |
[ 16 | 15 ]
[ 16 | 15 ]
[ 16 | 1 ]
ADD [ 8 | 15 ] |
[ 8 | 15 ]
[ 8 | 15 ]
| [ 8 | 1 ]
ST [ 60 | 15 ]
[ 60 | 15 ]
[ 60 | 15 ]
[ 60 | 1 ]
```
Wychodzi
$$
(60 + 60 + 60 + 60) + (60 + 60 + 60 + 60 + 1) = 481.
$$