# CG2028 Lab Assignment Report
#### Team B02-1: Fu Tianyuan, A0177377M and Zhang Yihan, A0177861R
## Abstract
Classification is a common technique in engineering, and the performance of a classification algorithm can be evaluated by the probability of detection for each class m, abbreviated "pdm". In this project, we implemented a routine in ARM assembly that computes the pdm values on the NXP LPC1769 (ARM Cortex-M3) SoC on the LPCXpresso board.
## Algorithm Description
The algorithm computes the pdm from a confusion matrix CM, whose (i, j) entry is the number of patterns whose actual class is i and that were classified as class j. The diagonal entry (m, m) is therefore the number of patterns of class m that were classified correctly, and the pdm of class m is that entry divided by the sum of row m. The pseudocode of the algorithm is as follows:
```cal
pdm(confusion_matrix[M][M]) {
    For each row in the matrix {
        Var sum = 0 // initialise the row sum to 0
        For each element in the row {
            Add the value of the element to sum
        } // sum now holds the total of all elements in this row
        Var pdm = (diagonal element of this row) / sum
        Print pdm
    }
}
```
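
As a cross-check for the pseudocode, the following is a minimal C sketch of the same computation in floating point. The function name `pdm_reference` and the fixed size `M = 4` are assumptions made for illustration; they are not part of the assignment code.

```c
#define M 4  /* number of classes; assumed here for illustration */

/* Reference computation: pd[i] = CM[i][i] / (sum of row i).
 * pdm_reference is a hypothetical name, not the assembly routine. */
void pdm_reference(const int cm[M][M], double pd[M])
{
    for (int i = 0; i < M; i++) {
        int sum = 0;                      /* row sum starts at 0 */
        for (int j = 0; j < M; j++) {
            sum += cm[i][j];              /* add every element of row i */
        }
        /* Assumes each row contains at least one instance (sum != 0). */
        pd[i] = (double)cm[i][i] / (double)sum;
    }
}
```

For a made-up row such as `{8, 1, 1, 0}` in row 0, the sketch gives 8 / 10 = 0.8 for class 0.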
## Algorithm Implementation
#### Notations used in our implementation:
> Called from the C function as: `pdm(M, (int*)pd, (int*)CM, scale);`

- R0: M = 4
- R1: address of pd
- R2: address of CM
- R3: scale = 100000
- R4: row_i, row index
- R5: col_i, column index
- R6: summ, sum per row
- R7: CM element
- R8: division intermediate result
- R9: number of correctly classified patterns
#### The implementation of the pdm function in ARM assembly language:
```assembly
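pdm:
    PUSH {R0, R1, R2, R3, R4, R5, R6, R7, R8, R9}  @ save every register used below
    mov R4, #0                  @ row_i = 0
outerloop:
    cmp R4, R0                  @ all M rows processed?
    bge done
    mov R5, #0                  @ col_i = 0
    mov R6, #0                  @ summ = 0
innerloop:
    cmp R5, R0                  @ end of the current row?
    bge innerloopdone
    @ accumulate the row sum
    ldr R7, [R2], #4            @ R7 = next CM element, then R2 += 4 (post-indexed)
    add R6, R6, R7              @ summ += element
    @ remember the diagonal element
    cmp R4, R5                  @ row_i == col_i?
    IT EQ
    moveq R9, R7                @ R9 = number of correctly classified patterns
    add R5, R5, #1              @ col_i++
    b innerloop
innerloopdone:
    mul R9, R9, R3              @ R9 = correct * scale
    sdiv R8, R9, R6             @ R8 = (correct * scale) / summ
    str R8, [R1]                @ store the scaled pdm of this row
    add R1, R1, #4              @ advance the pd pointer by one word
    add R4, R4, #1              @ row_i++
    b outerloop
done:
    POP {R0, R1, R2, R3, R4, R5, R6, R7, R8, R9}   @ restore the saved registers
    BX LR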
```
#### Brief description of implementation
At the beginning of the routine, we push the registers that we use onto the stack, and we pop them at the end so that their original values are restored. This is needed because the caller (the C driver) expects R4 to R11 to be unchanged across a function call under the ARM procedure call standard; overwriting them without restoring would corrupt the caller's state after we return. R0 to R3 are caller-saved, so saving them is not strictly required, but it does no harm.
The basic structure of the implementation is a two-level nested loop: `outerloop` iterates over the rows and `innerloop` over the columns of a row. At the top of each loop, we compare the index (R4 for the row, R5 for the column) with M in R0, and the conditional branch `bge` leaves the loop once the index reaches M.
At the beginning of each outer-loop iteration, we set R5 (the column index) and R6 (the running sum of the current row) to 0 for the inner loop. When the inner loop finishes, R6 holds the sum of all elements in the row. The Cortex-M3 has no floating-point unit and we use integer instructions only, so to keep the digits after the decimal point we scale the value by a factor before dividing. In our case the factor is 100000, which is passed in R3. We multiply R9 (the number of correctly classified patterns) by R3, store the product in R9, and divide it by R6 (the row sum), which yields the scaled probability of detection. The conversion back to the unscaled value is done in the C driver program. We then store the pdm of the row at the address in R1. Because we use a plain `str` without write-back, we add 4 to R1 afterwards (one 32-bit word is 4 bytes) so that it points to the next element of pd. Finally, we add 1 to the row index for the next iteration.
In the inner loop, we first compare the column index (R5) with M and use `bge` to decide whether to leave the loop. R2 holds the address of the confusion matrix, which is stored row by row in memory. We load each element into R7 using post-indexed addressing, so R2 is incremented by 4 after every load and always points to the next element. We then add R7 to the sum in R6. The correctly classified count is the element whose row index (R4) equals its column index (R5); when the two are equal, we copy the element to R9 for later use. At the end of each inner-loop iteration, we increment the column index by 1.
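
To make the register-level description easier to follow, here is a minimal C sketch that mirrors the assembly routine step by step. The names `pdm_c` and `pdm_unscale` are hypothetical stand-ins chosen for this example; the actual C driver is not shown in this report, so the conversion helper is only an assumption about what it might look like.

```c
/* C mirror of the assembly routine (hypothetical name pdm_c).
 * Arguments follow the register assignment: m (R0), pd (R1), cm (R2), scale (R3). */
void pdm_c(int m, int *pd, const int *cm, int scale)
{
    for (int row = 0; row < m; row++) {          /* R4: row index */
        int sum = 0;                             /* R6: row sum */
        int correct = 0;                         /* R9: diagonal element */
        for (int col = 0; col < m; col++) {      /* R5: column index */
            int elem = *cm++;                    /* ldr R7, [R2], #4 */
            sum += elem;                         /* add R6, R6, R7 */
            if (row == col) {
                correct = elem;                  /* moveq R9, R7 */
            }
        }
        /* mul + sdiv, then str followed by add R1, R1, #4.
         * Integer division truncates, as sdiv does for these non-negative values. */
        *pd++ = (correct * scale) / sum;
    }
}

/* Assumed driver-side conversion back to a probability in [0, 1]. */
double pdm_unscale(int scaled, int scale)
{
    return (double)scaled / (double)scale;
}
```

With scale = 100000, the product `correct * scale` fits in a signed 32-bit register only while `correct` is at most 21474, so this scaling approach assumes the per-class counts stay below that bound.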
## Machine Code
| Operation | Machine Code |
| --------- | ------------ |
| ldr R7, [R2], #4 | 0b0000 01 001001 0010 0111 0000 00000100 => **0x04927004** |
| add R6, R6, R7 | 0b0000 00 0 0000 0 0110 0110 0000000 0 0111 => **0x00066007** |
| add R5, R5, #1 | 0b0000 00 0 0000 0 0101 0101 0000 00000001 => **0x00055001** |
| b innerloop | 0b1110 10 000000 000000000000 00110000 => **0xE8000030** |
| add R1, R1, #4 | 0b0000 00 0 0000 0 0001 0001 0000 00000100 => **0x00011004** |
| add R4, R4, #1 | 0b0000 00 0 0000 0 0100 0100 0000 00000001 => **0x00044001** |
| b outerloop | 0b1110 10 000000 000000000000 01110100 => **0xE8000074** |
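
The hexadecimal words for the `add` rows can be reproduced by packing the bit fields in the order in which the binary strings are grouped. The sketch below is only an illustration: the field layout (4-bit condition, 2-bit op, 1-bit immediate flag, 4-bit cmd, 1-bit S, 4-bit Rn, 4-bit Rd, 12-bit operand) is inferred from that grouping and is an assumption, the field values are copied from the table rather than derived from an instruction-set reference, and `encode_dp` is a hypothetical helper name.

```c
#include <stdint.h>

/* Pack data-processing fields into one 32-bit word.
 * Assumed layout, inferred from the grouping used in the table:
 * cond[31:28] op[27:26] I[25] cmd[24:21] S[20] Rn[19:16] Rd[15:12] operand[11:0] */
uint32_t encode_dp(uint32_t cond, uint32_t op, uint32_t i_bit,
                   uint32_t cmd, uint32_t s_bit,
                   uint32_t rn, uint32_t rd, uint32_t operand)
{
    return ((cond & 0xFu) << 28) |
           ((op & 0x3u) << 26) |
           ((i_bit & 0x1u) << 25) |
           ((cmd & 0xFu) << 21) |
           ((s_bit & 0x1u) << 20) |
           ((rn & 0xFu) << 16) |
           ((rd & 0xFu) << 12) |
           (operand & 0xFFFu);
}
```

With the field values from the table, `encode_dp(0, 0, 0, 0, 0, 6, 6, 7)` yields 0x00066007 for `add R6, R6, R7`, and `encode_dp(0, 0, 0, 0, 0, 5, 5, 1)` yields 0x00055001 for `add R5, R5, #1`.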
## MLA & MUL Function Design
The microarchitecture of the new design supporting the MLA and MUL operations (unchanged structures are omitted):

#### Brief description of implementation
- To let the Register File read Rs at the same time as the other operands, we added a new read port and a corresponding output port.
- The outputs [Rm] and [Rs] are passed to the Multiplier Block, which computes Rm * Rs.
- Two multiplexers are added after the Register File. The one controlled by 'MLA' ensures that Rn is added to the product if the operation is MLA, while 0 is added otherwise. The one controlled by 'MUL' selects the correct source between RD2 and the output of the Multiplier Block.
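
The behaviour of the added blocks can be summarised in a small C model. This is a behavioural sketch only, under the assumption that the ALU adds the two multiplexer outputs for multiply-class instructions; the function names `mul_select` and `multiply_result` are hypothetical and do not correspond to signals in the design.

```c
#include <stdint.h>

/* Multiplexer controlled by 'MUL': choose between RD2 and the multiplier output. */
uint32_t mul_select(int mul, uint32_t rd2, uint32_t product)
{
    return mul ? product : rd2;
}

/* Value produced for a multiply-class instruction (MUL or MLA). */
uint32_t multiply_result(int mla, uint32_t rn, uint32_t rm, uint32_t rs)
{
    uint32_t product = rm * rs;          /* Multiplier Block: Rm * Rs, low 32 bits */
    uint32_t addend = mla ? rn : 0u;     /* multiplexer controlled by 'MLA' */
    return addend + product;             /* ALU performs the addition */
}
```

In this model, MUL with Rm = 3 and Rs = 4 gives 12, and MLA with the same operands and Rn = 5 gives 17.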
#### Control Unit Design
| Control Units | Explanation |
| ------------- | ----------- |
| MLA = (op == 00)&&(S == 0)&&(L == 0)&&(M == 1)&&(cmd == 0b0001) | Asserted for MLA operation only |
| MUL = (op == 00)&&(S == 0)&&(L == 0)&&(M == 1) | Asserted for MUL and MLA operation |
| ALUControl = (op == 00) ? **(cmd == 0000) ? 0100 : ((cmd == 0001) ? 0100 : cmd)** : (U?0100:0010) | ALU should be performing addition for both MLA and MUL operations |
The remaining control signals are unchanged.
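
The decoding rules in the table can be written out as a short C sketch. It transcribes the table literally; the struct `instr_fields` and the function names are hypothetical, and the field names (op, S, L, M, cmd, U) are taken from the table as-is.

```c
/* Instruction fields as named in the control unit table (hypothetical container). */
typedef struct {
    unsigned op;   /* 2-bit op field */
    unsigned s;    /* S bit */
    unsigned l;    /* L bit */
    unsigned m;    /* M bit */
    unsigned cmd;  /* 4-bit cmd field */
    unsigned u;    /* U bit */
} instr_fields;

/* MUL: asserted for both MUL and MLA. */
int ctrl_mul(const instr_fields *f)
{
    return f->op == 0 && f->s == 0 && f->l == 0 && f->m == 1;
}

/* MLA: asserted for MLA only (cmd == 0b0001). */
int ctrl_mla(const instr_fields *f)
{
    return ctrl_mul(f) && f->cmd == 0x1;
}

/* ALUControl: addition (0b0100) for cmd 0b0000 and 0b0001, otherwise cmd itself. */
unsigned ctrl_alu_control(const instr_fields *f)
{
    if (f->op == 0) {
        return (f->cmd == 0x0 || f->cmd == 0x1) ? 0x4 : f->cmd;
    }
    return f->u ? 0x4 : 0x2;
}
```

For example, fields with op = 0, S = 0, L = 0, M = 1 and cmd = 0b0001 assert both MUL and MLA and select ALUControl = 0b0100.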