# Annotations and Explanations for Quiz 1 to 5
> 曹為廷
## Quiz 1
### Problem A
You are required to write C programs that convert between different floating-point formats. One of these formats is the half-precision floating-point, also known as `FP16` or `float16`, which is a compact representation designed to save memory. Half-precision numbers are increasingly used in fields like deep learning and image processing. Unlike the single-precision floating-point format (`float32`), defined by IEEE 754, FP16 has fewer bits allocated for the exponent.
```
  ┌ sign
  │
  │  ┌ exponent
  │  │
  │  │      ┌ mantissa
  │  │      │
  │┌─┴─┐┌───┴────┐
0b0000000000000000 IEEE binary16
```
The functions for converting between FP16 and float32 (FP32) are provided below.
```c
static inline float bits_to_fp32(uint32_t w)
{
    union {
        uint32_t as_bits;
        float as_value;
    } fp32 = {.as_bits = w};
    return fp32.as_value;
}

static inline uint32_t fp32_to_bits(float f)
{
    union {
        float as_value;
        uint32_t as_bits;
    } fp32 = {.as_value = f};
    return fp32.as_bits;
}
```
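The union trick above is the common idiom for bit reinterpretation. A minimal alternative sketch using `memcpy` (the function names here are illustrative, not part of the quiz) does the same job and is also well defined C, which is worth keeping in mind when Problem C later casts pointers directly.
```c
#include <stdint.h>
#include <string.h>

/* Same bit reinterpretation as the union helpers, via memcpy. */
static inline uint32_t fp32_to_bits_memcpy(float f)
{
    uint32_t w;
    memcpy(&w, &f, sizeof w);   /* bit-for-bit copy, no aliasing issue */
    return w;
}

static inline float bits_to_fp32_memcpy(uint32_t w)
{
    float f;
    memcpy(&f, &w, sizeof f);
    return f;
}
```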
Consider the `fp16_to_fp32` function, which converts an FP16 value to a single-precision IEEE floating-point (float32) format.
```c
static inline float fp16_to_fp32(fp16_t h)
{
    const uint32_t w = (uint32_t) h << 16;
    const uint32_t sign = w & UINT32_C(0x80000000);
    const uint32_t two_w = w + w;
    const uint32_t exp_offset = UINT32_C(0xE0) << 23;
    const float exp_scale = 0x1.0p-112f;
    const float normalized_value =
        bits_to_fp32((two_w >> 4) + exp_offset) * exp_scale;
    const uint32_t mask = UINT32_C(126) << 23;
    const float magic_bias = 0.5f;
    const float denormalized_value =
        bits_to_fp32((two_w >> 17) | mask) - magic_bias;
    const uint32_t denormalized_cutoff = UINT32_C(1) << 27;
    const uint32_t result =
        sign | (two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value)
                                            : fp32_to_bits(normalized_value));
    return bits_to_fp32(result);
}
```
Next, complete the `fp32_to_fp16` function, which is responsible for converting a float32 value to FP16. The identifiers starting with `#A` indicate the sections where you need to provide the appropriate answers.
```c
static inline fp16_t fp32_to_fp16(float f)
{
    const float scale_to_inf = 0x1.0p+112f;
    const float scale_to_zero = 0x1.0p-110f;
    float base = (fabsf(f) * scale_to_inf) * scale_to_zero;
    const uint32_t w = fp32_to_bits(f);
    const uint32_t shl1_w = w + w;
    const uint32_t sign = w & UINT32_C(0x80000000);
    uint32_t bias = shl1_w & UINT32_C(0xFF000000);
    if (bias < UINT32_C(0x71000000))
        bias = UINT32_C(0x71000000);
    base = bits_to_fp32((bias >> 1) + UINT32_C(0x07800000/*A01*/)) + base;
    const uint32_t bits = fp32_to_bits(base);
    const uint32_t exp_bits = (bits >> 13) & UINT32_C(0x00007C00);
    const uint32_t mantissa_bits = bits & UINT32_C(0x00000FFF/*A02*/);
    const uint32_t nonsign = exp_bits + mantissa_bits;
    return (sign >> 16) |
           (shl1_w > UINT32_C(0xFF000000) ? UINT16_C(0x7E00/*A03*/) : nonsign);
}
```
The values for `#A01`, `#A02`, and `#A03` must be expressed as hexadecimal integer literals in their shortest form. For example, use `0x7C00` rather than `0x007C00`.
A01 = `0x7800000`
A02 = `0xFFF`
:::warning
A02 = `0x7FF`~`0x7FFFFF` also work.
Any mask that keeps at least the rightmost 11 bits while excluding the exponent and sign fields still produces a correct result.
:::
A03 = `0x7E00`
#### 📝 Annotate
:::info
First, use `scale_to_inf` and `scale_to_zero` to detect whether the value `f` is outside the representable range of **FP16**.
For example:
* **A very small number**
After scaling, due to being too small, the bias will be forced to `0x71000000`, which will ultimately be converted to 0 (underflow).
* **A very large number**
After scaling, it will become `INF` (overflow). Finally, through the condition `(shl1_w > UINT32_C(0xFF000000))`, it will return **FP16** infinity `0x7E00`.
Next, let `w` be the 32-bit integer representation of the float `f`, and let `shl1_w` be the value of `w` left-shifted by one bit.
Then, extract the most significant bit of `w` as the `sign`.
The `bias` is obtained by extracting the leftmost 8 bits of `shl1_w`, which corresponds to the exponent part of `f`.
The reason why the minimum value of bias is `0x71000000` is as follows:
In **FP32**, the exponent bias is `127`, while in **FP16**, the exponent bias is `15`. When converting a 32-bit floating-point number to a 16-bit floating-point number, we need to adjust the exponent.
Specifically, we subtract (127 - 15), which is 112, from the **FP32** exponent. This is why the bias is set to no less than `0x71000000` (which equals 113 << 24). This value represents the smallest non-zero exponent that can be represented in **FP16**.
The value `0x07800000` corresponds to the exponent bias of 15 (the **FP16** bias) shifted left by 23 bits.
Thus, the expression `(bias >> 1) + UINT32_C(0x07800000)` can be understood as:
* Adding the FP16 bias (15) to the exponent part of bias.
The operation `base_new = bits_to_fp32((bias >> 1) + UINT32_C(0x07800000)) + base_old` effectively calculates:
* $$base_{new} = 2^{(\text{original exponent} + 15)} + base_{old}$$
According to the calculation above, the exponent of `base_new` is equal to the exponent of `f` + 15.
Then, by the rules of floating-point addition, the mantissa of `base_new` becomes the mantissa of `f` right-shifted by (15 - 2) bits, with rounding.
(The shift is 15 - 2 rather than 15 because the scaling made `base` four times larger than `f`, so its exponent is already 2 higher.)

`exp_bits` keeps bits 27-23 of `base_new` (the low five bits of its exponent field) and shifts them right by 13 so they land in the FP16 exponent position, bits 14-10.
Next, `mantissa_bits` keeps the low bits of `base_new`: bits 9-0 hold the FP16 mantissa, and bit 10 is the leading 1 of `base_new`'s mantissa field.
That extra bit is kept on purpose: when `exp_bits + mantissa_bits` is computed, bit 10 carries into the FP16 exponent field and supplies the missing +1 of the exponent, so `nonsign` becomes the correct FP16 encoding.

Finally, perform a bitwise OR operation between the `sign` and `nonsign` to obtain the final FP16 representation.
:::
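To sanity-check the completed `fp32_to_fp16`, a small round-trip sketch can be used. It assumes the two conversion routines above (plus `fabsf` and the bit helpers they rely on) are in the same translation unit and that `fp16_t` is a plain `uint16_t`; Inf/NaN patterns are skipped because NaN payloads need not round-trip bit-exactly.
```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: fp16_t is the raw 16-bit pattern, and fp16_to_fp32() /
 * fp32_to_fp16() from this problem are available here. */
int main(void)
{
    for (uint32_t h = 0; h < 0x10000; h++) {
        if (((h >> 10) & 0x1F) == 0x1F)    /* skip Inf/NaN encodings */
            continue;
        uint16_t back = fp32_to_fp16(fp16_to_fp32((fp16_t) h));
        if (back != (uint16_t) h)
            printf("mismatch: 0x%04x -> 0x%04x\n", h, back);
    }
    return 0;
}
```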
### Problem B
The bfloat16 format is a 16-bit floating-point representation, designed to provide a wide dynamic range by using a floating radix point. It is a shortened version of the 32-bit IEEE 754 single-precision format (binary32), aimed at accelerating machine learning.
The structure of the bfloat16 floating-point format is as follows.
```
  ┌ sign
  │
  │   ┌ exponent
  │   │
  │   │       ┌ mantissa
  │   │       │
  │┌──┴───┐┌──┴──┐
0b0000000000000000 bfloat16
```
Since bfloat16 (bf16) shares the same number of exponent bits as a 32-bit float, encoding and decoding values is relatively simple.
```c
static inline float bf16_to_fp32(bf16_t h)
{
    union {
        float f;
        uint32_t i;
    } u = {.i = (uint32_t)h.bits << 16};
    return u.f;
}
```
You are required to implement the `fp32_to_bf16` function, which converts a float32 to bfloat16. This conversion should be binary identical to the BFloat16 format. Floats must round to the nearest even value, and NaNs should be quiet. Subnormal values should not be flushed to zero, unless explicitly needed.
```c
static inline bf16_t fp32_to_bf16(float s)
{
    bf16_t h;
    union {
        float f;
        uint32_t i;
    } u = {.f = s};
    if ((u.i & 0x7fffffff) > 0x7f800000/*B01*/) { /* NaN */
        h.bits = (u.i >> 16) | 64; /* force to quiet */
        return h;
    }
    h.bits = (u.i + (0x7fff/*B02*/ + ((u.i >> 0x10/*B03*/) & 1))) >> 0x10/*B04*/;
    return h;
}
```
B01 = `0x7f800000`
B02 = `0x7fff`
B03 = `0x10`
:::warning
Why is it 16 instead of 15?
Bit 16 is the least significant bit that survives in the bf16 result. Adding `0x7fff` alone would always round an exact half-way value down; also adding bit 16 pushes the tie up only when that surviving LSB is 1, so ties always land on an even result, which is round-to-nearest-even.
:::
B04 = `0x10`
#### 📝 Annotate
:::info
In FP32, the bit pattern `0x7f800000` encodes infinity (all-ones exponent, zero mantissa).
If `u.i & 0x7fffffff` exceeds `0x7f800000`, the exponent of `s` is `11111111` and the mantissa is non-zero, i.e., the value is a NaN. In this case it is forced to a **quiet NaN (qNaN)** and returned.
(A quiet NaN has the most significant bit of its mantissa set to 1; after the right shift by 16 that bit has weight 64, so 64 is OR-ed in to guarantee it.)
`(u.i >> 16) & 1` reads bit 16 of `u.i`, the least significant bit that survives in the bf16 result; it serves as the tie-breaker for round-to-nearest-even.
`0x7fff + ((u.i >> 16) & 1)`:
* If that bit is 1, the rounding constant becomes `0x8000`, so an exact half-way value rounds up to the even neighbour.
* Otherwise it stays `0x7fff`, and a half-way value rounds down (the result is already even).
Add this constant to `u.i`, then right-shift the sum by 16 bits; the upper 16 bits form the final result.
:::
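A tiny self-contained sketch of just the rounding arithmetic makes the tie-break visible; the two input bit patterns below are hand-picked half-way cases, and the helper name is illustrative.
```c
#include <stdint.h>
#include <stdio.h>

/* The rounding expression from fp32_to_bf16, isolated. */
static uint16_t round_to_bf16_bits(uint32_t i)
{
    return (uint16_t)((i + (0x7fff + ((i >> 16) & 1))) >> 16);
}

int main(void)
{
    /* 0x3F808000 is exactly half-way between 0x3F80 and 0x3F81;
     * 0x3F80 is even, so the tie rounds down. */
    printf("0x%04x\n", round_to_bf16_bits(0x3F808000u));  /* prints 0x3f80 */
    /* 0x3F818000 is half-way between 0x3F81 and 0x3F82;
     * 0x3F81 is odd, so the tie rounds up to the even 0x3F82. */
    printf("0x%04x\n", round_to_bf16_bits(0x3F818000u));  /* prints 0x3f82 */
    return 0;
}
```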
### Problem C
In Problem A, the `fabsf` function was used within the `fp32_to_fp16` function. Now, you are required to implement your own version of `fabsf`.
```c
static inline float fabsf(float x) {
    uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
    i &= 0x7FFFFFFF; /*C01*/      // Clear the sign bit to get the absolute value
    x = *(float *)&i;             // Write the modified bits back into the float
    return x;
}
```
C01 = `0x7FFFFFFF`
#### 📝 Annotate
:::info

Clearing the sign bit (the most significant bit) with the mask `0x7FFFFFFF` yields the absolute value.
:::
Consider the C implementation of `__builtin_clz`:
```c
static inline int my_clz(uint32_t x) {
    int count = 0;
    for (int i = 31; i >= 0; --i) {
        if (x & (1U << i))
            break;
        count++;
    }
    return count;
}
```
The `__builtin_clz` function is a GCC built-in function that counts the number of leading zero bits in the binary representation of an unsigned integer, starting from the most significant bit (MSB). It returns an integer representing how many zero bits precede the first `1` in the binary representation of the number.
For example, given the number `x`, `__builtin_clz(x)` counts how many 0s are at the beginning of the binary form of `x`.
Examples of `__builtin_clz`
1. **Example 1**: For `x = 16`
* Binary representation: `00000000 00000000 00000000 00010000`
* The number of leading zero bits is `27` (as there are 27 zeros before the first `1`).
* Output: `__builtin_clz(16)` returns `27`.
2. **Example 2**: For `x = 1`
* Binary representation: `00000000 00000000 00000000 00000001`
* The number of leading zero bits is `31`.
* Output: `__builtin_clz(1)` returns `31`.
3. **Example 3**: For `x = 255`
* Binary representation: `00000000 00000000 00000000 11111111`
* The number of leading zero bits is `24`.
* Output: `__builtin_clz(255)` returns `24`.
4. **Example 4**: For `x = 0xFFFFFFFF` (which is `4294967295` in decimal)
* Binary representation: `11111111 11111111 11111111 11111111`
* There are no leading zeros.
* Output: `__builtin_clz(0xFFFFFFFF)` returns `0`.
5. **Example 5**: For `x = 0` (undefined behavior)
* Binary representation: `00000000 00000000 00000000 00000000`
* Since `x = 0`, the result is **undefined** according to the GCC manual.
* Output: **undefined**.
Next, you will implement the `fp16_to_fp32` function, which converts a 16-bit floating-point number in IEEE half-precision format (bit representation) to a 32-bit floating-point number in IEEE single-precision format (bit representation). This implementation will avoid using any floating-point arithmetic and will utilize the `clz` function discussed earlier.
```c
static inline uint32_t fp16_to_fp32(uint16_t h) {
/*
* Extends the 16-bit half-precision floating-point number to 32 bits
* by shifting it to the upper half of a 32-bit word:
* +---+-----+------------+-------------------+
* | S |EEEEE|MM MMMM MMMM|0000 0000 0000 0000|
* +---+-----+------------+-------------------+
* Bits 31 26-30 16-25 0-15
*
* S - sign bit, E - exponent bits, M - mantissa bits, 0 - zero bits.
*/
const uint32_t w = (uint32_t) h << 16;
/*
* Isolates the sign bit from the input number, placing it in the most
* significant bit of a 32-bit word:
*
* +---+----------------------------------+
* | S |0000000 00000000 00000000 00000000|
* +---+----------------------------------+
* Bits 31 0-31
*/
const uint32_t sign = w & UINT32_C(0x80000000);
/*
* Extracts the mantissa and exponent from the input number, placing
* them in bits 0-30 of the 32-bit word:
*
* +---+-----+------------+-------------------+
* | 0 |EEEEE|MM MMMM MMMM|0000 0000 0000 0000|
* +---+-----+------------+-------------------+
* Bits 31 26-30 16-25 0-15
*/
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
/*
* The renorm_shift variable indicates how many bits the mantissa
* needs to be shifted to normalize the half-precision number.
* For normalized numbers, renorm_shift will be 0. For denormalized
* numbers, renorm_shift will be greater than 0. Shifting a
* denormalized number will move the mantissa into the exponent,
* normalizing it.
*/
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
/*
* If the half-precision number has an exponent of 15, adding a
* specific value will cause overflow into bit 31, which converts
* the upper 9 bits into ones. Thus:
* inf_nan_mask ==
* 0x7F800000 if the half-precision number is
* NaN or infinity (exponent of 15)
* 0x00000000 otherwise
*/
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
INT32_C(0x7F800000);
/*
* If nonsign equals 0, subtracting 1 will cause overflow, setting
* bit 31 to 1. Otherwise, bit 31 will be 0. Shifting this result
* propagates bit 31 across all bits in zero_mask. Thus:
* zero_mask ==
* 0xFFFFFFFF if the half-precision number is
* zero (+0.0h or -0.0h)
* 0x00000000 otherwise
*/
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
/*
* 1. Shifts nonsign left by renorm_shift to normalize it (for denormal
* inputs).
* 2. Shifts nonsign right by 3, adjusting the exponent to fit in the
* 8-bit exponent field and moving the mantissa into the correct
* position within the 23-bit mantissa field of the single-precision
* format.
* 3. Adds 0x70 to the exponent to account for the difference in bias
* between half-precision and single-precision.
* 4. Subtracts renorm_shift from the exponent to account for any
* renormalization that occurred.
* 5. ORs with inf_nan_mask to set the exponent to 0xFF if the input
* was NaN or infinity.
* 6. ANDs with the inverted zero_mask to set the mantissa and exponent
* to zero if the input was zero.
* 7. Combines everything with the sign bit of the input number.
*/
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70/*C02*/ - renorm_shift) << 23/*C03*/)) | inf_nan_mask) & ~zero_mask);
}
```
C02 = `0x70`
C03 = `0x17`
#### 📝 Annotate
:::info
The variables `w`, `sign`, and `nonsign` are explained in detail in the comments.
`renorm_shift` calculates the number of leading zeros. If this number exceeds `5`, it indicates that `h` is denormalized. In such cases, converting to **FP32** format increases the representable range, so re-normalization is required.

If the FP16 exponent field of `nonsign` is all ones (i.e., `nonsign >= 0x7C000000`, representing NaN or infinity), adding `0x04000000` carries into bit 31. The arithmetic right shift by 8 then smears that sign bit across the top nine bits, and masking with `0x7F800000` leaves `inf_nan_mask = 0x7F800000`. Otherwise `inf_nan_mask` is `0x00000000`.
If `nonsign` is 0, subtracting 1 wraps around to `0xFFFFFFFF`; the arithmetic shift right by 31 then makes `zero_mask = 0xFFFFFFFF`, and it is `0x00000000` for any non-zero input.
Final steps:
1. `nonsign << renorm_shift >> 3`
This adjusts the mantissa of an **FP16** denormalized number by left-shifting it to match the position of the mantissa in an FP32 normalized number.

2. `(0x70 - renorm_shift) << 23`
* The difference between the FP32 bias and the FP16 bias is `127 - 15 = 112 = 0x70`.
* Subtracting `renorm_shift` compensates for the left shift performed in step 1.
* The result is then left-shifted by 23 bits to place it in the exponent part of the FP32 format.
3. Adding the results of steps 1 and 2
* This produces the correct FP32 representation with the adjusted exponent and mantissa.
* Finally, check for special cases such as infinity (INF), NaN, and zero. Once these are handled, add the sign bit to complete the conversion.
:::
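A few spot checks, assuming `my_clz` and the `fp16_to_fp32` above are in the same file, cover the normalized, subnormal, infinity, and signed-zero paths:
```c
#include <assert.h>
#include <stdint.h>

int main(void)
{
    assert(fp16_to_fp32(0x3C00) == 0x3F800000u); /* 1.0h  -> 1.0f                */
    assert(fp16_to_fp32(0x0001) == 0x33800000u); /* smallest subnormal, 2^-24    */
    assert(fp16_to_fp32(0x7C00) == 0x7F800000u); /* +inf  -> +inf                */
    assert(fp16_to_fp32(0x8000) == 0x80000000u); /* -0.0h -> -0.0f               */
    return 0;
}
```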
### Problem D
The C function `is_aligned` determines whether a given pointer `ptr` is aligned to the specified `alignment`. It returns 1 if the pointer is properly aligned and 0 otherwise.
```c
bool is_aligned(void *ptr, size_t alignment) {
    return ((uintptr_t) ptr % alignment) == 0;
}
```
If the alignment is guaranteed to be a power of 2, the modulo operation can be optimized using a bitwise-AND operation.
You are expected to write a function `align_ptr` that aligns a pointer to the next address boundary based on the specified alignment. This function adjusts the given pointer to the next memory address that is a multiple of the provided alignment (which must be a power of 2, such as 1, 2, 4, or 8). If the pointer is already aligned, the original pointer will be returned. The alignment operation is performed using bitwise arithmetic without any conditional checks, making the function efficient.
```c
static inline uint8_t *align_ptr(const uint8_t *ptr, uintptr_t alignment) {
    alignment--; /* Adjust alignment to create a mask for the bitwise operation */
    return (uint8_t *) (((uintptr_t) ptr + alignment) & ~alignment /*D01*/ );
}
```
D01 = `~alignment`
#### 📝 Annotate
:::info
Assuming `alignment` = `8`, decrementing it (`alignment--`) changes it to `7` (`0b0111`).
* Adding `7` to `ptr` guarantees the result reaches or passes the next 8-byte boundary.
* ANDing with `~7` (`0b1111...1000`) then clears the low bits of the address, so the result is the next multiple of `8` (or `ptr` itself if it was already aligned).
:::
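A short usage sketch with synthetic addresses shows the effect of the mask trick (the values are chosen purely for illustration):
```c
#include <assert.h>
#include <stdint.h>

int main(void)
{
    uintptr_t alignment = 8;
    uintptr_t mask = alignment - 1;                  /* 0b0111 */
    assert(((0x1003 + mask) & ~mask) == 0x1008);     /* rounded up to boundary */
    assert(((0x2000 + mask) & ~mask) == 0x2000);     /* already aligned stays  */
    return 0;
}
```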
Assuming that `uintptr_t` represents the native word size of the platform (which is typically 4 or 8 bytes), you are required to implement your own version of `memcpy` that prevents word-sized memory accesses at unaligned addresses, thus avoiding potential performance penalties or misaligned access problems. This implementation is more efficient than byte-by-byte copying, especially for larger memory blocks, as it utilizes word-sized memory transfers where possible.
```c
#include <stddef.h>
#include <stdint.h>
void *my_memcpy(void *dest, const void *src, size_t n) {
    uint8_t *d = (uint8_t *)dest;
    const uint8_t *s = (const uint8_t *)src;
    /* Copy byte-by-byte until destination is word-aligned */
    while (n > 0 && ((uintptr_t)d % sizeof(uintptr_t)) != 0) {
        *d++ = *s++; /*D02*/
        n--;
    }
    /* Copy word-by-word */
    uintptr_t *d_word = (uintptr_t *)d;
    const uintptr_t *s_word = (const uintptr_t *)s;
    while (n >= sizeof(uintptr_t)) {
        *d_word++ = *s_word++; /*D03*/
        n -= sizeof(uintptr_t);
    }
    /* Copy any remaining bytes */
    d = (uint8_t *)d_word;
    s = (const uint8_t *)s_word;
    while (n > 0 /*D04*/ ) {
        *d++ = *s++; /*D05*/
        n--;
    }
    return dest;
}
```
D02 = `*d++ = *s++`
D03 = `*d_word++ = *s_word++`
D04 = `n > 0` OR `n` OR `n != 0`
D05 = `*d++ = *s++`
#### 📝 Annotate
:::info
First, copy byte by byte until the destination address is aligned to the word boundary (4- or 8-byte alignment):
* Copy the content: `*d = *s`
* Advance the pointers: `d++, s++`
Second, copy in word-sized chunks (4 or 8 bytes):
* Use word pointers for efficient copying: `*d_word++ = *s_word++`
* This step copies an entire word at a time, improving throughput.
Finally, handle the leftover bytes:
* Copy any remaining bytes that do not fill a whole word.
* The logic is the same as in the first step.
:::
### Problem E
Examine a C program that includes multiple memory bugs, categorized as shown in the following table.
| **ID** | **Memory bug** |
| :-- | :-- |
| A | Dangling Pointer |
| B | Mismatched Memory Management Functions |
| C | Use-After-Free |
| D | Invalid Free |
| E | Memory Leak |
| F | Out-of-Bounds Access |
| G | Double Free |
```c
#include <stdio.h>
#include <stdlib.h>
/* Bug 1: Memory Leak */
void func1()
{
int *data = malloc(100 * sizeof(int));
if (data == NULL) {
printf("Memory allocation failed\n");
return;
}
/*E01*/
/* Memory leak: data is never freed */
}
/* Bug 2: Invalid Memory Access (Out-of-Bounds Access) */
void func2()
{
int *arr = malloc(5 * sizeof(int));
if (!arr) {
printf("Memory allocation failed\n");
return;
}
/*E02*/
/* Invalid memory access: writing beyond allocated memory */
for (int i = 0; i <= 5; i++)
arr[i] = i;
free(arr);
}
/* Bug 3: Double Free */
void func3()
{
int *ptr = malloc(10 * sizeof(int));
if (!ptr) {
printf("Memory allocation failed\n");
return;
}
free(ptr);
/*E03*/
/* Double free: freeing the same memory again */
free(ptr);
}
/* Problem 4: Dangling Pointer */
int *func4()
{
int *arr = malloc(5 * sizeof(int));
if (!arr)
return NULL;
/*E04*/
/* Memory is freed, but pointer is returned */
free(arr);
return arr;
}
/* Problem 5: Invalid Free (Freeing Stack Memory) */
void func5()
{
int local_var = 42;
/*E05*/
/* Invalid free: trying to free stack-allocated memory */
free(&local_var);
}
/* Problem 6: Mismatched Memory Management Functions */
void func6()
{
/* Memory allocated with malloc */
int *data = malloc(10 * sizeof(int));
if (!data)
return;
/*E06*/
/* Incorrect deallocation using delete (should use free) */
delete data;
}
int main()
{
/* Call each function to demonstrate the memory issues */
func1();
func2();
func3();
int *dangling_ptr = func4();
if (dangling_ptr)
printf("Dangling pointer value: %d\n", dangling_ptr[0]);
func5();
func6();
return 0;
}
```
Now, you need to identify the causes of the memory bugs by selecting from the provided table and proposing fixes. Fields `#E01`, `#E02`, `#E03`, `#E04`, `#E05`, and `#E06` must be filled in the format "ID:code", where ID corresponds to the item listed in the table. For example, 'A' stands for 'Dangling pointer,' and 'code' represents your proposed change. If the code is empty, it means the suggested action is to remove the indicated line. For instance, 'D:' indicates the memory bug is classified as 'invalid free,' and the corresponding line should be removed. Note that no spaces are allowed in `#E01`, `#E02`, `#E03`, `#E04`, `#E05`, and `#E06`.
E01 = `E:free(data);`
E02 = `F:for(i=0;i<5;i++)`
E03 = `G:`
E04 = `A:`
E05 = `D:`
E06 = `B:free(data);`
#### 📝 Annotate
:::info
The code section has already been thoroughly explained, so no additional annotations are provided here.
:::
### Problem F
You are required to write a C function that allocates nodes for a singly linked list with a specified number of nodes. The function dynamically allocates memory for the list, properly linking each node’s next pointer. The data field of each node remains uninitialized. It returns a pointer to the head of the newly created linked list, or NULL if the length is 0. The caller is responsible for freeing the allocated memory.
```c
typedef struct node {
    int data;
    struct node *next;
} node_t;

node_t *list_create(const int length)
{
    node_t *head;
    node_t **ptr = &head;
    for (int i = 0; i < length; ++i) {
        node_t *node = malloc(sizeof(node_t));
        if (!node) {
            printf("Memory allocation failed\n");
            exit(1);
        }
        node->data = i; /* Initialize the data field for testing */
        *ptr = node;
        ptr = &node->next; /*F01*/
    }
    *ptr = NULL;
    return head;
}
```
F01 = `&node->next` OR `&(node->next)`
#### 📝 Annotate
:::info
To ensure that the node added in the next iteration of the loop does not break the chain, `ptr` must already point at `node->next`.
Memory relationship:
`ptr` (a `node_t **`) → the previous `next` pointer (or `head`) → the current **node** (`data`, `next`)
After `*ptr = node`, advancing `ptr = &node->next` makes `ptr` refer to the new tail's `next` field, which the next iteration (or the final `*ptr = NULL`) fills in.
:::
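A usage sketch, assuming `node_t`, `list_create`, and the standard headers are in scope; the caller walks the list and then frees every node:
```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    node_t *head = list_create(3);
    for (node_t *cur = head; cur; cur = cur->next)
        printf("%d ", cur->data);     /* prints: 0 1 2 */
    printf("\n");
    while (head) {                    /* caller is responsible for freeing */
        node_t *next = head->next;
        free(head);
        head = next;
    }
    return 0;
}
```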
## Quiz 2
### Problem A
A01 = `addi t1, x0, 32` OR equivalent
A02 = `bltz`
A03 = `bltz`
A04 = `shift_and_add_loop`
A05 = `shift_and_add_loop`
A06 = `andi`
A07 = `shift_and_add_loop`
A08 = `sw t0, 0(a4)`
#### 📝 Annotate
:::info
`#A01 # Set bit counter`
From the line `addi t1, t1, -1 # Decrease bit counter`, it can be inferred that the counter is stored in the register `t1`.
So `A01` = `addi t1, x0, 32` or an equivalent is used to initialize the counter `t1` with the value 32.
`#A02 a1, handle_negative1 # If multiplier negative`
If the multiplier is negative (multiplier < 0), jump to `handle_negative1`.
So, `#A02` = `bltz`
`#A03 a3, handle_negative2 # If multiplicand negative`
Same principle as `#A02`
`j, #A04 # Continue to main loop`
jump to `shift_and_add_loop:` So, `#A04` = `shift_and_add_loop`
`j, #A05 # Skip to main loop`
Same principle as `#A04`
`#A06 t2, a1, 1 # Check least significant bit`
`a1 & 0x001` checks the least significant bit.
So, `#A06` = `andi`
`j #A07 # Repeat loop`
Same principle as `#A04`
`#A08 # Store final result`
Store the accumulator `t0` to the memory address held in `a4`.
So, `#A08` = `sw t0, 0(a4)`
:::
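For reference, here is a C sketch of the shift-and-add multiplication that the fragments above suggest the assembly performs; the sign handling and the 32-iteration counter are assumptions based on the annotations, not the original listing.
```c
#include <assert.h>
#include <stdint.h>

static int32_t shift_add_mul(int32_t multiplicand, int32_t multiplier)
{
    uint32_t a = multiplicand < 0 ? 0u - (uint32_t)multiplicand : (uint32_t)multiplicand;
    uint32_t b = multiplier   < 0 ? 0u - (uint32_t)multiplier   : (uint32_t)multiplier;
    int negative = (multiplicand < 0) ^ (multiplier < 0);
    uint32_t acc = 0;
    for (int i = 0; i < 32; i++) {    /* bit counter, like t1 = 32 */
        if (b & 1)                    /* andi: check least significant bit */
            acc += a;
        a <<= 1;                      /* shift multiplicand left  */
        b >>= 1;                      /* shift multiplier right   */
    }
    return negative ? (int32_t)(0u - acc) : (int32_t)acc;
}

int main(void)
{
    assert(shift_add_mul(-6, 7) == -42);
    return 0;
}
```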
### Problem B
B01 = `-20`
B02 = `beq`
B03 = `jal`
B04 = `jal x1, hanoi`
B05 = `jalr`
B06 = `20`
B07 = `jalr`
B08 = `20`
B09 = `jalr`
B10 = `jal`
B11 = `addi a7, x0`
:::warning
or B11 = `li a7,`
:::
#### 📝 Annotate
:::info
`addi sp, sp, #B01 #Allocate stack`
The code below stores 5 registers: `x1`, `a1`, `a2`, `a3`, and `a4`.
Each register occupies 4 bytes, so the required space is `5 × 4 = 20` bytes.
So `#B01` = `-20`
`#B02 a1, t0, return # If there's only 1 disk, jump to return`
If `a1` equals `t0` (where `t0 = 1`), jump to `return`. So `#B02` = `beq`.
`#B03 x1, hanoi # Recursive call to hanoi`
Since the function needs to return after execution and the return address must be saved to `x1`, `B03` is the `jal` (Jump and Link) instruction.
`#B04 # Recursive call to hanoi`
Same principle as `#B03`
`#B05 x0, x1, 0 # Return to the caller`
`jal x1, hanoi` # Call the function: save the return address in x1
`jalr x0, x1, 0` # Return: read the address from x1 and jump back
Equivalent to the following in C:
```c
void hanoi() {
// ...
return; // Uses jalr to return
}
```
Thus, `#B05` = `jalr`.
`addi sp, sp, #B06 # Deallocate space on the stack`
Based on the instruction `lbu a3, 16(sp)`, it can be determined that `#B06` = `20`
`#B07 x0, x1, 0 # Return to the caller`
Same principle as `#B05`
`addi sp, sp, #B08 # Deallocate space on the stack`
Since `#B01` and `#B08` correspond to each other, it follows that `#B08` = `20`.
`#B09 x0, x1, 0 # Return to the caller`
Same principle as `#B05`
`#B10 x1, hanoi # Call the hanoi function`
Same principle as `#B03`
`#B11, 10 #Exit syscall`
RISC-V system call (syscall):
1. `a7` register is used to store the system call number.
2. `li` (Load Immediate) instruction is used to load an immediate value.
```
li a7, 10
ecall
```
This is equivalent to the following in C:
```
exit(0);
```
:::
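The assembly is the classic recursive Tower of Hanoi; a C sketch of the control flow it appears to implement (argument meanings are assumptions inferred from the annotated fragments) looks like this:
```c
#include <stdio.h>

static void hanoi(int n, char from, char aux, char to)
{
    if (n == 1) {                          /* beq a1, t0, return */
        printf("Move Disk 1 from '%c' to '%c'\n", from, to);
        return;
    }
    hanoi(n - 1, from, to, aux);           /* first recursive jal  */
    printf("Move Disk %d from '%c' to '%c'\n", n, from, to);
    hanoi(n - 1, aux, from, to);           /* second recursive jal */
}

int main(void)
{
    hanoi(3, 'A', 'B', 'C');
    return 0;
}
```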
### Problem C
```c
/* Iterative Tower of Hanoi Using Gray Code */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
static void print_move(int disk, char from, char to)
{
printf("Move Disk %d from '%c' to '%c'\n", disk, from, to);
}
int main()
{
int n_disks = 3; /* Number of disks */
int total_moves = (1 << n_disks) - 1;
const char pegs[3] = {'A', 'B', 'C'}; /* Peg labels */
int pos[n_disks]; /* Disk positions: 0-A, 1-B, 2-C */
/* Initialize all disks to peg A */
for (int i = 0; i < n_disks; i++)
pos[i] = 0;
/* Set direction based on parity of number of disks */
int direction = (n_disks & 1) ? -1 /* counter-clockwise */
: 1; /* clockwise */
/* Predefined packed mapping arrays */
const uint8_t peg_map[3] = {
0x9, /* Peg A: {CCW -> C (2), CW -> B (1)} */ // #C01
0x2, /* Peg B: {CCW -> A (0), CW -> C (2)} */ // #C02
0x4 /* Peg C: {CCW -> B (1), CW -> A (0)} */ // #C03
};
/* Calculate direction index: -1 -> 0 (CCW), 1 ->1 (CW) */
int direction_index = (direction + 1) / 2;
for (int n_moves = 1; n_moves <= total_moves; n_moves++) {
int curr_gray = n_moves ^ (n_moves >> 1);
int prev_gray = (n_moves - 1) ^ ((n_moves - 1) >> 1);
int changed_bit = curr_gray ^ prev_gray;
/* Identify the disk to move (0-indexed) */
int disk = __builtin_popcount(changed_bit - 1); // #C04
/* Current peg of the disk */
int curr_peg = pos[disk];
int new_peg;
if (disk == 0) {
/* Calculate shift: direction_index=0 (CCW) -> shift2,
* direction_index=1 (CW) -> shift0
*/
int shift = (1 - direction_index) << 1;
new_peg = (peg_map[curr_peg] >> shift) & 0x3; // #C05
} else {
/* Find the only peg not occupied by any smaller disk */
bool found_new_peg = false;
for (int p = 0; p < 3 && !found_new_peg; p++) {
if (p == curr_peg)
continue;
/* Check if any smaller disk is on peg p */
bool has_smaller = false;
for (int d = 0; d < disk; d++) {
if (pos[d] == p) {
has_smaller = true;
break;
}
}
if (!has_smaller) {
new_peg = p;
found_new_peg = true;
}
}
}
/* Execute the move */
print_move(disk + 1, pegs[curr_peg], pegs[new_peg]);
pos[disk] = new_peg;
}
return 0;
}
```
C01 = `0x9`
C02 = `0x2`
C03 = `0x4`
C04 = `changed_bit-1`
C05 = `0x3`
#### 📝 Annotate
:::info
Peg `A` movement options: `0b1001`
CCW to `C(2)` set `0b10`, CW to `B(1)` set `0b01`
Peg `B` movement options: `0b0010`
CCW to `A(0)` set `0b00`, CW to `C(2)` set `0b10`
Peg `C` movement options: `0b0100`
CCW to `B(1)` set `0b01`, CW to `A(0)` set `0b00`
`int disk = __builtin_popcount(#C04);` decides which disk to move.
First, observe the movement pattern of the Tower of Hanoi:
* The smallest disk (disk 0) moves every 1 step.
* The second smallest disk (disk 1) moves every 2 steps.
* Disk 2 moves every 4 steps.
* Disk 3 moves every 8 steps.
* and so on...
Now, observe the pattern of Gray codes:
* bit_0 changes every 1 step.
* bit_1 changes every 2 steps.
* And so on.
This matches the movement pattern of the Tower of Hanoi, where each disk's movement frequency follows the same rule as the corresponding Gray code bit's change frequency.
```
move changed_bit changed_bit-1 popcount disk
1 1 (0001) 0 (0000) 0 0
2 2 (0010) 1 (0001) 1 1
3 1 (0001) 0 (0000) 0 0
4 4 (0100) 3 (0011) 2 2
5 1 (0001) 0 (0000) 0 0
6 2 (0010) 1 (0001) 1 1
7 1 (0001) 0 (0000) 0 0
8 8 (1000) 7 (0111) 3 3
```
1. `changed_bit` always has exactly one bit set, so it is a power of 2.
2. The number of 1 bits in `changed_bit - 1` is the index of that set bit, which is exactly the disk number to move.
3. `#C04` = `changed_bit-1`
new_peg = (peg_map[curr_peg] >> shift) & `#C05`
* Regardless of the direction (clockwise or counterclockwise), you only need to take 2 bits to determine the new_peg.
* `#C05` = `0b11` = `0x3`
:::
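Since `changed_bit` is always a power of two, `__builtin_popcount(changed_bit - 1)` is simply the index of its set bit; the short check below (GCC built-ins) confirms it matches `__builtin_ctz` for the 7 moves of a 3-disk game.
```c
#include <assert.h>

int main(void)
{
    for (int move = 1; move <= 7; move++) {
        int curr = move ^ (move >> 1);
        int prev = (move - 1) ^ ((move - 1) >> 1);
        int changed_bit = curr ^ prev;           /* exactly one bit set */
        assert(__builtin_popcount(changed_bit - 1) == __builtin_ctz(changed_bit));
    }
    return 0;
}
```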
### Problem D
D01 = `add t2, t2, t1`
D02 = `srli`
D03 = `slli`
D04 = `sw t2, 0(x1)` OR equivalent
#### 📝 Annotate
:::info
The code section has already been thoroughly explained, so no additional annotations are provided here.
:::
## Quiz 3
### Problem A
A01 = ==`0x37`==
A02 = ==`0x17`==
A03 = ==`7`==
A04 = ==`15`==
A05 = ==`12`==
A06 = ==`20`==
A07 = ==`-4`==
A08 = ==`4`==
#### 📝 Annotate
:::info
A01, A02: `lui` and `auipc` are U-type instructions with opcodes `0x37` and `0x17` respectively.
A03~A05: these are the bit offsets of the fields in the RV32 encoding: `rd` starts at bit 7, `funct3` at bit 12, and `rs1` at bit 15.
A06: according to the comment `/* imm[11:0] = inst[31:20] */`, the immediate is obtained by shifting the instruction right by 20 bits.
A07: `auipc` adds its upper immediate to the address of the `auipc` instruction itself.
In the `main` function the program counter (`PC`) has already been advanced by `PC + 4` before the instruction is handled, so to emulate `auipc` correctly the original `PC` must be restored by subtracting 4.
A08: In general, when a jump does not occur, `PC = PC + 4`
:::
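A minimal decoding sketch using the standard RV32 field positions shows where the shift amounts 7, 12, 15, and 20 come from (the helper names are illustrative):
```c
#include <stdint.h>

static inline uint32_t dec_opcode(uint32_t inst) { return inst & 0x7F; }
static inline uint32_t dec_rd(uint32_t inst)     { return (inst >> 7)  & 0x1F; }
static inline uint32_t dec_funct3(uint32_t inst) { return (inst >> 12) & 0x7;  }
static inline uint32_t dec_rs1(uint32_t inst)    { return (inst >> 15) & 0x1F; }
static inline int32_t  dec_i_imm(uint32_t inst)  { return (int32_t)inst >> 20; } /* imm[11:0] = inst[31:20] */

int main(void)
{
    uint32_t inst = 0x00500093;   /* addi x1, x0, 5 */
    return !(dec_opcode(inst) == 0x13 && dec_rd(inst) == 1 &&
             dec_funct3(inst) == 0 && dec_rs1(inst) == 0 && dec_i_imm(inst) == 5);
}
```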
### Problem B
B01 = ==`0`==
B02 = ==`32`==
B03 = ==`33`==
B04 = ==`34`==
B05 = ==`perform_one_cycle_of_booths_mutliplication(csa_out, multiplicand, multiplier)`==
B06 = ==`23`==
B07 = ==`15`==
B08 = ==`7`==
B09 = ==`31`==
B10 = ==`sign_extend(partial_carry.lo, shift_amount, 64)`==
B11 = ==`adder(partial_sum.hi, partial_carry.hi, alu_carry_in)`==
#### 📝 Annotate
:::info
:::
### Problem C
C01 = ==`srli a1, a1, 8`==
C02 = ==`beq a2, a4, sq_1`==
C03 = ==`srai a2, a2, 1`==
C04 = ==`srli a3, a1, 21`==
C05 = ==`srai a0, a0, 12`==
C06 = ==`mul a0, a0, a0`==
C07 = ==`srli a0, a0, 15`==
C08 = ==`srai a0, a0, 21`==
C09 = ==`slli a1, a1, 9`==
C10 = ==`srai a0, a0, 5`==
C11 = ==`addi a0, a0, 16 `==
C12 = ==`sltu a4, a3, a0`==
C13 = ==`slli a1, a1, 16`==
C14 = ==`slli a2, a2, 23`==
#### 📝 Annotate
:::info
Our goal is to approximate $\frac{1}{\sqrt y}$. Define $f(m) = \frac{1}{m^2}-y$ and solve for the root of $f(m) = 0$, which is $m = \frac{1}{\sqrt y}$.
The Newton-Raphson formula is: $m_{i+1} = m_i - \frac{f(m_i)}{f'(m_i)}$
First, take the derivative of $f$: $f'(m) = -\frac{2}{m^3}$
Substituting $f(m_i)$ and $f'(m_i)$ into the Newton-Raphson formula and simplifying gives: ==$m_{n+1} = m_n \times (1.5 - 0.5 \times y \times m_n^{2})$==
The basic iteration converges quickly but its precision is limited. A correction formula is therefore applied as the final step; it converges more slowly but is effective for high-precision results: ==$y_1 = y_0 - \frac{d\,y_0}{2}$==, where $x$ is the original input value, $y_0$ is the current reciprocal-square-root approximation, $d$ is its residual error (presumably $x \times y_0^{2} - 1$), and $y_1$ is the corrected approximation.
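A floating-point sketch of the derived iteration shows how quickly two steps converge; the arbitrary initial guess of 0.7 stands in for the table lookup used by the assembly.
```c
#include <math.h>
#include <stdio.h>

/* Two Newton-Raphson steps for 1/sqrt(y); then sqrt(y) = y * m. */
static float rsqrt_newton(float y, float m0)
{
    float m = m0;
    for (int i = 0; i < 2; i++)
        m = m * (1.5f - 0.5f * y * m * m);
    return m;
}

int main(void)
{
    float y = 2.0f;
    float m = rsqrt_newton(y, 0.7f);
    printf("sqrt(%g) ~= %g (libm: %g)\n", y, y * m, sqrtf(y));
    return 0;
}
```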
The explanation of implementing `sqrtf` in RISC-V assembly code is as follows:
1. Check if `a0` is negative. If it is, jump to `sq_0`.
2. Extract the mantissa (23+1 bits) in `a1`. Left-shift the mantissa by 8 bits, add the hidden 1 from the floating-point representation, then right-shift it by 8 bits (==C01== = `srli a1, a1, 8`) to restore it to its original position.
For example:
```
0 11010110 11010110000000000000000 -> shift left
0 11010110 00000000000000000000000 -> Set bit
1 11010110 00000000000000000000000 -> Restore
0 00000001 11010110000000000000000
```
3. Extract exponent and store it in `a2`. If the exponent is 0, jump to `sq_2` (which returns 0). If the exponent is 255 (INF), jump to `sq_1` (which returns INF). The instruction for this is ==C02== = `beq a2, a4, sq_1`.
4. Pre-correction for packing:
* The square root calculation halves the exponent. Adding the bias of 127 and then subtracting 2 (resulting in 125) ensures the exponent remains in the correct range after halving, while also preventing overflow or underflow.
5. Halve the exponent(square root operation halves the exponent) - if odd, double the mantissa.
6. Using the lookup method to obtain an initial approximation of the square root:
1. Prepare a predefined lookup table (`rsqrtapp`)
2. Use the high 4 bits of the mantissa as the index
3. Obtain the initial approximation value
4. Use this value as the starting point for Newton-Raphson iteration
* The mantissa originally occupies 23 bits. Adding the hidden 1 makes it 24 bits, and the extra left shift by 1 bit from the previous step makes it 25 bits. To extract the top 4 bits, the value must therefore be shifted right by 21 bits (==C04== = `srli a3, a1, 21`).
7. Perform the first Newton-Raphson iteration:
* Since the `a4` value occupies 8 bits and the mantissa will be multiplied by `a4` twice, the mantissa is right-shifted by 7 bits (8 + 8 + 23 - 32) so the products stay within 32 bits.
* The second `mul a0, a0, a4` corresponds to $y\times m_n^2$
* Adjust Q32 format to Q20: ==C05== = `srai a0, a0, 12`
* Process the correction formula to finally obtain a more accurate $m_{n+1}$
8. Perform the second Newton-Raphson iteration.
* Square m1: $m1 \times m1$ -> ==C06== = `mul a0, a0, a0`
* Scale from Q32 to Q17, ==C07== = `srli a0, a0, 15`
* Scale from Q36 to Q15, ==C08== = `srai a0, a0, 21`
9. Perform the final precision correction and rounding:
* Calculate the remainder: Determine the difference between the current approximation and the actual value.
* Generate a correction term: Compute a small adjustment to refine the approximation further.
* Adjust the approximate solution: Apply the correction term to improve the final result.
* Rounding strategy: Decide on an appropriate rounding method (e.g., round up, round down, or round to the nearest) to achieve the desired precision.
:::
### Problem D
#### 📝 Annotate
D01 = ==`Out = (¬A + ¬B)¬C`==
> Alternative: `Out = ¬(AB + C)` (equivalent by De Morgan's law)
:::info
1. C must be 0 for the PMOS on the top to conduct: `¬C`.
2. Either A or B must be 0 for the corresponding PMOS on the top to conduct: `(¬A + ¬B)`.
1 and 2 are connected in series, resulting in: `¬C(¬A + ¬B)`
When these conditions are satisfied, **Out** will connect to High and output 1.
Conversely, **Out** will connect to Low and output 0.
:::
D02 = ==00101000==
D03 = ==A==
#### 📝 Annotate
:::info
Based on the input sequence, the state transition process is: `A → D → A → B → C → D → A → D → A`
The FSM end state is A.
:::
D04 = ==3==
#### 📝 Annotate
:::info
The propagation delay is determined by the longest path's delay.
(figure omitted: it traces one of the longest paths through the circuit)
:::
D05 = ==9==
> the critical path is
B0 -(1 gate)- P0
P0 -(3 gate)- Cout0
Cout0 -(2 gate)- Cout1
Cout1 -(2 gate)- Cout2
Cout2 -(1 gate)- Σ3
#### 📝 Annotate
:::info

:::
### Problem E
E01 = ==42==
> Critical path = clk-to-q + longest CL + setup = 30ns for x_input to change (includes clk-to-q) + 9 AND + 3 setup = 42 ns
E02 = ==18==
> Shortest CL: NOT → AND = 5 + 9 = 14ns clk-to-q + shortest CL = 4ns + 14ns = 18ns
E03 = ==102==
> 25 + 50 (clock period) + 4 (clk-to-q) + 14 (OR) + 9 (AND) = 102 ns
#### 📝 Annotate
:::info
The clock period must be long enough for data to stabilize and reach the next register before that register's setup window at the next clock edge. Thus: $\text{Clock Period} \geq \text{x\_input to change } (30\,\text{ns}) + \text{AND } (9\,\text{ns}) + \text{setup } (3\,\text{ns})$
The maximum hold time prevents the current data from being overwritten by new data in the next clock cycle. It ensures that the current data can be correctly read. To ensure this, the time for old data to propagate through the shortest path must be greater than the clk-to-q time of the new data.
Thus, $\text{Max Hold Time} = \text{Shortest CL} + \text{clk-to-q}$.
The wait for the next clock period already covers the delay before x_input changes: x_input changes at 55 ns, before the next clock edge at 75 ns.
Thus, the actual delay affecting the output happens after the next clock edge.
Timeline:
```
25ns: First rising edge.
55ns: x_input changes.
75ns: Next rising edge (25 + 50).
79ns: clk-to-q delay (75 + 4).
93ns: OR gate delay (79 + 14).
102ns: AND gate delay (93 + 9).
```
:::
### Problem F
F01 = ==4==
> We have 9 instructions, so we need 4 bits to represent them.
F02 = ==16==
> Recall that, in RISC-V, the immediates encoded in instructions work in units of half-instructions, so jumping up to 64 KiB in either direction is the same as saying we want to jump up to $2^{15}$ half instructions in both directions. Since immediates are signed, we need 16 bits to represent this range of values.
#### 📝 Annotate
:::info
F01: $\lceil{\log_2{9}}\rceil = 4$
F02: Because the least significant bit (LSB) of the offset is implicitly 0, each encoded value is a multiple of 2, so the range that must be represented is ±32,768 half-instructions (65,536 / 2).
Calculating the number of bits needed:
$32{,}768 = 2^{15}$
This magnitude requires 15 bits; adding 1 sign bit gives a total of 16 bits.
:::
## Quiz 4
### Problem A
A01 = `blt x0, a0, recurse_case`
A02 = `jalr x0, ra, 0`
A03 = `addi sp, sp, -12`
A04 = `sw ra, 8(sp)`
A05 = `jal ra, square`
A06 = `lw ra, 8(sp)`
A07 = `addi sp, sp, 12`
A08 = `addi sp, sp, -4`
A09 = `jal ra, sum_of_squares`
A10 = `addi sp, sp, 4`
A11 = `jalr x0, ra, 0`
#### 📝 Annotate
:::info
A01: if n (`a0`) is greater than zero (`blt x0, a0`), branch to recurse_case; otherwise fall through to zero_case.
A02: return -> `jr ra`(pseudo-instructions) = `jalr x0, ra, 0`
A03: Save 3 reg, $3\times4 = 12$.
A04: save the return address (`ra`) onto the stack.
A05: Call square and store `pc` to `ra`.
A06: restore the return address from the stack into `ra`.
A07: 3 reg, $3\times4 = 12$.
A08: Save 1 reg, $1\times4 = 4$.
A09: Call sum_of_squares and store `pc` to `ra`.
A10: 1 reg, $1\times4 = 4$.
A11: `jr ra`(pseudo-instructions) = `jalr x0, ra, 0`
:::
From the perspective of the callee, is it necessary to save any registers in the prologue and epilogue of the function? If so, which registers need to be saved, and where should the prologue and epilogue be placed? If not, provide a brief explanation as to why. __ A12 __
A12 = (the answer must contain "No"/"Not"; responses with a similar meaning receive credit)
> No, we do not need to consider callee-saved registers because we are not using any in this function. However, since the function makes two function calls, it may be necessary to save the `ra` register in the prologue and restore it in the epilogue, right before the `jalr x0, ra`, 0 instruction, just before the zero_case label.
A13 = `sw ra, 0(sp)`
A14 = `and t3, t2, x1`
:::warning
`x1` is `ra`, the return-address register, so the listed answer `and t3, t2, x1` looks odd; `andi t3, t2, 1` appears to be the intended instruction.
:::
A15 = `slli t1, t1, 1`
A16 = `srli t2, t2, 1`
A17 = `bne t2, x0, square_loop`
A18 = `lw ra, 0(sp)`
A19 = `jalr x0, ra, 0`
#### 📝 Annotate
:::info
A13: return address save in `ra`
A14: To check whether the least significant bit (LSB) of `t2` is 1, AND `t2` with 1; the result (the LSB) is stored in `t3`.
A15: Shifting `t1` left by 1 bit is equivalent to multiplying by 2
A16: Shifting `t2` right by 1 bit is equivalent to dividing by 2.
A17: If `t2` is not zero, repeat square_loop; otherwise, continue to the next instruction.
A18: Restore return address
A19: `jr ra`(pseudo-instructions) = `jalr x0, ra, 0`
:::
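A C sketch of the overall call structure the assembly implements; `square()` here simply stands in for the shift-and-add loop annotated above (A13~A19):
```c
#include <stdio.h>

static int square(int n) { return n * n; }   /* placeholder for the shift-and-add loop */

static int sum_of_squares(int n)
{
    if (n <= 0)                               /* zero_case */
        return 0;
    return square(n) + sum_of_squares(n - 1); /* two calls, so ra must be saved */
}

int main(void)
{
    printf("%d\n", sum_of_squares(3));        /* 1 + 4 + 9 = 14 */
    return 0;
}
```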
### Problem B
B01 = 14
B02 = 25
B03 = 40M
> The maximum allowable hold time for RegC is determined by the time it takes for the input of RegC to change. This is calculated as the clk-to-q delay of RegA or RegB plus the shortest combinational logic delay:
$$
4 \, \text{ns} + (5 \, \text{ns} + 5 \, \text{ns}) = 14 \, \text{ns}
$$
The minimum clock cycle time must account for the clk-to-q delay, the longest combinational logic delay, and the setup time of RegC:
$$
4 \, \text{ns} + (5 \, \text{ns} + 5 \, \text{ns} + 5 \, \text{ns}) + 6 \, \text{ns} = 25 \, \text{ns}
$$
A clock cycle time of 25 ns corresponds to a clock frequency of:
$$
\frac{1}{25 \times 10^{-9} \, \text{s}} = 40 \, \text{MHz}
$$
#### 📝 Annotate
:::info
B01: shortest combinational logic delay

B02: One of the critical paths

B03: $$\text{clock frequency} = \frac{1}{\text{clock cycle time}}$$
:::
### Problem C
C01 = `~(k - 1)` (or an equivalent form)
> `k & ~(k - 1)`: isolates the least significant set bit of k, which is the largest power-of-2 divisor of k.
C02 = (answers with a similar meaning receive credit; they must mention "power of 2")
> Since the division operation is performed by a power of 2, the compiler optimizes it into a bit shift. This results in an efficient O(b) solution for finding the k-th element in a left-complete, array-based binary tree.
#### 📝 Annotate
:::info
The purpose of this function is to find the position of the K-th smallest element in a subtree of height H, rooted at P, after sorting the elements in ascending order.
Given `p = 3`, `h = 2`, `k = 3`, let's break it down step by step:
1. `p << h` = `3 << 2`: calculates the starting position of the layer (the leftmost node position).
```
Binary representation of 3 : 11
Left-shifting by 2 bits : 1100 (=12)
```
2. `k >> 1` = `3 >> 1`: locates the element's offset within that layer.
```
Binary representation of 3 : 11
Right-shifting by 1 bit: 1
```
3. The or operation `|` merges location information. At this point, the position of a leaf node of the subtree will be pointed out.
4. `k & ~(k-1)` This step is used to find the position of the rightmost 1 in k.
```
k=3: 0011
k-1=2: 0010
~(k-1): 1101
k & ~(k-1): 0001 (=1)
```
5. The final division is used to determine the final position:
Starting from the leaf position found in step 3, dividing by `k & ~(k-1)` (a power of two, $2^j$) moves the position up $j$ levels.
```
k_child(3, 2, 3) = (((3 << 2) | (3 >> 1)) / (3 & ~(2)))
= ((12 | 1) / 1)
= 13
```
For the subtree at position 3 and height 2, the 3rd element (in ascending order) is at position 13.
:::
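Putting the steps together, a hypothetical reconstruction of the function (its name and signature are assumptions based on the worked trace above) would be:
```c
#include <assert.h>

/* Position of the k-th element (in sorted order) of the subtree of height h
 * rooted at array index p, in a left-complete array-based binary tree. */
static unsigned k_child(unsigned p, unsigned h, unsigned k)
{
    return ((p << h) | (k >> 1)) / (k & ~(k - 1));
}

int main(void)
{
    assert(k_child(3, 2, 3) == 13);   /* matches the step-by-step trace */
    return 0;
}
```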
### Problem D
D01 = $-2^{18}$
D02 = $2^{18} - 1$
> The immediate field for the `jal` instruction is 20 bits, while for the `jalr` instruction, it is only 12 bits. This means `jal` can cover a wider range of addresses. Similar to the previous addressing mode, the 20-bit immediate value is multiplied by 2 and added to the program counter (PC) to determine the target address. Since the immediate value is signed, it spans a range of $[-2^{20}, 2^{20} - 2]$ bytes, or $[-2^{19}, 2^{19} - 1]$ 2-byte instructions. However, because we are dealing with 4-byte instructions, the effective range is $[-2^{18}, 2^{18} - 1]$ instructions relative to the current PC.
D03 = `"jalr"`
D04 = `"lbu"`
D05 = `"lhu"`
D06 = `"slli"`
D07 = `"slti"`
D08 = `"sltiu"`
D09 = `"andi"`
D10 = `"ecall"`
D11 = `"csrrs"`
D12 = `"bltu"`
D13 = `"bgeu"`
D14 = `"lui"`
D15 = `"auipc"`
D16 = `"jal"`
D17 = `e.u.i31_12`
D18 = `e.j.i10_1 << 1`
D19 = `e.s.i11_5 << 5`
D20 = `e.b.i4_1 << 1`
#### 📝 Annotate
:::info
Please see the link for RV32 instruction set architecture:
https://cs61c.org/fa24/pdfs/resources/reference-card.pdf
https://five-embeddev.com/riscv-user-isa-manual/Priv-v1.12/instr-table.html
D03~D17: these instruction mnemonics follow directly from the opcode/funct3 tables in the RV32 reference card linked above.
D18: the purpose of the left shift is to recombine the immediate bits, which are scattered across different positions of the instruction word, into a complete offset value.
D19, D20: likewise, `e.s.i11_5 << 5` and `e.b.i4_1 << 1` place the S-type and B-type immediate fragments back at their correct bit positions.
:::
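As an illustration of D18's point, here is a sketch that reassembles the J-type immediate from a raw instruction word using the standard RV32 bit positions (the function and variable names are illustrative):
```c
#include <stdint.h>

static int32_t decode_j_imm(uint32_t inst)
{
    uint32_t imm20    = (inst >> 31) & 0x1;    /* imm[20]    */
    uint32_t imm10_1  = (inst >> 21) & 0x3FF;  /* imm[10:1]  */
    uint32_t imm11    = (inst >> 20) & 0x1;    /* imm[11]    */
    uint32_t imm19_12 = (inst >> 12) & 0xFF;   /* imm[19:12] */
    uint32_t imm = (imm20 << 20) | (imm19_12 << 12) | (imm11 << 11) | (imm10_1 << 1);
    return (int32_t)(imm << 11) >> 11;         /* sign-extend the 21-bit offset */
}

int main(void)
{
    return decode_j_imm(0x0000006F) != 0;      /* jal x0, 0 has a zero offset */
}
```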
### Problem E
E01 = 665
> sw = clk-to-Q + Mem-Read + max(RegFileRead + Mux(ASel), Imm-Gen +
Mux(BSel)) + ALU + DMEM Setup
= 5ns + 300 ns + 60ns + 100ns + 200ns
= 665ns
E02 = 800
> lw = clk-to-Q + Mem-Read + max(RegFileRead + Mux(ASel), Imm-Gen +
Mux(BSel)) + ALU + Mem-Read + Mux(WBSel) + RegFileSetup
= 5ns + 300ns + 60ns + 100ns + 300ns + 15ns + 20ns
= 800ns
E03 = 500
> jal= clk-to-Q + Mem-Read + Imm-Gen + Mux(BSel) + ALU + Mux(PCSel) + PCSetup
= 5ns + 300ns + 45ns + 15ns + 100ns + 15ns + 20ns
= 500ns
E04 = PC + 4 (note: the answers for E04-E06 may appear in any order)
E05 = ALUOut
E06 = MEM
E07 = Inst
E08 = PC
E09 = ALUOut (note: E09 and E10 may be given in either order)
E10 = RegReadData2
E11 = Inst
#### 📝 Annotate
:::info
E01~E03: the critical paths are the ones broken down in the quoted calculations above (the datapath figures highlighting them are omitted here).
E04~E06: `PC + 4`, `ALUOut`, and `MEM` (in any order), the three candidate values from which the write-back data is selected.
E07:
**INST** represents the instruction itself.
The control logic needs to determine the next operation based on the content of the instruction.
Necessity of Passing INST in a Pipeline:
1. **Instruction Awareness in Each Stage:** Every stage must know what instruction is currently being processed.
2. **Accurate Control Signals:** This allows the control logic in the next stage to generate the correct control signals.
3. **Correct Stage Execution:** Ensures that each stage operates correctly based on the specific instruction being processed.
E08~E10: `PC`, `ALUOut`, and `RegReadData2` (the last two in either order); like `Inst`, these values must be latched in the pipeline registers so later stages can use them.
E11: Same as E07.
:::
### Problem F
F01 = 1
F02 = 2
F03 = 1
F04 = MEM
F05 = EX
F06 = WB
F07 = EX
#### 📝 Annotate
:::info
F01~F03: addi `rd` = and `rs1` = sltiu `rs1`

F04~F07: forwarding path

:::
F08 = 2
F09 = 3
F10 = 3
F11 = 4
F12 = 2
F13 = 3
F14 = 3
F15 = 4
F16 = 4
F17 = 3
F18 = 3
F19 = 4
F20 = 2-3-1-4
#### 📝 Annotate
:::info
F08~F09: addi `rd` = lw `rs1`
`addi t0, t0, 4` → `lw t1, 0(t0)` (dependence through `t0`)
F10~F11: lw `rd` = add `rs1`
`lw t1, 0(t0)` → `add t2, t1, x0` (dependence through `t1`)
F12~F13: This hazard can be resolved by forwarding from the MEM stage to the EX stage.

F14~F19: Instructions 3 and 4 form a load-use data hazard. Since `lw` only retrieves its data in the MEM stage but the following `add` needs that data in its EX stage, the pipeline must stall for one cycle.
In the next cycle the data is available in the WB stage and can be forwarded to the EX stage of the `add`, which resolves the hazard while preserving correct execution.

F20:
The original instruction order is `1-2-3-NOP-4`. Since instruction 1 has no dependencies, it can be inserted between instruction 3 and instruction 4, replacing the NOP.
The new instruction order becomes `2-3-1-4`.
:::
F21 = 4
> There are four hazards: a data hazard from `t1` between instructions 1 and 2, a data hazard from `s0` between instructions 2 and 3, another data hazard from `s0` between instructions 2 and 4, and a control hazard between instructions 4 and 5.
F22 = 2
> Assuming the RegFile allows simultaneous read and write operations in the same cycle, two stalls are needed between instructions 1 and 2 (WB → ID), and another two stalls are needed between instructions 2 and 3 (WB → ID).
### Problem G
G01 = `reg.io.q`
G02 = `Mux(count === 9.U, 0.U, result)`
G03 = `reg.io.d := next`
G04 = `val y = Output(UInt (16.W))`
G05 = `io.y := io.a & io.b`
:::warning
G01:`val count = reg.io.q`?
G02:`val next = Mux(count === 9.U, 0.U, result)`?
G05:`is (3.U) { io.y := io.a & io.b }`?
:::
#### 📝 Annotate
:::info
G01: Connects the output of the register to the corresponding line for count.
G02: When `count` equals 9, select 0 as the output.
Otherwise, select `result` (the output of the adder) as the output.
This implements the functionality of resetting the counter to 0 after reaching 9.
G03: Connects the output of the Mux(`next`) to the input of the register.
G04: The form of declaring the output y.
G05: Implement the remaining functionality and(&).
:::
### Problem H
H01 = `srli t2, a0, 31`
H02 = `srli t2, a1, 31`
H03 = `bgeu a2, t0, overflow_xe_gte_ye`
H04 = `srli a4, a4, 9`
H05 = `slli t2, t2, 31`
H06 = `0x7F800000`
H07 = `slli t2, t2, 31`
#### 📝 Annotate
:::info
H01、H02: The sign bit is stored in the most significant bit, so performing a logical right shift of 31 bits will retrieve the sign bit.
H03: if `a2` ≥ `t0` (255), jump to `overflow_xe_gte_ye`; `bgeu` performs an unsigned greater-or-equal comparison.
H04: In the previous operation, the significand was left-shifted to the highest bit. After completing the computation, it needs to be right-shifted by 9 bits to align properly.
H05: The sign is in the least significant bit. Left-shifting it by 31 bits moves it back to its correct position.
H06: `0x7F800000` is the FP32 encoding of infinity (all-ones exponent, zero mantissa).
H07: Same as H05
:::
H08 = `sll a1, a1, t1`
H09 = `slli a1, a1, 10`
H10 = `slli a0, a0, 23`
#### 📝 Annotate
:::info
H08: `t1 = -a0 = -(leading zeros)`.
`a1 << (32 - leading_zeros)` is equivalent to `a1 << (-leading_zeros)` here.
**Why a negative shift amount works:** the RISC-V `sll` instruction only uses the low 5 bits of the shift-amount register, so a shift by `-n` is taken modulo 32, i.e., it behaves exactly like a shift by `32 - n`.
H09: The previous instruction shifts right by 9 bits.
Adding the step to remove the hidden '1' results in a total left shift of 10 bits.
H10: Left-shift the exponent by 23 bits.
This aligns it to the position where the exponent resides in the IEEE 754 format.
:::
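A small check of the wrap-around behaviour H08 relies on; the helper emulates the `shamt = rs2[4:0]` rule of RV32 `sll`, and the sample value is arbitrary.
```c
#include <assert.h>
#include <stdint.h>

static uint32_t sll_rv32(uint32_t value, int32_t amount)
{
    return value << ((uint32_t)amount & 31);   /* only the low 5 bits count */
}

int main(void)
{
    uint32_t a1 = 0x00001234;
    int leading_zeros = 19;                    /* __builtin_clz(0x1234) */
    assert(sll_rv32(a1, -leading_zeros) == a1 << (32 - leading_zeros));
    return 0;
}
```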
## Quiz 5
### Problem A
A01
> ==40%==
Since the L1 cache has a local miss rate of 50%, it means L2 cache is accessed 50 times out of the 100 total memory accesses. Given that L2 cache experienced 20 misses, the local miss rate of L2 cache is calculated as:$$\text{Local Miss Rate of L2} = \frac{20 \ \text{misses}}{50 \ \text{accesses}} \times 100 = 40\%$$
A02
> ==25%==
The number of accesses to the L2 cache equals the number of misses in the L1 cache. To find the local miss rate of the L2 cache, we divide its global miss rate by the miss rate of the L1 cache:
$$\text{L2 Local Miss Rate} = \frac{\text{Misses in L2}}{\text{Accesses in L2}} = \frac{\frac{\text{Misses in L2}}{\text{Total Accesses}}}{\frac{\text{Misses in L1}}{\text{Total Accesses}}} = \frac{\text{Global Miss Rate of L2}}{\text{Miss Rate of L1}}$$
Substituting the values:
$$\text{L2 Local Miss Rate} = \frac{5\%}{20\%} = 0.25 = 25\%$$
A03
> ==30==
Let ( H ) represent the hit time of the L3 cache. By extending the AMAT equation, where the miss penalty of the L2 cache becomes the local AMAT of the L3 cache, we can express the system's AMAT as:
$$AMAT = 2 + 0.2 \times \left( 15 + 0.25 \times \left( H + 0.3 \times 100 \right) \right)$$
To ensure the AMAT does not exceed 8 cycles, we solve the inequality:
$$2 + 0.2 \times \left( 15 + 0.25 \times \left( H + 30 \right) \right) \leq 8$$
Simplifying this yields:
$$H \leq 30$$
Thus, the maximum hit time for the L3 cache is 30 cycles.
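The nested AMAT expression can be checked with a few lines of C using the numbers quoted above (the helper name is illustrative):
```c
#include <stdio.h>

/* AMAT of a three-level hierarchy: hit times h1..h3, local miss rates m1..m3,
 * memory access time mem. */
static double amat3(double h1, double m1, double h2, double m2,
                    double h3, double m3, double mem)
{
    return h1 + m1 * (h2 + m2 * (h3 + m3 * mem));
}

int main(void)
{
    /* With H = 30 the AMAT is exactly 8 cycles. */
    printf("AMAT = %.2f cycles\n", amat3(2, 0.2, 15, 0.25, 30, 0.3, 100));
    return 0;
}
```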
A04~A06
> ==6==, ==4==, ==10==
The breakdown of the Tag:Index:Offset (T/I/O) is calculated as follows:
* Offset = $\log_2(1 \, \text{KiB}) = \log_2(2^{10}) = 10$ bits
* Index = $\log_2\left(\frac{16 \, \text{KiB}}{1 \, \text{KiB}}\right) = \log_2(16) = 4$ bits
* Tag = Total address bits - Index bits - Offset bits$$\text{Tag} = 20 - 4 - 10 = 6 \, \text{bits}$$
A07
> ==50%==
Since integer accesses are spaced $4 \times 128 = 512$ bytes apart, each cache block holds two accesses. The first access to each block results in a compulsory miss, while the second access hits because both `A[i]` and `A[i + 128]` reside in the same cache block. This results in a cache hit rate of 50%.
A08
> ==50%==
The size of array A is $8192 \times 4 = 2^{15}$ bytes, which is exactly twice the size of the cache. After executing Line 7, the second half of `A` resides in the cache, but Line 11 starts processing from the first half of `A`, meaning none of the cached data from Line 7 can be reused. Since memory accesses in Line 11 follow the same pattern as Line 7, the cache hit rate remains the same at 50%. The variable `total` is likely kept in a register by the compiler, so it does not impact the cache hit calculation.
> A09 = ? A10 = ? A11 = ? A12 = ?
==1100==, ==1==, ==000==, ==Miss== or ==M==
> A13 = ? A14 = ? A15 = ? A16 = ?
==1101==, ==1==, ==101==, ==Replace== or ==R==
> A17 = ? A18 = ? A19 = ? A20 = ?
==0011==, ==0==, ==000==, ==Replace== or ==R==
A21
> ==27.3%== or ==$\frac{3}{11}$==
There are 3 hits out of 11 total accesses, resulting in a hit rate of approximately 27.3%.
#### 📝 Annotate
:::info
:::
### Problem B
B01
> ==4==
There are four hazards present:
A data hazard between instructions 1 and 2 due to `t1`.
A data hazard between instructions 2 and 3 caused by `s0`.
A data hazard between instructions 2 and 4 also involving `s0`.
A control hazard between instructions 4 and 5 caused by the branch instruction.
B02~B11
> B02 = ?
==2==
B03 = ?
==1==
B04 = ?
==2==
B05 = ?
==2==
B06 = ?
==2==
B07 = ?
==3==
B08 = ?
==2==
B09 = ?
==4==
B10 = ?
==no== (or zero)
B11 = ?
==3==
| Instruction | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
| -------- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1.`sub t1, s0, s1` | IF | ID | EX | MEM| WB | | | | |
| 2.`or -> nop` | | IF | ID | | | | | | |
| 2.`or s0, t0, t1` | | | IF | ID | EX | MEM| WB | | |
| 3.`sw s1, 100(s0)` | | | | IF | ID | EX | MEM| WB | |
| 4.`bgeu s0, s2, loop`| | | | | IF | ID | EX | MEM| WB |
| 5.`add t2, x0, x0` | | | | | | IF | ID | EX | MEM|
#### 📝 Annotate
:::info
:::
### Problem C
#### 📝 Annotate
:::info
C01: Using Branch Comp to compare the `rs1` and `rs2` of the `beq` instruction. If they are equal, Branch Comp sets `BrEq` to 1; otherwise, it sets to 0.
C02: The `jal` instruction does not compare rs1 and rs2; therefore, `BrLt` is a "*(don't care)" value.
C03: Using Branch Comp to compare the `rs1` and `rs2` of the `beq` instruction. If `rs1` less than `rs2`, Branch Comp sets `BrLt` to 1; otherwise, it sets to 0.
C04~C08: Only **branch** or **jump** instructions cause a jump.
* For other instructions, set `PCSel` to 0.
* For jump instructions, set `PCSel` to 1.
* For branch instructions, determine `PCSel` based on `BrEq` and `BrLt`, setting it to 0 or 1 accordingly.
C09~C10: `BrUn` means **Branch Unsigned**: 1 selects an unsigned comparison, and 0 a signed comparison.
* For the `beq` instruction, the result is the same regardless of signed or unsigned comparison, so `BrUn` is "*(don't care)"
* For **branch less than**, signed or unsigned comparison matters:
* The `blt` instruction requires signed comparison, so `BrUn` should be set to 0.
* The `bltu` instruction requires unsigned comparison, so BrUn should be set to 1.
C11~C15:
* `lw` (load word): Loads data from memory into register, so set `RegWEn` to 1.
* `sw` (store word): Stores data from a register into memory but does not write to a register, so set `RegWEn` to 0.
* Branch instructions: Do not write to any register, so set `RegWEn` to 0.
* `jal` (Jump and Link): The instruction format is `jal rd, label`; it performs `rd = PC + 4` and `PC = PC + imm`. Since it writes `PC + 4` to register `rd`, set `RegWEn` to 1.
C16: `WBSel` (Write Back Select) determines the source of the data to be written into the register:
* 0: Data comes from memory.
* 1: Data comes from the result of ALU computation.
* 2: Data comes from PC + 4
For the `jal` instruction, which writes `PC + 4` into `rd`, `WBSel` is set to 2.
:::
### Problem D
D01 = ?
> `sll s3, s3, s2`
D02 = ?
> `and s3, a1, s3`
D03 = ?
> `sll s5, s5, s4`
D04 = ?
> `or s1, s1, s5`
D05 = ?
> `addi s0, s0, 1`
================
D06 = ?
> `sub t1, t1, t0`
#### 📝 Annotate
:::info
https://hackmd.io/@vx6122035/2024-quiz7-problem-d
:::
### Problem E
E01 = ? E02 = ? E03 = ?
> ==0x13==, ==0x17==, ==1==
E04 = ? E05 = ? E06 = ?
> ==0x10==, ==0x13==, ==1==
E07 = ? E08 = ? E09 = ?
> ==0x20==, ==0x12==, ==1==
E10 = ? E11 = ? E12 = ?
> ==0x23==, ==0x18==, ==1==
E13 = ? E14 = ? E15 = ?
> ==0x11==, ==0x14==, ==0==
E16 = ? E17 = ? E18 = ?
> ==0xac==, ==0x15==, ==1==
E19 = ? E20 = ? E21 = ?
> ==0x34==, ==0x19==, ==1==
> 1. Access `0x11F0` (Read): Hit, Physical Address: `0x14F0`; Updated LRUs: `1, 7, 2, 5, 7, 0, 3, 4`
Corresponding Virtual Page: `0x11`
> 2. Access `0x1301` (Write): Miss, Map VPN `0x13` to PPN `0x17`, set valid and dirty bits.
Physical Address: `0x1701`; Updated LRUs: `2, 0, 3, 6, 7, 1, 4, 5`
> 3. Access `0x20AE` (Write): Hit, set dirty bit.
Physical Address: `0x12AE`; Updated LRUs: `3, 1, 4, 0, 7, 2, 5, 6`
> 4. Access `0x2332` (Write): Miss, Map VPN `0x23` to PPN `0x18`, set valid and dirty bits.
Physical Address: `0x1832`; Updated LRUs: `4, 2, 5, 1, 0, 3, 6, 7`
> 5. Access `0x20FF` (Read): Hit, Physical Address: `0x12FF`; Updated LRUs: `4, 2, 5, 0, 1, 3, 6, 7`
> 6. Access `0x3415` (Write): Miss (Replacement Needed). Replace the last entry, Map VPN `0x34` to PPN `0x19`, set dirty bit.
Physical Address: `0x1915`; Updated LRUs: `5, 3, 6, 1, 2, 4, 7, 0`
#### 📝 Annotate
:::info
From the statement:
**16-bit addresses**: This indicates the total virtual address space is 16 bits.
**256-byte pages**: Since a page size is 256 bytes, the Page Offset requires $\log_2(256) = 8$ bits.
The remaining bits 16 - 8 = 8 correspond to the Virtual Page Number (VPN).
:::
### Problem F
F01 = ?
> ==4==
With a memory size of 4 KiB divided into 64 B chunks, there are 64 chunks in total, so addresses from 0 to 63 must be representable; 4 exponent bits are enough for this. The highest exponent value before applying the standard bias (excluding the pattern reserved for infinity/NaN) is $2^4 - 2 = 14$.
Applying the standard bias of $-7$ results in a maximum exponent of 7, and $2^7 = 128$ is sufficient to cover values up to 63.
> Conversely, with only 3 exponent bits, the maximum exponent after applying a bias of $-3$ is 3, and $2^3 = 8$ is insufficient to reach 63.
F02 = ?
> ==11==
The largest memory address to represent is $0b11\_1111.111111$, corresponding to $63 + \frac{63}{64}$. To encode this value we need 11 mantissa bits, giving the representation $0b1.11111111111 \times 2^5$. Thus, 11 bits of mantissa are necessary.
#### 📝 Annotate
:::info
F01: Focusing only on the **exponent** part:
* 4 bits:
The maximum encodable value is `0b1110` (since `0b1111` is reserved for INF/NaN).
Subtracting the bias (`0b0111`) gives 7, so values up to $2^7 = 128$ can be reached, which is greater than 63. Therefore, 4 bits are sufficient to represent all chunks.
* 3 bits:
The maximum encodable value is `0b110`.
Subtracting the bias (`0b011`) gives 3, so only values up to $2^3 = 8$ can be reached, which is less than 63. Hence, 3 bits are not sufficient.
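A quick numeric sanity check of both counts; this is just a throwaway C restatement of the arithmetic above, not part of the quiz.
```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Exponent check: 4 stored bits reach 0b1110 = 14; subtracting the bias 7
     * leaves a largest unbiased exponent of 7, and 2^7 = 128 covers chunk 63.
     * With 3 bits the reach is only 2^(6-3) = 8. */
    printf("4-bit exponent reach: %.0f\n", pow(2, 14 - 7)); /* 128 */
    printf("3-bit exponent reach: %.0f\n", pow(2, 6 - 3));  /*   8 */

    /* Mantissa check: the largest address, 63 + 63/64, equals 4095/64, i.e.
     * 1.11111111111_2 x 2^5 -- 12 significant bits, hence 11 explicit
     * mantissa bits once the leading 1 is made implicit. */
    double v = 63.0 + 63.0 / 64.0;
    printf("significant bits: %.0f\n", log2(v * 64.0 + 1.0)); /* 12 */
    return 0;
}
```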
:::
### Problem G
G01 = ?
> ==maskeq_epi32(tmp, curr)==
G02 = ?
> ==firstv_epi32(tmp)==
G03 = ?
> ==arr[i] > running_max== (or equivalent)
#### 📝 Annotate
:::info
In the first for loop, the steps are as follows:
1. Load Elements in Groups of Four:
* Four elements are loaded into the vector `tmp`.
* The maximum value in `tmp` is loaded into `curr`.
2. Compare `curr` with `running_max`:
* If `curr` is greater than `running_max`, proceed to the subsequent steps.
3. Mask Elements Equal to `curr`:
* Create a mask for elements in `tmp` that are equal to `curr`.
* Example:
```
maskeq_epi32([2, 4, 8, 6], 8) → [0, 0, 0xFFFFFFFF, 0]
```
* This isolates the elements equal to the current maximum.
4. Find the Index:
* Use `index = i + firstv_epi32(tmp)` to find the first lane of the mask whose value is `0xFFFFFFFF`.
* Since `0xFFFFFFFF` has its least-significant bit set to 1, `firstv_epi32` locates that lane, so the expression yields the correct index.
5. Update `running_max` and handle the remaining elements:
In the second loop, the leftover elements (fewer than 4) are handled with a normal element-by-element comparison.
Reason for This Design:
SIMD (Single Instruction, Multiple Data) operations work on fixed-size groups of elements, which in this case is 4. If the total number of elements is not divisible by 4, the leftover elements cannot be processed using SIMD instructions.
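For reference, the same idea in plain scalar C. This is only a sketch: `VLEN`, the `argmax` name, and the inner loops stand in for the quiz's vector intrinsics (`maskeq_epi32`, `firstv_epi32`) and are not the quiz's code.
```c
#include <stddef.h>
#include <stdint.h>

#define VLEN 4 /* elements processed per "vector" step */

/* Scalar rendition of the algorithm described above: find the index of the
 * maximum element, processing VLEN elements per iteration and handling the
 * leftover tail one element at a time. Assumes n >= 1. */
static size_t argmax(const int32_t *arr, size_t n)
{
    int32_t running_max = arr[0];
    size_t index = 0;

    size_t i = 0;
    for (; i + VLEN <= n; i += VLEN) {
        /* Reduce the group to its maximum (stands in for loading tmp/curr). */
        int32_t curr = arr[i];
        for (size_t k = 1; k < VLEN; k++)
            if (arr[i + k] > curr)
                curr = arr[i + k];

        if (curr > running_max) {
            running_max = curr;
            /* Equivalent of maskeq_epi32 + firstv_epi32: the first lane
             * equal to the group maximum. */
            for (size_t k = 0; k < VLEN; k++) {
                if (arr[i + k] == curr) {
                    index = i + k;
                    break;
                }
            }
        }
    }

    /* Tail loop: fewer than VLEN elements remain, handled one by one. */
    for (; i < n; i++) {
        if (arr[i] > running_max) {
            running_max = arr[i];
            index = i;
        }
    }
    return index;
}
```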
:::
### Problem H
H01 = ?
> ==4==
There are 9 instructions in total, so we need 4 bits to represent them, as $\lceil \log_2(9) \rceil = 4$
H02 = ?
> ==16==
> Jumping 64 KiB in either direction means reaching $\pm 2^{16}$ bytes. Since the offset is counted in half-instructions (2 bytes), that is $\pm 2^{15}$ half-instructions; 15 bits cover the magnitude and one more bit is needed for the sign, so the immediate field is 16 bits (covering $-2^{15}$ to $2^{15} - 1$).
H03 = ?
> ==128==
Total bits used by immediates and the opcode:$$2 \times 11 + 7 = 29 \text{ bits}$$
Remaining bits for registers:$$64 - 29 = 35 \text{ bits}$$
There are 5 registers (`rs1`, `rs2`, `rs3`, `rd1`, `rd2`). Dividing the remaining bits equally among the registers:$$\frac{35}{5} = 7 \text{ bits per register}$$
This allows $2^7 = 128$ registers.
#### 📝 Annotate
:::info
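A trivial C restatement of the H03 bit budget, using the field widths assumed in the explanation above; the variable names are inventions of this note.
```c
#include <stdio.h>

int main(void)
{
    const int word = 64;                    /* instruction width in bits */
    const int opcode = 7, imm_bits = 11, imms = 2, regs = 5;

    const int remaining = word - (imms * imm_bits + opcode); /* 64 - 29 = 35 */
    const int bits_per_reg = remaining / regs;               /* 35 / 5 = 7  */

    printf("bits per register field = %d\n", bits_per_reg);
    printf("addressable registers   = %d\n", 1 << bits_per_reg); /* 128 */
    return 0;
}
```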
:::
### Problem I
I01 = ?
> ==4096== (Compulsory Misses for Loop A)
I02 = ?
> ==0== (No Conflict or Capacity Misses for Loop A)
I03 = ?
> ==4096== (Total Misses for Loop A)
I04 = ?
> ==4096== (Compulsory Misses for Loop B)
I05 = ?
> ==28672== (Conflict and Capacity Misses for Loop B)
I06 = ?
> ==32768== (Total Misses for Loop B)
#### 📝 Annotate
:::info
1. First, calculate some key parameters
* Cache size = 32 KiB = $2^{15}$ bytes
* Line size = 8 words = 32 bytes = $2^{5}$ bytes
* Number of lines = $2^{15}/2^{5}$ = $2^{10}$ lines
* Sets = $2^{10}/4$ = $2^8$ = 256 sets (4-way set-associative)
2. Matrix dimensions:
* i goes from 0 to 255 ($2^8$ values), j goes from 0 to 63 ($2^6$ values)
* Each element is 4 bytes
* Row size = 64 × 4 = 256 bytes = 8 cache lines
For Loop A (i,j):
* Compulsory misses (I01): Each cache line is loaded once on its first access
* Each row needs 8 cache lines
* 256 rows × 8 lines × 2 matrices = 4,096
* I01 = 4,096 misses
* Conflict/Capacity misses (I02):
* Row-major access means good spatial locality
* Inner loop accesses consecutive elements
* 4-way set associative with FIFO can hold multiple rows
* No additional misses, I02 = 0
* Total(I03) = 4096 + 0 = 4096
For Loop B (j,i):
* Compulsory misses (I04): Same as Loop A, I04 = 4096
* Conflict/Capacity misses (I05):
* Column-major access pattern
* Each column element is 256 bytes (8 lines) apart
* $256\;\text{sets} \div 8\;\text{lines} \times 4\;(\text{4-way}) \div 2\;(\text{2 matrices}) = 64$
* After 64 such accesses, the older lines in the cache have been completely overwritten.
* Since $64 < 256$ (the inner-loop trip count), by the time the next element of the same row is loaded, its cache line has already been evicted (conflict miss).
* Each access therefore results in a miss, so $I05 = I06\;(\text{total misses}) - I04\;(\text{compulsory misses})$
* Total(I06) = $2^8 \times 2^6 \times 2 = 32768$
* I05 = 32768 - 4096 = 28672
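The same bookkeeping as a tiny C calculation. It only redoes the arithmetic above (it is not a cache simulator), and the variable names are mine.
```c
#include <stdio.h>

int main(void)
{
    /* L1 parameters from the problem statement. */
    const int cache_bytes = 32 * 1024, line_bytes = 32, ways = 4;
    const int lines = cache_bytes / line_bytes;      /* 1024 */
    const int sets = lines / ways;                   /* 256  */

    /* Matrices: 256 x 64 ints, two of them; one row = 256 bytes = 8 lines. */
    const int rows = 256, cols = 64, matrices = 2;
    const int lines_per_row = cols * 4 / line_bytes; /* 8 */

    const int accesses = rows * cols * matrices;            /* 32768 */
    const int compulsory = rows * lines_per_row * matrices; /* 4096  */

    printf("sets = %d\n", sets);
    printf("Loop A: I01 = %d, I02 = 0, I03 = %d\n", compulsory, compulsory);
    /* Loop B: every access misses, so the extra misses are total - compulsory. */
    printf("Loop B: I04 = %d, I05 = %d, I06 = %d\n",
           compulsory, accesses - compulsory, accesses);
    return 0;
}
```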
:::
I07 = ?
> ==4096== (L2 Cache Misses for Loop A)
I08 = ?
> ==4096== (L2 Cache Misses for Loop B)
#### 📝 Annotate
:::info
L2 cache:
* Cache size = 256 KiB = $2^{18}$ bytes
* Line size = 8 words = 32 bytes = $2^{5}$ bytes
* Number of lines = $2^{18}/2^{5}$ = $2^{13}$ lines
* Sets = $2^{13}/8$ = $2^{10}$ = 1024 sets (8-way set-associative)
I07:
* Compulsory misses: each cache line is loaded once on its first access -> 4096 misses
* Conflict and capacity misses: good spatial locality -> no additional misses
* Total = 4096 misses
I08:
* Compulsory misses: 4096 misses
* Conflict and capacity misses:
* $1024\;\text{sets} \div 8\;\text{lines} \times 8\;(\text{8-way}) \div 2\;(\text{2 matrices}) = 512$
* An 8-way set-associative cache with FIFO replacement can hold all the rows in the working set -> no additional misses
* Total = 4096 misses
:::
I09 = ?
> ==11.75== cycles (AMAT for Loop A)
I10 = ?
> ==22.25== cycles (AMAT for Loop B)
#### 📝 Annotate
:::info
AMAT = $L1\;hit\;time + L1\;miss\;rate \times (L2\;hit\;time + L2\;miss\;rate \times DRAM\;hit\;time)$
I09: $4 + \cfrac{4096}{2^{8}\times2^{6}\times2} \times \left(12 + \cfrac{4096}{4096}\times50\right) = 11.75$
I10: $4 + \cfrac{32768}{2^{8}\times2^{6}\times2} \times \left(12 + \cfrac{4096}{32768}\times50\right) = 22.25$
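The same plug-in as a few lines of C, using the miss counts derived above; the `amat` helper and its parameter names are mine.
```c
#include <stdio.h>

/* AMAT = L1 hit + L1 miss rate * (L2 hit + L2 miss rate * DRAM time) */
static double amat(double l1_hit, double l1_miss_rate,
                   double l2_hit, double l2_miss_rate, double dram)
{
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * dram);
}

int main(void)
{
    const double accesses = 256 * 64 * 2; /* 32768 total accesses */
    /* Loop A: 4096 L1 misses, all of which also miss in L2 (4096/4096). */
    printf("I09 = %.2f\n", amat(4, 4096 / accesses, 12, 4096.0 / 4096.0, 50));
    /* Loop B: 32768 L1 misses, 4096 of which miss in L2. */
    printf("I10 = %.2f\n", amat(4, 32768 / accesses, 12, 4096.0 / 32768.0, 50));
    return 0;
}
```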
:::
I11 = ?
> ==Yes== (Virtual Address Aliasing Possible)
I12 = ?
> ==8== (Minimum Set-Associativity Required)
#### 📝 Annotate
:::info
* Page size = 4 KiB
* Cache size = 32 KiB
Virtual address aliasing problem:
When the bits used to index the cache extend beyond the page offset, two different virtual addresses that map to the same physical address can be placed in different cache sets, causing an aliasing problem. This can happen whenever the cache is larger than a page, unless the associativity is high enough.
Here, 32 KiB > 4 KiB, so aliasing issues may occur.
To avoid the aliasing problem, sufficient set associativity is required.
The associativity must be at least the cache size divided by the page size, so that the index bits stay within the page offset and the same physical address cannot land in two different sets.
32 KiB / 4 KiB = 8, meaning the cache needs to be at least 8-way set associative to prevent aliasing.
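A quick check of that rule in C, under the assumption that aliasing is avoided exactly when the index-plus-offset span (cache size ÷ ways) fits inside one page; the variable names are mine.
```c
#include <stdio.h>

int main(void)
{
    const int cache = 32 * 1024, page = 4 * 1024;
    const int ways = 8; /* candidate associativity */

    /* Bytes spanned by the set index + block offset; aliasing is possible
     * whenever this exceeds the page size. */
    const int index_span = cache / ways;
    printf("index+offset span = %d B vs page = %d B -> %s\n", index_span, page,
           index_span <= page ? "no aliasing" : "aliasing possible");

    printf("minimum associativity = %d ways\n", cache / page); /* 8 */
    return 0;
}
```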
:::