# Annotations and Explanations for Quiz 1 to 5
> 曹為廷
## Quiz 1
### Problem A
You are required to write C programs that convert between different floating-point formats. One of these formats is the half-precision floating-point, also known as `FP16` or `float16`, which is a compact representation designed to save memory. Half-precision numbers are increasingly used in fields like deep learning and image processing. Unlike the single-precision floating-point format (`float32`), defined by IEEE 754, FP16 has fewer bits allocated for the exponent.
```
  ┌ sign
  │
  │  ┌ exponent
  │  │
  │  │      ┌ mantissa
  │  │      │
  │┌─┴─┐┌───┴────┐
0b0000000000000000 IEEE binary16
```
The functions for converting between FP16 and float32 (FP32) are provided below.
```c
static inline float bits_to_fp32(uint32_t w)
{
    union {
        uint32_t as_bits;
        float as_value;
    } fp32 = {.as_bits = w};
    return fp32.as_value;
}

static inline uint32_t fp32_to_bits(float f)
{
    union {
        float as_value;
        uint32_t as_bits;
    } fp32 = {.as_value = f};
    return fp32.as_bits;
}
```
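The union trick above is the common idiom for bit reinterpretation. A minimal alternative sketch using `memcpy` (the function names here are illustrative, not part of the quiz) does the same job and is also well defined C, which is worth keeping in mind when Problem C later casts pointers directly.
```c
#include <stdint.h>
#include <string.h>

/* Same bit reinterpretation as the union helpers, via memcpy. */
static inline uint32_t fp32_to_bits_memcpy(float f)
{
    uint32_t w;
    memcpy(&w, &f, sizeof w);   /* bit-for-bit copy, no aliasing issue */
    return w;
}

static inline float bits_to_fp32_memcpy(uint32_t w)
{
    float f;
    memcpy(&f, &w, sizeof f);
    return f;
}
```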
Consider the `fp16_to_fp32` function, which converts an FP16 value to a single-precision IEEE floating-point (float32) format.
```c
static inline float fp16_to_fp32(fp16_t h)
{
    const uint32_t w = (uint32_t) h << 16;
    const uint32_t sign = w & UINT32_C(0x80000000);
    const uint32_t two_w = w + w;
    const uint32_t exp_offset = UINT32_C(0xE0) << 23;
    const float exp_scale = 0x1.0p-112f;
    const float normalized_value =
        bits_to_fp32((two_w >> 4) + exp_offset) * exp_scale;
    const uint32_t mask = UINT32_C(126) << 23;
    const float magic_bias = 0.5f;
    const float denormalized_value =
        bits_to_fp32((two_w >> 17) | mask) - magic_bias;
    const uint32_t denormalized_cutoff = UINT32_C(1) << 27;
    const uint32_t result =
        sign | (two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value)
                                            : fp32_to_bits(normalized_value));
    return bits_to_fp32(result);
}
```
Next, complete the `fp32_to_fp16` function, which is responsible for converting a float32 value to FP16. The identifiers starting with `#A` indicate the sections where you need to provide the appropriate answers.
```c
static inline fp16_t fp32_to_fp16(float f)
{
    const float scale_to_inf = 0x1.0p+112f;
    const float scale_to_zero = 0x1.0p-110f;
    float base = (fabsf(f) * scale_to_inf) * scale_to_zero;
    const uint32_t w = fp32_to_bits(f);
    const uint32_t shl1_w = w + w;
    const uint32_t sign = w & UINT32_C(0x80000000);
    uint32_t bias = shl1_w & UINT32_C(0xFF000000);
    if (bias < UINT32_C(0x71000000))
        bias = UINT32_C(0x71000000);
    base = bits_to_fp32((bias >> 1) + UINT32_C(0x07800000/*A01*/)) + base;
    const uint32_t bits = fp32_to_bits(base);
    const uint32_t exp_bits = (bits >> 13) & UINT32_C(0x00007C00);
    const uint32_t mantissa_bits = bits & UINT32_C(0x00000FFF/*A02*/);
    const uint32_t nonsign = exp_bits + mantissa_bits;
    return (sign >> 16) |
           (shl1_w > UINT32_C(0xFF000000) ? UINT16_C(0x7E00/*A03*/) : nonsign);
}
```
The values for `#A01`, `#A02`, and `#A03` must be expressed as hexadecimal integer literals in their shortest form. For example, use `0x7C00` rather than `0x007C00`.
A01 = `0x7800000`
A02 = `0xFFF`
:::warning
A02 = `0x7FF`~`0x7FFFFF` also work.
Any mask that keeps at least the rightmost 11 bits while excluding the exponent and sign fields still produces a correct result.
:::
A03 = `0x7E00`
#### 📝 Annotate
:::info
First, use `scale_to_inf` and `scale_to_zero` to detect whether the value `f` is outside the representable range of **FP16**.
For example:
* **A very small number**
After scaling, due to being too small, the bias will be forced to `0x71000000`, which will ultimately be converted to 0 (underflow).
* **A very large number**
After scaling, it will become `INF` (overflow). Finally, through the condition `(shl1_w > UINT32_C(0xFF000000))`, it will return **FP16** infinity `0x7E00`.
Next, let `w` be the 32-bit integer representation of the float `f`, and let `shl1_w` be the value of `w` left-shifted by one bit.
Then, extract the most significant bit of `w` as the `sign`.
The `bias` is obtained by extracting the leftmost 8 bits of `shl1_w`, which corresponds to the exponent part of `f`.
The reason why the minimum value of bias is `0x71000000` is as follows:
In **FP32**, the exponent bias is `127`, while in **FP16**, the exponent bias is `15`. When converting a 32-bit floating-point number to a 16-bit floating-point number, we need to adjust the exponent.
Specifically, we subtract (127 - 15), which is 112, from the **FP32** exponent. This is why the bias is set to no less than `0x71000000` (which equals 113 << 24). This value represents the smallest non-zero exponent that can be represented in **FP16**.
The value `0x07800000` corresponds to the exponent bias of 15 (the **FP16** bias) shifted left by 23 bits.
Thus, the expression `(bias >> 1) + UINT32_C(0x07800000)` can be understood as:
* Adding the FP16 bias (15) to the exponent part of bias.
The operation `base_new = bits_to_fp32((bias >> 1) + UINT32_C(0x07800000)) + base_old` effectively calculates:
* $$base_{new} = 2^{(\text{original exponent} + 15)} + base_{old}$$
According to the calculation above, the exponent of `base_new` is equal to the exponent of `f` + 15.
Then, by the rules of floating-point addition, the mantissa of `base_new` becomes the mantissa of `f` right-shifted by (15 - 2) bits, with rounding.
(The shift is 15 - 2 rather than 15 because the scaling made `base` four times larger than `f`, so its exponent is already 2 higher.)

`exp_bits` keeps bits 27-23 of `base_new` (the low five bits of its exponent field) and shifts them right by 13 so they land in the FP16 exponent position, bits 14-10.
Next, `mantissa_bits` keeps the low bits of `base_new`: bits 9-0 hold the FP16 mantissa, and bit 10 is the leading 1 of `base_new`'s mantissa field.
That extra bit is kept on purpose: when `exp_bits + mantissa_bits` is computed, bit 10 carries into the FP16 exponent field and supplies the missing +1 of the exponent, so `nonsign` becomes the correct FP16 encoding.

Finally, perform a bitwise OR operation between the `sign` and `nonsign` to obtain the final FP16 representation.
:::
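To sanity-check the completed `fp32_to_fp16`, a small round-trip sketch can be used. It assumes the two conversion routines above (plus `fabsf` and the bit helpers they rely on) are in the same translation unit and that `fp16_t` is a plain `uint16_t`; Inf/NaN patterns are skipped because NaN payloads need not round-trip bit-exactly.
```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: fp16_t is the raw 16-bit pattern, and fp16_to_fp32() /
 * fp32_to_fp16() from this problem are available here. */
int main(void)
{
    for (uint32_t h = 0; h < 0x10000; h++) {
        if (((h >> 10) & 0x1F) == 0x1F)    /* skip Inf/NaN encodings */
            continue;
        uint16_t back = fp32_to_fp16(fp16_to_fp32((fp16_t) h));
        if (back != (uint16_t) h)
            printf("mismatch: 0x%04x -> 0x%04x\n", h, back);
    }
    return 0;
}
```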
### Problem B
The bfloat16 format is a 16-bit floating-point representation, designed to provide a wide dynamic range by using a floating radix point. It is a shortened version of the 32-bit IEEE 754 single-precision format (binary32), aimed at accelerating machine learning.
The structure of the bfloat16 floating-point format is as follows.
```
  ┌ sign
  │
  │   ┌ exponent
  │   │
  │   │       ┌ mantissa
  │   │       │
  │┌──┴───┐┌──┴──┐
0b0000000000000000 bfloat16
```
Since bfloat16 (bf16) shares the same number of exponent bits as a 32-bit float, encoding and decoding values is relatively simple.
```c
static inline float bf16_to_fp32(bf16_t h)
{
    union {
        float f;
        uint32_t i;
    } u = {.i = (uint32_t)h.bits << 16};
    return u.f;
}
```
You are required to implement the `fp32_to_bf16` function, which converts a float32 to bfloat16. This conversion should be binary identical to the BFloat16 format. Floats must round to the nearest even value, and NaNs should be quiet. Subnormal values should not be flushed to zero, unless explicitly needed.
```c
static inline bf16_t fp32_to_bf16(float s)
{
    bf16_t h;
    union {
        float f;
        uint32_t i;
    } u = {.f = s};
    if ((u.i & 0x7fffffff) > 0x7f800000/*B01*/) { /* NaN */
        h.bits = (u.i >> 16) | 64; /* force to quiet */
        return h;
    }
    h.bits = (u.i + (0x7fff/*B02*/ + ((u.i >> 0x10/*B03*/) & 1))) >> 0x10/*B04*/;
    return h;
}
```
B01 = `0x7f800000`
B02 = `0x7fff`
B03 = `0x10`
:::warning
Why is it 16 instead of 15?
Bit 16 is the least significant bit that survives in the bf16 result. Adding `0x7fff` alone would always round an exact half-way value down; also adding bit 16 pushes the tie up only when that surviving LSB is 1, so ties always land on an even result, which is round-to-nearest-even.
:::
B04 = `0x10`
#### 📝 Annotate
:::info
In FP32, the bit pattern `0x7f800000` encodes infinity (all-ones exponent, zero mantissa).
If `u.i & 0x7fffffff` exceeds `0x7f800000`, the exponent of `s` is `11111111` and the mantissa is non-zero, i.e., the value is a NaN. In this case it is forced to a **quiet NaN (qNaN)** and returned.
(A quiet NaN has the most significant bit of its mantissa set to 1; after the right shift by 16 that bit has weight 64, so 64 is OR-ed in to guarantee it.)
`(u.i >> 16) & 1` reads bit 16 of `u.i`, the least significant bit that survives in the bf16 result; it serves as the tie-breaker for round-to-nearest-even.
`0x7fff + ((u.i >> 16) & 1)`:
* If that bit is 1, the rounding constant becomes `0x8000`, so an exact half-way value rounds up to the even neighbour.
* Otherwise it stays `0x7fff`, and a half-way value rounds down (the result is already even).
Add this constant to `u.i`, then right-shift the sum by 16 bits; the upper 16 bits form the final result.
:::
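A tiny self-contained sketch of just the rounding arithmetic makes the tie-break visible; the two input bit patterns below are hand-picked half-way cases, and the helper name is illustrative.
```c
#include <stdint.h>
#include <stdio.h>

/* The rounding expression from fp32_to_bf16, isolated. */
static uint16_t round_to_bf16_bits(uint32_t i)
{
    return (uint16_t)((i + (0x7fff + ((i >> 16) & 1))) >> 16);
}

int main(void)
{
    /* 0x3F808000 is exactly half-way between 0x3F80 and 0x3F81;
     * 0x3F80 is even, so the tie rounds down. */
    printf("0x%04x\n", round_to_bf16_bits(0x3F808000u));  /* prints 0x3f80 */
    /* 0x3F818000 is half-way between 0x3F81 and 0x3F82;
     * 0x3F81 is odd, so the tie rounds up to the even 0x3F82. */
    printf("0x%04x\n", round_to_bf16_bits(0x3F818000u));  /* prints 0x3f82 */
    return 0;
}
```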
### Problem C
In Problem A, the `fabsf` function was used within the `fp32_to_fp16` function. Now, you are required to implement your own version of `fabsf`.
```c
static inline float fabsf(float x) {
    uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
    i &= 0x7FFFFFFF; /*C01*/      // Clear the sign bit to get the absolute value
    x = *(float *)&i;             // Write the modified bits back into the float
    return x;
}
```
C01 = `0x7FFFFFFF`
#### 📝 Annotate
:::info

Clearing the sign bit (the most significant bit) with the mask `0x7FFFFFFF` yields the absolute value.
:::
Consider the C implementation of `__builtin_clz`:
```c
static inline int my_clz(uint32_t x) {
    int count = 0;
    for (int i = 31; i >= 0; --i) {
        if (x & (1U << i))
            break;
        count++;
    }
    return count;
}
```
The `__builtin_clz` function is a GCC built-in function that counts the number of leading zero bits in the binary representation of an unsigned integer, starting from the most significant bit (MSB). It returns an integer representing how many zero bits precede the first `1` in the binary representation of the number.
For example, given the number `x`, `__builtin_clz(x)` counts how many 0s are at the beginning of the binary form of `x`.
Examples of `__builtin_clz`
1. **Example 1**: For `x = 16`
* Binary representation: `00000000 00000000 00000000 00010000`
* The number of leading zero bits is `27` (as there are 27 zeros before the first `1`).
* Output: `__builtin_clz(16)` returns `27`.
2. **Example 2**: For `x = 1`
* Binary representation: `00000000 00000000 00000000 00000001`
* The number of leading zero bits is `31`.
* Output: `__builtin_clz(1)` returns `31`.
3. **Example 3**: For `x = 255`
* Binary representation: `00000000 00000000 00000000 11111111`
* The number of leading zero bits is `24`.
* Output: `__builtin_clz(255)` returns `24`.
4. **Example 4**: For `x = 0xFFFFFFFF` (which is `4294967295` in decimal)
* Binary representation: `11111111 11111111 11111111 11111111`
* There are no leading zeros.
* Output: `__builtin_clz(0xFFFFFFFF)` returns `0`.
5. **Example 5**: For `x = 0` (undefined behavior)
* Binary representation: `00000000 00000000 00000000 00000000`
* Since `x = 0`, the result is **undefined** according to the GCC manual.
* Output: **undefined**.
Next, you will implement the `fp16_to_fp32` function, which converts a 16-bit floating-point number in IEEE half-precision format (bit representation) to a 32-bit floating-point number in IEEE single-precision format (bit representation). This implementation will avoid using any floating-point arithmetic and will utilize the `clz` function discussed earlier.
```c
static inline uint32_t fp16_to_fp32(uint16_t h) {
/*
* Extends the 16-bit half-precision floating-point number to 32 bits
* by shifting it to the upper half of a 32-bit word:
* +---+-----+------------+-------------------+
* | S |EEEEE|MM MMMM MMMM|0000 0000 0000 0000|
* +---+-----+------------+-------------------+
* Bits 31 26-30 16-25 0-15
*
* S - sign bit, E - exponent bits, M - mantissa bits, 0 - zero bits.
*/
const uint32_t w = (uint32_t) h << 16;
/*
* Isolates the sign bit from the input number, placing it in the most
* significant bit of a 32-bit word:
*
* +---+----------------------------------+
* | S |0000000 00000000 00000000 00000000|
* +---+----------------------------------+
* Bits 31 0-31
*/
const uint32_t sign = w & UINT32_C(0x80000000);
/*
* Extracts the mantissa and exponent from the input number, placing
* them in bits 0-30 of the 32-bit word:
*
* +---+-----+------------+-------------------+
* | 0 |EEEEE|MM MMMM MMMM|0000 0000 0000 0000|
* +---+-----+------------+-------------------+
* Bits 31 26-30 16-25 0-15
*/
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
/*
* The renorm_shift variable indicates how many bits the mantissa
* needs to be shifted to normalize the half-precision number.
* For normalized numbers, renorm_shift will be 0. For denormalized
* numbers, renorm_shift will be greater than 0. Shifting a
* denormalized number will move the mantissa into the exponent,
* normalizing it.
*/
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
/*
* If the half-precision number has an exponent of 15, adding a
* specific value will cause overflow into bit 31, which converts
* the upper 9 bits into ones. Thus:
* inf_nan_mask ==
* 0x7F800000 if the half-precision number is
* NaN or infinity (exponent of 15)
* 0x00000000 otherwise
*/
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
INT32_C(0x7F800000);
/*
* If nonsign equals 0, subtracting 1 will cause overflow, setting
* bit 31 to 1. Otherwise, bit 31 will be 0. Shifting this result
* propagates bit 31 across all bits in zero_mask. Thus:
* zero_mask ==
* 0xFFFFFFFF if the half-precision number is
* zero (+0.0h or -0.0h)
* 0x00000000 otherwise
*/
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
/*
* 1. Shifts nonsign left by renorm_shift to normalize it (for denormal
* inputs).
* 2. Shifts nonsign right by 3, adjusting the exponent to fit in the
* 8-bit exponent field and moving the mantissa into the correct
* position within the 23-bit mantissa field of the single-precision
* format.
* 3. Adds 0x70 to the exponent to account for the difference in bias
* between half-precision and single-precision.
* 4. Subtracts renorm_shift from the exponent to account for any
* renormalization that occurred.
* 5. ORs with inf_nan_mask to set the exponent to 0xFF if the input
* was NaN or infinity.
* 6. ANDs with the inverted zero_mask to set the mantissa and exponent
* to zero if the input was zero.
* 7. Combines everything with the sign bit of the input number.
*/
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70/*C02*/ - renorm_shift) << 23/*C03*/)) | inf_nan_mask) & ~zero_mask);
}
```
C02 = `0x70`
C03 = `0x17`
#### 📝 Annotate
:::info
The variables `w`, `sign`, and `nonsign` are explained in detail in the comments.
`renorm_shift` calculates the number of leading zeros. If this number exceeds `5`, it indicates that `h` is denormalized. In such cases, converting to **FP32** format increases the representable range, so re-normalization is required.

If the FP16 exponent field of `nonsign` is all ones (i.e., `nonsign >= 0x7C000000`, representing NaN or infinity), adding `0x04000000` carries into bit 31. The arithmetic right shift by 8 then smears that sign bit across the top nine bits, and masking with `0x7F800000` leaves `inf_nan_mask = 0x7F800000`. Otherwise `inf_nan_mask` is `0x00000000`.
If `nonsign` is 0, subtracting 1 wraps around to `0xFFFFFFFF`; the arithmetic shift right by 31 then makes `zero_mask = 0xFFFFFFFF`, and it is `0x00000000` for any non-zero input.
Final steps:
1. `nonsign << renorm_shift >> 3`
This adjusts the mantissa of an **FP16** denormalized number by left-shifting it to match the position of the mantissa in an FP32 normalized number.

2. `(0x70 - renorm_shift) << 23`
* The difference between the FP32 bias and the FP16 bias is `127 - 15 = 112 = 0x70`.
* Subtracting `renorm_shift` compensates for the left shift performed in step 1.
* The result is then left-shifted by 23 bits to place it in the exponent part of the FP32 format.
3. Adding the results of steps 1 and 2
* This produces the correct FP32 representation with the adjusted exponent and mantissa.
* Finally, check for special cases such as infinity (INF), NaN, and zero. Once these are handled, add the sign bit to complete the conversion.
:::
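A few spot checks, assuming `my_clz` and the `fp16_to_fp32` above are in the same file, cover the normalized, subnormal, infinity, and signed-zero paths:
```c
#include <assert.h>
#include <stdint.h>

int main(void)
{
    assert(fp16_to_fp32(0x3C00) == 0x3F800000u); /* 1.0h  -> 1.0f                */
    assert(fp16_to_fp32(0x0001) == 0x33800000u); /* smallest subnormal, 2^-24    */
    assert(fp16_to_fp32(0x7C00) == 0x7F800000u); /* +inf  -> +inf                */
    assert(fp16_to_fp32(0x8000) == 0x80000000u); /* -0.0h -> -0.0f               */
    return 0;
}
```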
### Problem D
The C function `is_aligned` determines whether a given pointer `ptr` is aligned to the specified `alignment`. It returns 1 if the pointer is properly aligned and 0 otherwise.
```c
bool is_aligned(void *ptr, size_t alignment) {
    return ((uintptr_t) ptr % alignment) == 0;
}
```
If the alignment is guaranteed to be a power of 2, the modulo operation can be optimized using a bitwise-AND operation.
You are expected to write a function `align_ptr` that aligns a pointer to the next address boundary based on the specified alignment. This function adjusts the given pointer to the next memory address that is a multiple of the provided alignment (which must be a power of 2, such as 1, 2, 4, or 8). If the pointer is already aligned, the original pointer will be returned. The alignment operation is performed using bitwise arithmetic without any conditional checks, making the function efficient.
```c
static inline uint8_t *align_ptr(const uint8_t *ptr, uintptr_t alignment) {
    alignment--; /* Adjust alignment to create a mask for the bitwise operation */
    return (uint8_t *) (((uintptr_t) ptr + alignment) & ~alignment /*D01*/ );
}
```
D01 = `~alignment`
#### 📝 Annotate
:::info
Assuming `alignment` = `8`, decrementing it (`alignment--`) changes it to `7` (`0b0111`).
* Adding `7` to `ptr` guarantees the result reaches or passes the next 8-byte boundary.
* ANDing with `~7` (`0b1111...1000`) then clears the low bits of the address, so the result is the next multiple of `8` (or `ptr` itself if it was already aligned).
:::
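A short usage sketch with synthetic addresses shows the effect of the mask trick (the values are chosen purely for illustration):
```c
#include <assert.h>
#include <stdint.h>

int main(void)
{
    uintptr_t alignment = 8;
    uintptr_t mask = alignment - 1;                  /* 0b0111 */
    assert(((0x1003 + mask) & ~mask) == 0x1008);     /* rounded up to boundary */
    assert(((0x2000 + mask) & ~mask) == 0x2000);     /* already aligned stays  */
    return 0;
}
```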
Assuming that `uintptr_t` represents the native word size of the platform (which is typically 4 or 8 bytes), you are required to implement your own version of `memcpy` that prevents word-sized memory accesses at unaligned addresses, thus avoiding potential performance penalties or misaligned access problems. This implementation is more efficient than byte-by-byte copying, especially for larger memory blocks, as it utilizes word-sized memory transfers where possible.
```c
#include <stddef.h>
#include <stdint.h>
void *my_memcpy(void *dest, const void *src, size_t n) {
    uint8_t *d = (uint8_t *)dest;
    const uint8_t *s = (const uint8_t *)src;
    /* Copy byte-by-byte until destination is word-aligned */
    while (n > 0 && ((uintptr_t)d % sizeof(uintptr_t)) != 0) {
        *d++ = *s++; /*D02*/
        n--;
    }
    /* Copy word-by-word */
    uintptr_t *d_word = (uintptr_t *)d;
    const uintptr_t *s_word = (const uintptr_t *)s;
    while (n >= sizeof(uintptr_t)) {
        *d_word++ = *s_word++; /*D03*/
        n -= sizeof(uintptr_t);
    }
    /* Copy any remaining bytes */
    d = (uint8_t *)d_word;
    s = (const uint8_t *)s_word;
    while (n > 0 /*D04*/ ) {
        *d++ = *s++; /*D05*/
        n--;
    }
    return dest;
}
```
D02 = `*d++ = *s++`
D03 = `*d_word++ = *s_word++`
D04 = `n > 0` OR `n` OR `n != 0`
D05 = `*d++ = *s++`
#### 📝 Annotate
:::info
First, copy byte by byte until the destination address is aligned to the word boundary (4- or 8-byte alignment):
* Copy the content: `*d = *s`
* Advance the pointers: `d++, s++`
Second, copy in word-sized chunks (4 or 8 bytes):
* Use word pointers for efficient copying: `*d_word++ = *s_word++`
* This step copies an entire word at a time, improving throughput.
Finally, handle the leftover bytes:
* Copy any remaining bytes that do not fill a whole word.
* The logic is the same as in the first step.
:::
### Problem E
Examine a C program that includes multiple memory bugs, categorized as shown in the following table.
| **ID** | **Memory bug** |
| :-- | :-- |
| A | Dangling Pointer |
| B | Mismatched Memory Management Functions |
| C | Use-After-Free |
| D | Invalid Free |
| E | Memory Leak |
| F | Out-of-Bounds Access |
| G | Double Free |
```c
#include <stdio.h>
#include <stdlib.h>
/* Bug 1: Memory Leak */
void func1()
{
int *data = malloc(100 * sizeof(int));
if (data == NULL) {
printf("Memory allocation failed\n");
return;
}
/*E01*/
/* Memory leak: data is never freed */
}
/* Bug 2: Invalid Memory Access (Out-of-Bounds Access) */
void func2()
{
int *arr = malloc(5 * sizeof(int));
if (!arr) {
printf("Memory allocation failed\n");
return;
}
/*E02*/
/* Invalid memory access: writing beyond allocated memory */
for (int i = 0; i <= 5; i++)
arr[i] = i;
free(arr);
}
/* Bug 3: Double Free */
void func3()
{
int *ptr = malloc(10 * sizeof(int));
if (!ptr) {
printf("Memory allocation failed\n");
return;
}
free(ptr);
/*E03*/
/* Double free: freeing the same memory again */
free(ptr);
}
/* Problem 4: Dangling Pointer */
int *func4()
{
int *arr = malloc(5 * sizeof(int));
if (!arr)
return NULL;
/*E04*/
/* Memory is freed, but pointer is returned */
free(arr);
return arr;
}
/* Problem 5: Invalid Free (Freeing Stack Memory) */
void func5()
{
int local_var = 42;
/*E05*/
/* Invalid free: trying to free stack-allocated memory */
free(&local_var);
}
/* Problem 6: Mismatched Memory Management Functions */
void func6()
{
/* Memory allocated with malloc */
int *data = malloc(10 * sizeof(int));
if (!data)
return;
/*E06*/
/* Incorrect deallocation using delete (should use free) */
delete data;
}
int main()
{
/* Call each function to demonstrate the memory issues */
func1();
func2();
func3();
int *dangling_ptr = func4();
if (dangling_ptr)
printf("Dangling pointer value: %d\n", dangling_ptr[0]);
func5();
func6();
return 0;
}
```
Now, you need to identify the causes of the memory bugs by selecting from the provided table and proposing fixes. Fields `#E01`, `#E02`, `#E03`, `#E04`, `#E05`, and `#E06` must be filled in the format "ID:code", where ID corresponds to the item listed in the table. For example, 'A' stands for 'Dangling pointer,' and 'code' represents your proposed change. If the code is empty, it means the suggested action is to remove the indicated line. For instance, 'D:' indicates the memory bug is classified as 'invalid free,' and the corresponding line should be removed. Note that no spaces are allowed in `#E01`, `#E02`, `#E03`, `#E04`, `#E05`, and `#E06`.
E01 = `E:free(data);`
E02 = `F:for(i=0;i<5;i++)`
E03 = `G:`
E04 = `A:`
E05 = `D:`
E06 = `B:free(data);`
#### 📝 Annotate
:::info
The code section has already been thoroughly explained, so no additional annotations are provided here.
:::
### Problem F
You are required to write a C function that allocates nodes for a singly linked list with a specified number of nodes. The function dynamically allocates memory for the list, properly linking each node’s next pointer. The data field of each node remains uninitialized. It returns a pointer to the head of the newly created linked list, or NULL if the length is 0. The caller is responsible for freeing the allocated memory.
```c
typedef struct node {
    int data;
    struct node *next;
} node_t;

node_t *list_create(const int length)
{
    node_t *head;
    node_t **ptr = &head;
    for (int i = 0; i < length; ++i) {
        node_t *node = malloc(sizeof(node_t));
        if (!node) {
            printf("Memory allocation failed\n");
            exit(1);
        }
        node->data = i; /* Initialize the data field for testing */
        *ptr = node;
        ptr = &node->next; /*F01*/
    }
    *ptr = NULL;
    return head;
}
```
F01 = `&node->next` OR `&(node->next)`
#### 📝 Annotate
:::info
To ensure that the node added in the next iteration of the loop does not break the chain, `ptr` must already point at `node->next`.
Memory relationship:
`ptr` (a `node_t **`) → the previous `next` pointer (or `head`) → the current **node** (`data`, `next`)
After `*ptr = node`, advancing `ptr = &node->next` makes `ptr` refer to the new tail's `next` field, which the next iteration (or the final `*ptr = NULL`) fills in.
:::
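A usage sketch, assuming `node_t`, `list_create`, and the standard headers are in scope; the caller walks the list and then frees every node:
```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    node_t *head = list_create(3);
    for (node_t *cur = head; cur; cur = cur->next)
        printf("%d ", cur->data);     /* prints: 0 1 2 */
    printf("\n");
    while (head) {                    /* caller is responsible for freeing */
        node_t *next = head->next;
        free(head);
        head = next;
    }
    return 0;
}
```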
## Quiz 2
### Problem A
A01 = `addi t1, x0, 32` OR equivalent
A02 = `bltz`
A03 = `bltz`
A04 = `shift_and_add_loop`
A05 = `shift_and_add_loop`
A06 = `andi`
A07 = `shift_and_add_loop`
A08 = `sw t0, 0(a4)`
#### 📝 Annotate
:::info
`#A01 # Set bit counter`
From the line `addi t1, t1, -1 # Decrease bit counter`, it can be inferred that the counter is stored in the register `t1`.
So `A01` = `addi t1, x0, 32` or an equivalent is used to initialize the counter `t1` with the value 32.
`#A02 a1, handle_negative1 # If multiplier negative`
If the multiplier is negative (multiplier < 0), jump to `handle_negative1`.
So, `#A02` = `bltz`
`#A03 a3, handle_negative2 # If multiplicand negative`
Same principle as `#A02`
`j, #A04 # Continue to main loop`
jump to `shift_and_add_loop:` So, `#A04` = `shift_and_add_loop`
`j, #A05 # Skip to main loop`
Same principle as `#A04`
`#A06 t2, a1, 1 # Check least significant bit`
`a1 & 0x001` checks the least significant bit.
So, `#A06` = `andi`
`j #A07 # Repeat loop`
Same principle as `#A04`
`#A08 # Store final result`
Store the accumulator `t0` to the memory address held in `a4`.
So, `#A08` = `sw t0, 0(a4)`
:::
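For reference, here is a C sketch of the shift-and-add multiplication that the fragments above suggest the assembly performs; the sign handling and the 32-iteration counter are assumptions based on the annotations, not the original listing.
```c
#include <assert.h>
#include <stdint.h>

static int32_t shift_add_mul(int32_t multiplicand, int32_t multiplier)
{
    uint32_t a = multiplicand < 0 ? 0u - (uint32_t)multiplicand : (uint32_t)multiplicand;
    uint32_t b = multiplier   < 0 ? 0u - (uint32_t)multiplier   : (uint32_t)multiplier;
    int negative = (multiplicand < 0) ^ (multiplier < 0);
    uint32_t acc = 0;
    for (int i = 0; i < 32; i++) {    /* bit counter, like t1 = 32 */
        if (b & 1)                    /* andi: check least significant bit */
            acc += a;
        a <<= 1;                      /* shift multiplicand left  */
        b >>= 1;                      /* shift multiplier right   */
    }
    return negative ? (int32_t)(0u - acc) : (int32_t)acc;
}

int main(void)
{
    assert(shift_add_mul(-6, 7) == -42);
    return 0;
}
```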
### Problem B
B01 = `-20`
B02 = `beq`
B03 = `jal`
B04 = `jal x1, hanoi`
B05 = `jalr`
B06 = `20`
B07 = `jalr`
B08 = `20`
B09 = `jalr`
B10 = `jal`
B11 = `addi a7, x0`
:::warning
or B11 = `li a7,`
:::
#### 📝 Annotate
:::info
`addi sp, sp, #B01 #Allocate stack`
The code below stores 5 registers: `x1`, `a1`, `a2`, `a3`, and `a4`.
Each register occupies 4 bytes, so the required space is `5 × 4 = 20` bytes.
So `#B01` = `-20`
`#B02 a1, t0, return # If there's only 1 disk, jump to return`
If `a1` equals `t0` (where `t0 = 1`), jump to `return`. So `#B02` = `beq`.
`#B03 x1, hanoi # Recursive call to hanoi`
Since the function needs to return after execution and the return address must be saved to `x1`, `B03` is the `jal` (Jump and Link) instruction.
`#B04 # Recursive call to hanoi`
Same principle as `#B03`
`#B05 x0, x1, 0 # Return to the caller`
`jal x1, hanoi` # Call the function: save the return address in x1
`jalr x0, x1, 0` # Return: read the address from x1 and jump back
Equivalent to the following in C:
```c
void hanoi() {
// ...
return; // Uses jalr to return
}
```
Thus, `#B05` = `jalr`.
`addi sp, sp, #B06 # Deallocate space on the stack`
Based on the instruction `lbu a3, 16(sp)`, it can be determined that `#B06` = `20`
`#B07 x0, x1, 0 # Return to the caller`
Same principle as `#B05`
`addi sp, sp, #B08 # Deallocate space on the stack`
Since `#B01` and `#B08` correspond to each other, it follows that `#B08` = `20`.
`#B09 x0, x1, 0 # Return to the caller`
Same principle as `#B05`
`#B10 x1, hanoi # Call the hanoi function`
Same principle as `#B03`
`#B11, 10 #Exit syscall`
RISC-V system call (syscall):
1. `a7` register is used to store the system call number.
2. `li` (Load Immediate) instruction is used to load an immediate value.
```
li a7, 10
ecall
```
This is equivalent to the following in C:
```
exit(0);
```
:::
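The assembly is the classic recursive Tower of Hanoi; a C sketch of the control flow it appears to implement (argument meanings are assumptions inferred from the annotated fragments) looks like this:
```c
#include <stdio.h>

static void hanoi(int n, char from, char aux, char to)
{
    if (n == 1) {                          /* beq a1, t0, return */
        printf("Move Disk 1 from '%c' to '%c'\n", from, to);
        return;
    }
    hanoi(n - 1, from, to, aux);           /* first recursive jal  */
    printf("Move Disk %d from '%c' to '%c'\n", n, from, to);
    hanoi(n - 1, aux, from, to);           /* second recursive jal */
}

int main(void)
{
    hanoi(3, 'A', 'B', 'C');
    return 0;
}
```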
### Problem C
```c
/* Iterative Tower of Hanoi Using Gray Code */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
static void print_move(int disk, char from, char to)
{
printf("Move Disk %d from '%c' to '%c'\n", disk, from, to);
}
int main()
{
int n_disks = 3; /* Number of disks */
int total_moves = (1 << n_disks) - 1;
const char pegs[3] = {'A', 'B', 'C'}; /* Peg labels */
int pos[n_disks]; /* Disk positions: 0-A, 1-B, 2-C */
/* Initialize all disks to peg A */
for (int i = 0; i < n_disks; i++)
pos[i] = 0;
/* Set direction based on parity of number of disks */
int direction = (n_disks & 1) ? -1 /* counter-clockwise */
: 1; /* clockwise */
/* Predefined packed mapping arrays */
const uint8_t peg_map[3] = {
0x9, /* Peg A: {CCW -> C (2), CW -> B (1)} */ // #C01
0x2, /* Peg B: {CCW -> A (0), CW -> C (2)} */ // #C02
0x4 /* Peg C: {CCW -> B (1), CW -> A (0)} */ // #C03
};
/* Calculate direction index: -1 -> 0 (CCW), 1 ->1 (CW) */
int direction_index = (direction + 1) / 2;
for (int n_moves = 1; n_moves <= total_moves; n_moves++) {
int curr_gray = n_moves ^ (n_moves >> 1);
int prev_gray = (n_moves - 1) ^ ((n_moves - 1) >> 1);
int changed_bit = curr_gray ^ prev_gray;
/* Identify the disk to move (0-indexed) */
int disk = __builtin_popcount(changed_bit - 1); // #C04
/* Current peg of the disk */
int curr_peg = pos[disk];
int new_peg;
if (disk == 0) {
/* Calculate shift: direction_index=0 (CCW) -> shift2,
* direction_index=1 (CW) -> shift0
*/
int shift = (1 - direction_index) << 1;
new_peg = (peg_map[curr_peg] >> shift) & 0x3; // #C05
} else {
/* Find the only peg not occupied by any smaller disk */
bool found_new_peg = false;
for (int p = 0; p < 3 && !found_new_peg; p++) {
if (p == curr_peg)
continue;
/* Check if any smaller disk is on peg p */
bool has_smaller = false;
for (int d = 0; d < disk; d++) {
if (pos[d] == p) {
has_smaller = true;
break;
}
}
if (!has_smaller) {
new_peg = p;
found_new_peg = true;
}
}
}
/* Execute the move */
print_move(disk + 1, pegs[curr_peg], pegs[new_peg]);
pos[disk] = new_peg;
}
return 0;
}
```
C01 = `0x9`
C02 = `0x2`
C03 = `0x4`
C04 = `changed_bit-1`
C05 = `0x3`
#### 📝 Annotate
:::info
Peg `A` movement options: `0b1001`
CCW to `C(2)` set `0b10`, CW to `B(1)` set `0b01`
Peg `B` movement options: `0b0010`
CCW to `A(0)` set `0b00`, CW to `C(2)` set `0b10`
Peg `C` movement options: `0b0100`
CCW to `B(1)` set `0b01`, CW to `A(0)` set `0b00`
`int disk = __builtin_popcount(#C04);` decides which disk to move.
First, observe the movement pattern of the Tower of Hanoi:
* The smallest disk (disk 0) moves every 1 step.
* The second smallest disk (disk 1) moves every 2 steps.
* Disk 2 moves every 4 steps.
* Disk 3 moves every 8 steps.
* and so on...
Now, observe the pattern of Gray codes:
* bit_0 changes every 1 step.
* bit_1 changes every 2 steps.
* And so on.
This matches the movement pattern of the Tower of Hanoi, where each disk's movement frequency follows the same rule as the corresponding Gray code bit's change frequency.
```
move changed_bit changed_bit-1 popcount disk
1 1 (0001) 0 (0000) 0 0
2 2 (0010) 1 (0001) 1 1
3 1 (0001) 0 (0000) 0 0
4 4 (0100) 3 (0011) 2 2
5 1 (0001) 0 (0000) 0 0
6 2 (0010) 1 (0001) 1 1
7 1 (0001) 0 (0000) 0 0
8 8 (1000) 7 (0111) 3 3
```
1. `changed_bit` always has exactly one bit set, so it is a power of 2.
2. The number of 1 bits in `changed_bit - 1` is the index of that set bit, which is exactly the disk number to move.
3. `#C04` = `changed_bit-1`
new_peg = (peg_map[curr_peg] >> shift) & `#C05`
* Regardless of the direction (clockwise or counterclockwise), you only need to take 2 bits to determine the new_peg.
* `#C05` = `0b11` = `0x3`
:::
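Since `changed_bit` is always a power of two, `__builtin_popcount(changed_bit - 1)` is simply the index of its set bit; the short check below (GCC built-ins) confirms it matches `__builtin_ctz` for the 7 moves of a 3-disk game.
```c
#include <assert.h>

int main(void)
{
    for (int move = 1; move <= 7; move++) {
        int curr = move ^ (move >> 1);
        int prev = (move - 1) ^ ((move - 1) >> 1);
        int changed_bit = curr ^ prev;           /* exactly one bit set */
        assert(__builtin_popcount(changed_bit - 1) == __builtin_ctz(changed_bit));
    }
    return 0;
}
```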
### Problem D
D01 = `add t2, t2, t1`
D02 = `srli`
D03 = `slli`
D04 = `sw t2, 0(x1)` OR equivalent
#### 📝 Annotate
:::info
The code section has already been thoroughly explained, so no additional annotations are provided here.
:::
## Quiz 3
### Problem A
A01 = ==`0x37`==
A02 = ==`0x17`==
A03 = ==`7`==
A04 = ==`15`==
A05 = ==`12`==
A06 = ==`20`==
A07 = ==`-4`==
A08 = ==`4`==
#### 📝 Annotate
:::info
A01, A02: `lui` and `auipc` are U-type instructions with opcodes `0x37` and `0x17` respectively.
A03~A05: these are the bit offsets of the fields in the RV32 encoding: `rd` starts at bit 7, `funct3` at bit 12, and `rs1` at bit 15.
A06: according to the comment `/* imm[11:0] = inst[31:20] */`, the immediate is obtained by shifting the instruction right by 20 bits.
A07: `auipc` adds its upper immediate to the address of the `auipc` instruction itself.
In the `main` function the program counter (`PC`) has already been advanced by `PC + 4` before the instruction is handled, so to emulate `auipc` correctly the original `PC` must be restored by subtracting 4.
A08: In general, when a jump does not occur, `PC = PC + 4`
:::
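A minimal decoding sketch using the standard RV32 field positions shows where the shift amounts 7, 12, 15, and 20 come from (the helper names are illustrative):
```c
#include <stdint.h>

static inline uint32_t dec_opcode(uint32_t inst) { return inst & 0x7F; }
static inline uint32_t dec_rd(uint32_t inst)     { return (inst >> 7)  & 0x1F; }
static inline uint32_t dec_funct3(uint32_t inst) { return (inst >> 12) & 0x7;  }
static inline uint32_t dec_rs1(uint32_t inst)    { return (inst >> 15) & 0x1F; }
static inline int32_t  dec_i_imm(uint32_t inst)  { return (int32_t)inst >> 20; } /* imm[11:0] = inst[31:20] */

int main(void)
{
    uint32_t inst = 0x00500093;   /* addi x1, x0, 5 */
    return !(dec_opcode(inst) == 0x13 && dec_rd(inst) == 1 &&
             dec_funct3(inst) == 0 && dec_rs1(inst) == 0 && dec_i_imm(inst) == 5);
}
```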
### Problem B
B01 = ==`0`==
B02 = ==`32`==
B03 = ==`33`==
B04 = ==`34`==
B05 = ==`perform_one_cycle_of_booths_mutliplication(csa_out, multiplicand, multiplier)`==
B06 = ==`23`==
B07 = ==`15`==
B08 = ==`7`==
B09 = ==`31`==
B10 = ==`sign_extend(partial_carry.lo, shift_amount, 64)`==
B11 = ==`adder(partial_sum.hi, partial_carry.hi, alu_carry_in)`==
#### 📝 Annotate
:::info
:::
### Problem C
C01 = ==`srli a1, a1, 8`==
C02 = ==`beq a2, a4, sq_1`==
C03 = ==`srai a2, a2, 1`==
C04 = ==`srli a3, a1, 21`==
C05 = ==`srai a0, a0, 12`==
C06 = ==`mul a0, a0, a0`==
C07 = ==`srli a0, a0, 15`==
C08 = ==`srai a0, a0, 21`==
C09 = ==`slli a1, a1, 9`==
C10 = ==`srai a0, a0, 5`==
C11 = ==`addi a0, a0, 16 `==
C12 = ==`sltu a4, a3, a0`==
C13 = ==`slli a1, a1, 16`==
C14 = ==`slli a2, a2, 23`==
#### 📝 Annotate
:::info
Our goal is to approximate $\frac{1}{\sqrt y}$. Define $f(m) = \frac{1}{m^2}-y$ and solve for the root of $f(m) = 0$, which is $m = \frac{1}{\sqrt y}$.
The Newton-Raphson formula is: $m_{i+1} = m_i - \frac{f(m_i)}{f'(m_i)}$
First, take the derivative of $f$: $f'(m) = -\frac{2}{m^3}$
Substituting $f(m_i)$ and $f'(m_i)$ into the Newton-Raphson formula and simplifying gives: ==$m_{n+1} = m_n \times (1.5 - 0.5 \times y \times m_n^{2})$==
The basic iteration converges quickly but its precision is limited. A correction formula is therefore applied as the final step; it converges more slowly but is effective for high-precision results: ==$y_1 = y_0 - \frac{d\,y_0}{2}$==, where $x$ is the original input value, $y_0$ is the current reciprocal-square-root approximation, $d$ is its residual error (presumably $x \times y_0^{2} - 1$), and $y_1$ is the corrected approximation.
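A floating-point sketch of the derived iteration shows how quickly two steps converge; the arbitrary initial guess of 0.7 stands in for the table lookup used by the assembly.
```c
#include <math.h>
#include <stdio.h>

/* Two Newton-Raphson steps for 1/sqrt(y); then sqrt(y) = y * m. */
static float rsqrt_newton(float y, float m0)
{
    float m = m0;
    for (int i = 0; i < 2; i++)
        m = m * (1.5f - 0.5f * y * m * m);
    return m;
}

int main(void)
{
    float y = 2.0f;
    float m = rsqrt_newton(y, 0.7f);
    printf("sqrt(%g) ~= %g (libm: %g)\n", y, y * m, sqrtf(y));
    return 0;
}
```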
The explanation of implementing `sqrtf` in RISC-V assembly code is as follows:
1. Check if `a0` is negative. If it is, jump to `sq_0`.
2. Extract the mantissa (23+1 bits) in `a1`. Left-shift the mantissa by 8 bits, add the hidden 1 from the floating-point representation, then right-shift it by 8 bits (==C01== = `srli a1, a1, 8`) to restore it to its original position.
For example:
```
0 11010110 11010110000000000000000 -> shift left
0 11010110 00000000000000000000000 -> Set bit
1 11010110 00000000000000000000000 -> Restore
0 00000001 11010110000000000000000
```
3. Extract exponent and store it in `a2`. If the exponent is 0, jump to `sq_2` (which returns 0). If the exponent is 255 (INF), jump to `sq_1` (which returns INF). The instruction for this is ==C02== = `beq a2, a4, sq_1`.
4. Pre-correction for packing:
* The square root calculation halves the exponent. Adding the bias of 127 and then subtracting 2 (resulting in 125) ensures the exponent remains in the correct range after halving, while also preventing overflow or underflow.
5. Halve the exponent(square root operation halves the exponent) - if odd, double the mantissa.
6. Using the lookup method to obtain an initial approximation of the square root:
1. Prepare a predefined lookup table (`rsqrtapp`)
2. Use the high 4 bits of the mantissa as the index
3. Obtain the initial approximation value
4. Use this value as the starting point for Newton-Raphson iteration
* The mantissa originally occupies 23 bits. Adding the hidden 1 makes it 24 bits, and the extra left shift by 1 bit from the previous step makes it 25 bits. To extract the top 4 bits, the value must therefore be shifted right by 21 bits (==C04== = `srli a3, a1, 21`).
7. Perform the first Newton-Raphson iteration:
* Since the `a4` value occupies 8 bits and the mantissa will be multiplied by `a4` twice, the mantissa is right-shifted by 7 bits (8 + 8 + 23 - 32) so the products stay within 32 bits.
* The second `mul a0, a0, a4` corresponds to $y\times m_n^2$
* Adjust Q32 format to Q20: ==C05== = `srai a0, a0, 12`
* Process the correction formula to finally obtain a more accurate $m_{n+1}$
8. Perform the second Newton-Raphson iteration.
* Square m1: $m1 \times m1$ -> ==C06== = `mul a0, a0, a0`
* Scale from Q32 to Q17, ==C07== = `srli a0, a0, 15`
* Scale from Q36 to Q15, ==C08== = `srai a0, a0, 21`
9. Perform the final precision correction and rounding:
* Calculate the remainder: Determine the difference between the current approximation and the actual value.
* Generate a correction term: Compute a small adjustment to refine the approximation further.
* Adjust the approximate solution: Apply the correction term to improve the final result.
* Rounding strategy: Decide on an appropriate rounding method (e.g., round up, round down, or round to the nearest) to achieve the desired precision.
:::
### Problem D
#### 📝 Annotate
D01 = ==`Out = (¬A + ¬B)¬C`==
> Alternative: `Out = ¬(AB + C)` (equivalent by De Morgan's law)
:::info
1. C must be 0 for the PMOS on the top to conduct: `¬C`.
2. Either A or B must be 0 for the corresponding PMOS on the top to conduct: `(¬A + ¬B)`.
1 and 2 are connected in series, resulting in: `¬C(¬A + ¬B)`
When these conditions are satisfied, **Out** will connect to High and output 1.
Conversely, **Out** will connect to Low and output 0.
:::
D02 = ==00101000==
D03 = ==A==
#### 📝 Annotate
:::info
Based on the input sequence, the state transition process is: `A → D → A → B → C → D → A → D → A`
The FSM end state is A.
:::
D04 = ==3==
#### 📝 Annotate
:::info
The propagation delay is determined by the longest path's delay.
(figure omitted: it traces one of the longest paths through the circuit)
:::
D05 = ==9==
> the critical path is
B0 -(1 gate)- P0
P0 -(3 gate)- Cout0
Cout0 -(2 gate)- Cout1
Cout1 -(2 gate)- Cout2
Cout2 -(1 gate)- Σ3
#### 📝 Annotate
:::info

:::
### Problem E
E01 = ==42==
> Critical path = clk-to-q + longest CL + setup = 30ns for x_input to change (includes clk-to-q) + 9 AND + 3 setup = 42 ns
E02 = ==18==
> Shortest CL: NOT → AND = 5 + 9 = 14ns clk-to-q + shortest CL = 4ns + 14ns = 18ns
E03 = ==102==
> 25 + 50 (clock period) + 4 (clk-to-q) + 14 (OR) + 9 (AND) = 102 ns
#### 📝 Annotate
:::info
The clock period must be long enough for data to stabilize and reach the next register before that register's setup window at the next clock edge. Thus: $\text{Clock Period} \geq \text{x\_input to change } (30\,\text{ns}) + \text{AND } (9\,\text{ns}) + \text{setup } (3\,\text{ns})$
The maximum hold time prevents the current data from being overwritten by new data in the next clock cycle. It ensures that the current data can be correctly read. To ensure this, the time for old data to propagate through the shortest path must be greater than the clk-to-q time of the new data.
Thus, $\text{Max Hold Time} = \text{Shortest CL} + \text{clk-to-q}$.
The wait for the next clock period already covers the delay before x_input changes: x_input changes at 55 ns, before the next clock edge at 75 ns.
Thus, the actual delay affecting the output happens after the next clock edge.
Timeline:
```
25ns: First rising edge.
55ns: x_input changes.
75ns: Next rising edge (25 + 50).
79ns: clk-to-q delay (75 + 4).
93ns: OR gate delay (79 + 14).
102ns: AND gate delay (93 + 9).
```
:::
### Problem F
F01 = ==4==
> We have 9 instructions, so we need 4 bits to represent them.
F02 = ==16==
> Recall that, in RISC-V, the immediates encoded in instructions work in units of half-instructions, so jumping up to 64 KiB in either direction is the same as saying we want to jump up to $2^{15}$ half instructions in both directions. Since immediates are signed, we need 16 bits to represent this range of values.
#### 📝 Annotate
:::info
F01: $\lceil{\log_2{9}}\rceil = 4$
F02: Because the least significant bit (LSB) of the offset is implicitly 0, each encoded value is a multiple of 2, so the range that must be represented is ±32,768 half-instructions (65,536 / 2).
Calculating the number of bits needed:
$32{,}768 = 2^{15}$
This magnitude requires 15 bits; adding 1 sign bit gives a total of 16 bits.
:::
## Quiz 4
### Problem A
A01 = `blt x0, a0, recurse_case`
A02 = `jalr x0, ra, 0`
A03 = `addi sp, sp, -12`
A04 = `sw ra, 8(sp)`
A05 = `jal ra, square`
A06 = `lw ra, 8(sp)`
A07 = `addi sp, sp, 12`
A08 = `addi sp, sp, -4`
A09 = `jal ra, sum_of_squares`
A10 = `addi sp, sp, 4`
A11 = `jalr x0, ra, 0`
#### 📝 Annotate
:::info
A01: if n (`a0`) is greater than zero (`blt x0, a0`), branch to recurse_case; otherwise fall through to zero_case.
A02: return -> `jr ra`(pseudo-instructions) = `jalr x0, ra, 0`
A03: Save 3 reg, $3\times4 = 12$.
A04: save the return address (`ra`) onto the stack.
A05: Call square and store `pc` to `ra`.
A06: restore the return address from the stack into `ra`.
A07: 3 reg, $3\times4 = 12$.
A08: Save 1 reg, $1\times4 = 4$.
A09: Call sum_of_squares and store `pc` to `ra`.
A10: 1 reg, $1\times4 = 4$.
A11: `jr ra`(pseudo-instructions) = `jalr x0, ra, 0`
:::
From the perspective of the callee, is it necessary to save any registers in the prologue and epilogue of the function? If so, which registers need to be saved, and where should the prologue and epilogue be placed? If not, provide a brief explanation as to why. __ A12 __
A12 = (the answer must contain "No"/"Not"; responses with a similar meaning receive credit)
> No, we do not need to consider callee-saved registers because we are not using any in this function. However, since the function makes two function calls, it may be necessary to save the `ra` register in the prologue and restore it in the epilogue, right before the `jalr x0, ra`, 0 instruction, just before the zero_case label.
A13 = `sw ra, 0(sp)`
A14 = `and t3, t2, x1`
:::warning
`x1` is `ra`, the return-address register, so the listed answer `and t3, t2, x1` looks odd; `andi t3, t2, 1` appears to be the intended instruction.
:::
A15 = `slli t1, t1, 1`
A16 = `srli t2, t2, 1`
A17 = `bne t2, x0, square_loop`
A18 = `lw ra, 0(sp)`
A19 = `jalr x0, ra, 0`
#### 📝 Annotate
:::info
A13: return address save in `ra`
A14: To check whether the least significant bit (LSB) of `t2` is 1, AND `t2` with 1; the result (the LSB) is stored in `t3`.
A15: Shifting `t1` left by 1 bit is equivalent to multiplying by 2
A16: Shifting `t2` right by 1 bit is equivalent to dividing by 2.
A17: If `t2` is not zero, repeat square_loop; otherwise, continue to the next instruction.
A18: Restore return address
A19: `jr ra`(pseudo-instructions) = `jalr x0, ra, 0`
:::
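A C sketch of the overall call structure the assembly implements; `square()` here simply stands in for the shift-and-add loop annotated above (A13~A19):
```c
#include <stdio.h>

static int square(int n) { return n * n; }   /* placeholder for the shift-and-add loop */

static int sum_of_squares(int n)
{
    if (n <= 0)                               /* zero_case */
        return 0;
    return square(n) + sum_of_squares(n - 1); /* two calls, so ra must be saved */
}

int main(void)
{
    printf("%d\n", sum_of_squares(3));        /* 1 + 4 + 9 = 14 */
    return 0;
}
```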
### Problem B
B01 = 14
B02 = 25
B03 = 40M
> The maximum allowable hold time for RegC is determined by the time it takes for the input of RegC to change. This is calculated as the clk-to-q delay of RegA or RegB plus the shortest combinational logic delay:
$$
4 \, \text{ns} + (5 \, \text{ns} + 5 \, \text{ns}) = 14 \, \text{ns}
$$
The minimum clock cycle time must account for the clk-to-q delay, the longest combinational logic delay, and the setup time of RegC:
$$
4 \, \text{ns} + (5 \, \text{ns} + 5 \, \text{ns} + 5 \, \text{ns}) + 6 \, \text{ns} = 25 \, \text{ns}
$$
A clock cycle time of 25 ns corresponds to a clock frequency of:
$$
\frac{1}{25 \times 10^{-9} \, \text{s}} = 40 \, \text{MHz}
$$
#### 📝 Annotate
:::info
B01: shortest combinational logic delay

B02: One of the critical paths

B03: $$\text{clock frequency} = \frac{1}{\text{clock cycle time}}$$
:::
### Problem C
C01 = `~(k - 1)` (or an equivalent form)
> `k & ~(k - 1)`: isolates the least significant set bit of k, which is the largest power-of-2 divisor of k.
C02 = (answers with a similar meaning receive credit; they must mention "power of 2")
> Since the division operation is performed by a power of 2, the compiler optimizes it into a bit shift. This results in an efficient O(b) solution for finding the k-th element in a left-complete, array-based binary tree.
#### 📝 Annotate
:::info
The purpose of this function is to find the position of the K-th smallest element in a subtree of height H, rooted at P, after sorting the elements in ascending order.
Given `p = 3`, `h = 2`, `k = 3`, let's break it down step by step:
1. `p << h` = `3 << 2`: calculates the starting position of the layer (the leftmost node position).
```
Binary representation of 3 : 11
Left-shifting by 2 bits : 1100 (=12)
```
2. `k >> 1` = `3 >> 1`: locates the element's offset within that layer.
```
Binary representation of 3 : 11
Right-shifting by 1 bit: 1
```
3. The or operation `|` merges location information. At this point, the position of a leaf node of the subtree will be pointed out.
4. `k & ~(k-1)` This step is used to find the position of the rightmost 1 in k.
```
k=3: 0011
k-1=2: 0010
~(k-1): 1101
k & ~(k-1): 0001 (=1)
```
5. The final division is used to determine the final position:
Starting from the leaf position found in step 3, dividing by `k & ~(k-1)` (a power of two, $2^j$) moves the position up $j$ levels.
```
k_child(3, 2, 3) = (((3 << 2) | (3 >> 1)) / (3 & ~(2)))
= ((12 | 1) / 1)
= 13
```
For the subtree at position 3 and height 2, the 3rd element (in ascending order) is at position 13.
:::
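Putting the steps together, a hypothetical reconstruction of the function (its name and signature are assumptions based on the worked trace above) would be:
```c
#include <assert.h>

/* Position of the k-th element (in sorted order) of the subtree of height h
 * rooted at array index p, in a left-complete array-based binary tree. */
static unsigned k_child(unsigned p, unsigned h, unsigned k)
{
    return ((p << h) | (k >> 1)) / (k & ~(k - 1));
}

int main(void)
{
    assert(k_child(3, 2, 3) == 13);   /* matches the step-by-step trace */
    return 0;
}
```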
### Problem D
D01 = $-2^{18}$
D02 = $2^{18} - 1$
> The immediate field for the `jal` instruction is 20 bits, while for the `jalr` instruction, it is only 12 bits. This means `jal` can cover a wider range of addresses. Similar to the previous addressing mode, the 20-bit immediate value is multiplied by 2 and added to the program counter (PC) to determine the target address. Since the immediate value is signed, it spans a range of $[-2^{20}, 2^{20} - 2]$ bytes, or $[-2^{19}, 2^{19} - 1]$ 2-byte instructions. However, because we are dealing with 4-byte instructions, the effective range is $[-2^{18}, 2^{18} - 1]$ instructions relative to the current PC.
D03 = `"jalr"`
D04 = `"lbu"`
D05 = `"lhu"`
D06 = `"slli"`
D07 = `"slti"`
D08 = `"sltiu"`
D09 = `"andi"`
D10 = `"ecall"`
D11 = `"csrrs"`
D12 = `"bltu"`
D13 = `"bgeu"`
D14 = `"lui"`
D15 = `"auipc"`
D16 = `"jal"`
D17 = `e.u.i31_12`
D18 = `e.j.i10_1 << 1`
D19 = `e.s.i11_5 << 5`
D20 = `e.b.i4_1 << 1`
#### 📝 Annotate
:::info
Please see the link for RV32 instruction set architecture:
https://cs61c.org/fa24/pdfs/resources/reference-card.pdf
https://five-embeddev.com/riscv-user-isa-manual/Priv-v1.12/instr-table.html
D03~D17: these instruction mnemonics follow directly from the opcode/funct3 tables in the RV32 reference card linked above.
D18: the purpose of the left shift is to recombine the immediate bits, which are scattered across different positions of the instruction word, into a complete offset value.
D19, D20: likewise, `e.s.i11_5 << 5` and `e.b.i4_1 << 1` place the S-type and B-type immediate fragments back at their correct bit positions.
:::
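As an illustration of D18's point, here is a sketch that reassembles the J-type immediate from a raw instruction word using the standard RV32 bit positions (the function and variable names are illustrative):
```c
#include <stdint.h>

static int32_t decode_j_imm(uint32_t inst)
{
    uint32_t imm20    = (inst >> 31) & 0x1;    /* imm[20]    */
    uint32_t imm10_1  = (inst >> 21) & 0x3FF;  /* imm[10:1]  */
    uint32_t imm11    = (inst >> 20) & 0x1;    /* imm[11]    */
    uint32_t imm19_12 = (inst >> 12) & 0xFF;   /* imm[19:12] */
    uint32_t imm = (imm20 << 20) | (imm19_12 << 12) | (imm11 << 11) | (imm10_1 << 1);
    return (int32_t)(imm << 11) >> 11;         /* sign-extend the 21-bit offset */
}

int main(void)
{
    return decode_j_imm(0x0000006F) != 0;      /* jal x0, 0 has a zero offset */
}
```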
### Problem E
E01 = 665
> sw = clk-to-Q + Mem-Read + max(RegFileRead + Mux(ASel), Imm-Gen +
Mux(BSel)) + ALU + DMEM Setup
= 5ns + 300 ns + 60ns + 100ns + 200ns
= 665ns
E02 = 800
> lw = clk-to-Q + Mem-Read + max(RegFileRead + Mux(ASel), Imm-Gen +
Mux(BSel)) + ALU + Mem-Read + Mux(WBSel) + RegFileSetup
= 5ns + 300ns + 60ns + 100ns + 300ns + 15ns + 20ns
= 800ns
E03 = 500
> jal= clk-to-Q + Mem-Read + Imm-Gen + Mux(BSel) + ALU + Mux(PCSel) + PCSetup
= 5ns + 300ns + 45ns + 15ns + 100ns + 15ns + 20ns
= 500ns
E04 = PC + 4 (note: the answers for E04-E06 may appear in any order)
E05 = ALUOut
E06 = MEM
E07 = Inst
E08 = PC
E09 = ALUOut (note: E09 and E10 may be given in either order)
E10 = RegReadData2
E11 = Inst
#### 📝 Annotate
:::info
E01~E03: the critical paths are the ones broken down in the quoted calculations above (the datapath figures highlighting them are omitted here).
E04~E06: `PC + 4`, `ALUOut`, and `MEM` (in any order), the three candidate values from which the write-back data is selected.
E07:
**INST** represents the instruction itself.
The control logic needs to determine the next operation based on the content of the instruction.
Necessity of Passing INST in a Pipeline:
1. **Instruction Awareness in Each Stage:** Every stage must know what instruction is currently being processed.
2. **Accurate Control Signals:** This allows the control logic in the next stage to generate the correct control signals.
3. **Correct Stage Execution:** Ensures that each stage operates correctly based on the specific instruction being processed.
E08~E10: `PC`, `ALUOut`, and `RegReadData2` (the last two in either order); like `Inst`, these values must be latched in the pipeline registers so later stages can use them.
E11: Same as E07.
:::
### Problem F
F01 = 1
F02 = 2
F03 = 1
F04 = MEM
F05 = EX
F06 = WB
F07 = EX
#### 📝 Annotate
:::info
F01~F03: addi `rd` = and `rs1` = sltiu `rs1`

F04~F07: forwarding path

:::
F08 = 2
F09 = 3
F10 = 3
F11 = 4
F12 = 2
F13 = 3
F14 = 3
F15 = 4
F16 = 4
F17 = 3
F18 = 3
F19 = 4
F20 = 2-3-1-4
#### 📝 Annotate
:::info
F08~F09: addi `rd` = lw `rs1`
`addi t0, t0, 4` → `lw t1, 0(t0)` (dependence through `t0`)
F10~F11: lw `rd` = add `rs1`
`lw t1, 0(t0)` → `add t2, t1, x0` (dependence through `t1`)
F12~F13: This hazard can be resolved by forwarding from the MEM stage to the EX stage.

F14~F19: Instructions 3 and 4 form a load-use data hazard. Since `lw` only retrieves its data in the MEM stage but the following `add` needs that data in its EX stage, the pipeline must stall for one cycle.
In the next cycle the data is available in the WB stage and can be forwarded to the EX stage of the `add`, which resolves the hazard while preserving correct execution.

F20:
The original instruction order is `1-2-3-NOP-4`. Since instruction 1 has no dependencies, it can be inserted between instruction 3 and instruction 4, replacing the NOP.
The new instruction order becomes `2-3-1-4`.
:::
F21 = 4
> There are four hazards: a data hazard from `t1` between instructions 1 and 2, a data hazard from `s0` between instructions 2 and 3, another data hazard from `s0` between instructions 2 and 4, and a control hazard between instructions 4 and 5.
F22 = 2
> Assuming the RegFile allows simultaneous read and write operations in the same cycle, two stalls are needed between instructions 1 and 2 (WB → ID), and another two stalls are needed between instructions 2 and 3 (WB → ID).
### Problem G
G01 = `reg.io.q`
G02 = `Mux(count === 9.U, 0.U, result)`
G03 = `reg.io.d := next`
G04 = `val y = Output(UInt (16.W))`
G05 = `io.y := io.a & io.b`
:::warning
G01:`val count = reg.io.q`?
G02:`val next = Mux(count === 9.U, 0.U, result)`?
G05:`is (3.U) { io.y := io.a & io.b }`?
:::
#### 📝 Annotate
:::info
G01: Connects the output of the register to the corresponding line for count.
G02: When `count` equals 9, select 0 as the output.
Otherwise, select `result` (the output of the adder) as the output.
This implements the functionality of resetting the counter to 0 after reaching 9.
G03: Connects the output of the Mux(`next`) to the input of the register.
G04: The form of declaring the output y.
G05: Implement the remaining functionality and(&).
:::
### Problem H
H01 = `srli t2, a0, 31`
H02 = `srli t2, a1, 31`
H03 = `bgeu a2, t0, overflow_xe_gte_ye`
H04 = `srli a4, a4, 9`
H05 = `slli t2, t2, 31`
H06 = `0x7F800000`
H07 = `slli t2, t2, 31`
#### 📝 Annotate
:::info
H01、H02: The sign bit is stored in the most significant bit, so performing a logical right shift of 31 bits will retrieve the sign bit.
H03: if `a2` ≥ `t0` (255), jump to `overflow_xe_gte_ye`; `bgeu` performs an unsigned greater-or-equal comparison.
H04: In the previous operation, the significand was left-shifted to the highest bit. After completing the computation, it needs to be right-shifted by 9 bits to align properly.
H05: The sign is in the least significant bit. Left-shifting it by 31 bits moves it back to its correct position.
H06: `0x7F800000` is the FP32 encoding of infinity (all-ones exponent, zero mantissa).
H07: Same as H05
:::
H08 = `sll a1, a1, t1`
H09 = `slli a1, a1, 10`
H10 = `slli a0, a0, 23`
#### 📝 Annotate
:::info
H08: `t1 = -a0 = -(leading zeros)`.
`a1 << (32 - leading_zeros)` is equivalent to `a1 << (-leading_zeros)` here.
**Why a negative shift amount works:** the RISC-V `sll` instruction only uses the low 5 bits of the shift-amount register, so a shift by `-n` is taken modulo 32, i.e., it behaves exactly like a shift by `32 - n`.
H09: The previous instruction shifts right by 9 bits.
Adding the step to remove the hidden '1' results in a total left shift of 10 bits.
H10: Left-shift the exponent by 23 bits.
This aligns it to the position where the exponent resides in the IEEE 754 format.
:::
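A small check of the wrap-around behaviour H08 relies on; the helper emulates the `shamt = rs2[4:0]` rule of RV32 `sll`, and the sample value is arbitrary.
```c
#include <assert.h>
#include <stdint.h>

static uint32_t sll_rv32(uint32_t value, int32_t amount)
{
    return value << ((uint32_t)amount & 31);   /* only the low 5 bits count */
}

int main(void)
{
    uint32_t a1 = 0x00001234;
    int leading_zeros = 19;                    /* __builtin_clz(0x1234) */
    assert(sll_rv32(a1, -leading_zeros) == a1 << (32 - leading_zeros));
    return 0;
}
```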
## Quiz 5
### Problem A
A01
> ==40%==
Since the L1 cache has a local miss rate of 50%, it means L2 cache is accessed 50 times out of the 100 total memory accesses. Given that L2 cache experienced 20 misses, the local miss rate of L2 cache is calculated as:$$\text{Local Miss Rate of L2} = \frac{20 \ \text{misses}}{50 \ \text{accesses}} \times 100 = 40\%$$
A02
> ==25%==
The number of accesses to the L2 cache equals the number of misses in the L1 cache. To find the local miss rate of the L2 cache, we divide its global miss rate by the miss rate of the L1 cache:
$$\text{L2 Local Miss Rate} = \frac{\text{Misses in L2}}{\text{Accesses in L2}} = \frac{\frac{\text{Misses in L2}}{\text{Total Accesses}}}{\frac{\text{Misses in L1}}{\text{Total Accesses}}} = \frac{\text{Global Miss Rate of L2}}{\text{Miss Rate of L1}}$$
Substituting the values:
$$\text{L2 Local Miss Rate} = \frac{5\%}{20\%} = 0.25 = 25\%$$
A03
> ==30==
Let ( H ) represent the hit time of the L3 cache. By extending the AMAT equation, where the miss penalty of the L2 cache becomes the local AMAT of the L3 cache, we can express the system's AMAT as:
$$AMAT = 2 + 0.2 \times \left( 15 + 0.25 \times \left( H + 0.3 \times 100 \right) \right)$$
To ensure the AMAT does not exceed 8 cycles, we solve the inequality:
$$2 + 0.2 \times \left( 15 + 0.25 \times \left( H + 30 \right) \right) \leq 8$$
Simplifying this yields:
$$H \leq 30$$
Thus, the maximum hit time for the L3 cache is 30 cycles.
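The nested AMAT expression can be checked with a few lines of C using the numbers quoted above (the helper name is illustrative):
```c
#include <stdio.h>

/* AMAT of a three-level hierarchy: hit times h1..h3, local miss rates m1..m3,
 * memory access time mem. */
static double amat3(double h1, double m1, double h2, double m2,
                    double h3, double m3, double mem)
{
    return h1 + m1 * (h2 + m2 * (h3 + m3 * mem));
}

int main(void)
{
    /* With H = 30 the AMAT is exactly 8 cycles. */
    printf("AMAT = %.2f cycles\n", amat3(2, 0.2, 15, 0.25, 30, 0.3, 100));
    return 0;
}
```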
A04~A06
> ==6==, ==4==, ==10==
The breakdown of the Tag:Index:Offset (T/I/O) is calculated as follows:
* Offset = $\log_2(1 \, \text{KiB}) = \log_2(2^{10}) = 10$ bits
* Index = $\log_2\left(\frac{16 \, \text{KiB}}{1 \, \text{KiB}}\right) = \log_2(16) = 4$ bits
* Tag = Total address bits - Index bits - Offset bits$$\text{Tag} = 20 - 4 - 10 = 6 \, \text{bits}$$
A07
> ==50%==
Since integer accesses are spaced $4 \times 128 = 512$ bytes apart, each cache block holds two accesses. The first access to each block results in a compulsory miss, while the second access hits because both `A[i]` and `A[i + 128]` reside in the same cache block. This results in a cache hit rate of 50%.
A08
> ==50%==
The size of array A is $8192 \times 4 = 2^{15}$ bytes, which is exactly twice the size of the cache. After executing Line 7, the second half of `A` resides in the cache, but Line 11 starts processing from the first half of `A`, meaning none of the cached data from Line 7 can be reused. Since memory accesses in Line 11 follow the same pattern as Line 7, the cache hit rate remains the same at 50%. The variable `total` is likely kept in a register by the compiler, so it does not impact the cache hit calculation.
> A09 = ? A10 = ? A11 = ? A12 = ?
==1100==, ==1==, ==000==, ==Miss== or ==M==
> A13 = ? A14 = ? A15 = ? A16 = ?
==1101==, ==1==, ==101==, ==Replace== or ==R==
> A17 = ? A18 = ? A19 = ? A20 = ?
==0011==, ==0==, ==000==, ==Replace== or ==R==
A21
> ==27.3%== or ==$\frac{3}{11}$==
There are 3 hits out of 11 total accesses, resulting in a hit rate of approximately 27.3%.
#### 📝 Annotate
:::info
:::
### Problem B
B01
> ==4==
There are four hazards present:
A data hazard between instructions 1 and 2 due to `t1`.
A data hazard between instructions 2 and 3 caused by `s0`.
A data hazard between instructions 2 and 4 also involving `s0`.
A control hazard between instructions 4 and 5 caused by the branch instruction.
B02~B11
> B02 = ?
==2==
B03 = ?
==1==
B04 = ?
==2==
B05 = ?
==2==
B06 = ?
==2==
B07 = ?
==3==
B08 = ?
==2==
B09 = ?
==4==
B10 = ?
==no== (or zero)
B11 = ?
==3==
| Instruction | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
| -------- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1.`sub t1, s0, s1` | IF | ID | EX | MEM| WB | | | | |
| 2.`or -> nop` | | IF | ID | | | | | | |
| 2.`or s0, t0, t1` | | | IF | ID | EX | MEM| WB | | |
| 3.`sw s1, 100(s0)` | | | | IF | ID | EX | MEM| WB | |
| 4.`bgeu s0, s2, loop`| | | | | IF | ID | EX | MEM| WB |
| 5.`add t2, x0, x0` | | | | | | IF | ID | EX | MEM|
#### 📝 Annotate
:::info
:::
### Problem C
#### 📝 Annotate
:::info
C01: Using Branch Comp to compare the `rs1` and `rs2` of the `beq` instruction. If they are equal, Branch Comp sets `BrEq` to 1; otherwise, it sets to 0.
C02: The `jal` instruction does not compare rs1 and rs2; therefore, `BrLt` is a "*(don't care)" value.
C03: Using Branch Comp to compare the `rs1` and `rs2` of the `beq` instruction. If `rs1` less than `rs2`, Branch Comp sets `BrLt` to 1; otherwise, it sets to 0.
C04~C08: Only **branch** or **jump** instructions cause a jump.
* For other instructions, set `PCSel` to 0.
* For jump instructions, set `PCSel` to 1.
* For branch instructions, determine `PCSel` based on `BrEq` and `BrLt`, setting it to 0 or 1 accordingly.
C09~C10: `BrUn` means **Branch Unsigned**: 1 selects an unsigned comparison, and 0 a signed comparison.
* For the `beq` instruction, the result is the same regardless of signed or unsigned comparison, so `BrUn` is "*(don't care)"
* For **branch less than**, signed or unsigned comparison matters:
* The `blt` instruction requires signed comparison, so `BrUn` should be set to 0.
* The `bltu` instruction requires unsigned comparison, so BrUn should be set to 1.
C11~C15:
* `lw` (load word): Loads data from memory into register, so set `RegWEn` to 1.
* `sw` (store word): Stores data from a register into memory but does not write to a register, so set `RegWEn` to 0.
* Branch instructions: Do not write to any register, so set `RegWEn` to 0.
* `jal` (Jump and Link): The instruction format is `jal rd, label`; it performs `rd = PC + 4` and `PC = PC + imm`. Since it writes `PC + 4` to register `rd`, set `RegWEn` to 1.
C16: `WBSel` (Write Back Select) determines the source of the data to be written into the register:
* 0: Data comes from memory.
* 1: Data comes from the result of ALU computation.
* 2: Data comes from PC + 4
For the `jal` instruction, which writes `PC + 4` into `rd`, `WBSel` is set to 2.
:::
### Problem D
D01 = ?
> `sll s3, s3, s2`
D02 = ?
> `and s3, a1, s3`
D03 = ?
> `sll s5, s5, s4`
D04 = ?
> `or s1, s1, s5`
D05 = ?
> `addi s0, s0, 1`
================
D06 = ?
> `sub t1, t1, t0`
#### 📝 Annotate
:::info
https://hackmd.io/@vx6122035/2024-quiz7-problem-d
:::
### Problem E
E01 = ? E02 = ? E03 = ?
> ==0x13==, ==0x17==, ==1==
E04 = ? E05 = ? E06 = ?
> ==0x10==, ==0x13==, ==1==
E07 = ? E08 = ? E09 = ?
> ==0x20==, ==0x12==, ==1==
E10 = ? E11 = ? E12 = ?
> ==0x23==, ==0x18==, ==1==
E13 = ? E14 = ? E15 = ?
> ==0x11==, ==0x14==, ==0==
E16 = ? E17 = ? E18 = ?
> ==0xac==, ==0x15==, ==1==
E19 = ? E20 = ? E21 = ?
> ==0x34==, ==0x19==, ==1==
> 1. Access `0x11F0` (Read): Hit, Physical Address: `0x14F0`; Updated LRUs: `1, 7, 2, 5, 7, 0, 3, 4`
Corresponding Virtual Page: `0x11`
> 2. Access `0x1301` (Write): Miss, Map VPN `0x13` to PPN `0x17`, set valid and dirty bits.
Physical Address: `0x1701`; Updated LRUs: `2, 0, 3, 6, 7, 1, 4, 5`
> 3. Access `0x20AE` (Write): Hit, set dirty bit.
Physical Address: `0x12AE`; Updated LRUs: `3, 1, 4, 0, 7, 2, 5, 6`
> 4. Access `0x2332` (Write): Miss, Map VPN `0x23` to PPN `0x18`, set valid and dirty bits.
Physical Address: `0x1832`; Updated LRUs: `4, 2, 5, 1, 0, 3, 6, 7`
> 5. Access `0x20FF` (Read): Hit, Physical Address: `0x12FF`; Updated LRUs: `4, 2, 5, 0, 1, 3, 6, 7`
> 6. Access `0x3415` (Write): Miss (Replacement Needed). Replace the last entry, Map VPN `0x34` to PPN `0x19`, set dirty bit.
Physical Address: `0x1915`; Updated LRUs: `5, 3, 6, 1, 2, 4, 7, 0`
#### 📝 Annotate
:::info
From the statement:
**16-bit addresses**: This indicates the total virtual address space is 16 bits.
**256-byte pages**: Since a page size is 256 bytes, the Page Offset requires $\log_2(256) = 8$ bits.
The remaining bits 16 - 8 = 8 correspond to the Virtual Page Number (VPN).
:::
### Problem F
F01 = ?
> ==4==
With a memory size of 4 KiB divided into 64 B chunks, there are 64 chunks in total, so addresses from 0 to 63 must be representable; 4 exponent bits are enough for this. The highest exponent value before applying the standard bias (excluding the pattern reserved for infinity/NaN) is $2^4 - 2 = 14$.
Applying the standard bias of $-7$ results in a maximum exponent of 7, and $2^7 = 128$ is sufficient to cover values up to 63.
> Conversely, with only 3 exponent bits, the maximum exponent after applying a bias of $-3$ is 3, and $2^3 = 8$ is insufficient to reach 63.
F02 = ?
> ==11==
The largest memory address to represent is $0b11\_1111.111111$, corresponding to $63 + \frac{63}{64}$. To encode this value we need 11 mantissa bits, giving the representation $0b1.11111111111 \times 2^5$. Thus, 11 bits of mantissa are necessary.
#### 📝 Annotate
:::info
F01: Focusing only on the **exponent** part:
* 4 bits:
The maximum encodable value is `0b1110` (since `0b1111` is reserved for INF/NaN).
Subtracting the bias (`0b0111`) gives 7, so values up to $2^7 = 128$ can be reached, which is greater than 63. Therefore, 4 bits are sufficient to represent all chunks.
* 3 bits:
The maximum encodable value is `0b110`.
Subtracting the bias (`0b011`) gives 3, so only values up to $2^3 = 8$ can be reached, which is less than 63. Hence, 3 bits are not sufficient.
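A quick numeric sanity check of both counts; this is just a throwaway C restatement of the arithmetic above, not part of the quiz.
```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Exponent check: 4 stored bits reach 0b1110 = 14; subtracting the bias 7
     * leaves a largest unbiased exponent of 7, and 2^7 = 128 covers chunk 63.
     * With 3 bits the reach is only 2^(6-3) = 8. */
    printf("4-bit exponent reach: %.0f\n", pow(2, 14 - 7)); /* 128 */
    printf("3-bit exponent reach: %.0f\n", pow(2, 6 - 3));  /*   8 */

    /* Mantissa check: the largest address, 63 + 63/64, equals 4095/64, i.e.
     * 1.11111111111_2 x 2^5 -- 12 significant bits, hence 11 explicit
     * mantissa bits once the leading 1 is made implicit. */
    double v = 63.0 + 63.0 / 64.0;
    printf("significant bits: %.0f\n", log2(v * 64.0 + 1.0)); /* 12 */
    return 0;
}
```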
:::
### Problem G
G01 = ?
> ==maskeq_epi32(tmp, curr)==
G02 = ?
> ==firstv_epi32(tmp)==
G03 = ?
> ==arr[i] > running_max== (or equivalent)
#### 📝 Annotate
:::info
In the first for loop, the steps are as follows:
1. Load Elements in Groups of Four:
* Four elements are loaded into the vector `tmp`.
* The maximum value in `tmp` is loaded into `curr`.
2. Compare `curr` with `running_max`:
* If `curr` is greater than `running_max`, proceed to the subsequent steps.
3. Mask Elements Equal to `curr`:
* Create a mask for elements in `tmp` that are equal to `curr`.
* Example:
```
maskeq_epi32([2, 4, 8, 6], 8) → [0, 0, 0xFFFFFFFF, 0]
```
* This isolates the elements equal to the current maximum.
4. Find the Index:
* Use `index = i + firstv_epi32(tmp)` to find the first lane of the mask whose value is `0xFFFFFFFF`.
* Since `0xFFFFFFFF` has its least-significant bit set to 1, `firstv_epi32` locates that lane, so the expression yields the correct index.
5. Update `running_max` and handle the remaining elements:
In the second loop, the leftover elements (fewer than 4) are handled with a normal element-by-element comparison.
Reason for This Design:
SIMD (Single Instruction, Multiple Data) operations work on fixed-size groups of elements, which in this case is 4. If the total number of elements is not divisible by 4, the leftover elements cannot be processed using SIMD instructions.
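For reference, the same idea in plain scalar C. This is only a sketch: `VLEN`, the `argmax` name, and the inner loops stand in for the quiz's vector intrinsics (`maskeq_epi32`, `firstv_epi32`) and are not the quiz's code.
```c
#include <stddef.h>
#include <stdint.h>

#define VLEN 4 /* elements processed per "vector" step */

/* Scalar rendition of the algorithm described above: find the index of the
 * maximum element, processing VLEN elements per iteration and handling the
 * leftover tail one element at a time. Assumes n >= 1. */
static size_t argmax(const int32_t *arr, size_t n)
{
    int32_t running_max = arr[0];
    size_t index = 0;

    size_t i = 0;
    for (; i + VLEN <= n; i += VLEN) {
        /* Reduce the group to its maximum (stands in for loading tmp/curr). */
        int32_t curr = arr[i];
        for (size_t k = 1; k < VLEN; k++)
            if (arr[i + k] > curr)
                curr = arr[i + k];

        if (curr > running_max) {
            running_max = curr;
            /* Equivalent of maskeq_epi32 + firstv_epi32: the first lane
             * equal to the group maximum. */
            for (size_t k = 0; k < VLEN; k++) {
                if (arr[i + k] == curr) {
                    index = i + k;
                    break;
                }
            }
        }
    }

    /* Tail loop: fewer than VLEN elements remain, handled one by one. */
    for (; i < n; i++) {
        if (arr[i] > running_max) {
            running_max = arr[i];
            index = i;
        }
    }
    return index;
}
```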
:::
### Problem H
H01 = ?
> ==4==
There are 9 instructions in total, so we need 4 bits to represent them, as $\lceil \log_2(9) \rceil = 4$
H02 = ?
> ==16==
> Jumping 64 KiB in either direction means reaching $\pm 2^{16}$ bytes. Since the offset is counted in half-instructions (2 bytes), that is $\pm 2^{15}$ half-instructions; 15 bits cover the magnitude and one more bit is needed for the sign, so the immediate field is 16 bits (covering $-2^{15}$ to $2^{15} - 1$).
H03 = ?
> ==128==
Total bits used by immediates and the opcode:$$2 \times 11 + 7 = 29 \text{ bits}$$
Remaining bits for registers:$$64 - 29 = 35 \text{ bits}$$
There are 5 registers (`rs1`, `rs2`, `rs3`, `rd1`, `rd2`). Dividing the remaining bits equally among the registers:$$\frac{35}{5} = 7 \text{ bits per register}$$
This allows $2^7 = 128$ registers.
#### 📝 Annotate
:::info
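A trivial C restatement of the H03 bit budget, using the field widths assumed in the explanation above; the variable names are inventions of this note.
```c
#include <stdio.h>

int main(void)
{
    const int word = 64;                    /* instruction width in bits */
    const int opcode = 7, imm_bits = 11, imms = 2, regs = 5;

    const int remaining = word - (imms * imm_bits + opcode); /* 64 - 29 = 35 */
    const int bits_per_reg = remaining / regs;               /* 35 / 5 = 7  */

    printf("bits per register field = %d\n", bits_per_reg);
    printf("addressable registers   = %d\n", 1 << bits_per_reg); /* 128 */
    return 0;
}
```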
:::
### Problem I
I01 = ?
> ==4096== (Compulsory Misses for Loop A)
I02 = ?
> ==0== (No Conflict or Capacity Misses for Loop A)
I03 = ?
> ==4096== (Total Misses for Loop A)
I04 = ?
> ==4096== (Compulsory Misses for Loop B)
I05 = ?
> ==28672== (Conflict and Capacity Misses for Loop B)
I06 = ?
> ==32768== (Total Misses for Loop B)
#### 📝 Annotate
:::info
1. First, calculate some key parameters
* Cache size = 32 KiB = $2^{15}$ bytes
* Line size = 8 words = 32 bytes = $2^{5}$ bytes
* Number of lines = $2^{15}/2^{5}$ = $2^{10}$ lines
* Sets = $2^{10}/4$ = $2^8$ = 256 sets (4-way set-associative)
2. Matrix dimensions:
* i goes from 0 to 255 ($2^8$ values), j goes from 0 to 63 ($2^6$ values)
* Each element is 4 bytes
* Row size = 64 × 4 = 256 bytes = 8 cache lines
For Loop A (i,j):
* Compulsory misses (I01): Each cache line is loaded once on its first access
* Each row needs 8 cache lines
* 256 rows × 8 lines × 2 matrices = 4,096
* I01 = 4,096 misses
* Conflict/Capacity misses (I02):
* Row-major access means good spatial locality
* Inner loop accesses consecutive elements
* 4-way set associative with FIFO can hold multiple rows
* No additional misses, I02 = 0
* Total(I03) = 4096 + 0 = 4096
For Loop B (j,i):
* Compulsory misses (I04): Same as Loop A, I04 = 4096
* Conflict/Capacity misses (I05):
* Column-major access pattern
* Each column element is 256 bytes (8 lines) apart
* $256\;\text{sets} \div 8\;\text{lines} \times 4\;(\text{4-way}) \div 2\;(\text{2 matrices}) = 64$
* After 64 such accesses, the older lines in the cache have been completely overwritten.
* Since $64 < 256$ (the inner-loop trip count), by the time the next element of the same row is loaded, its cache line has already been evicted (conflict miss).
* Each access therefore results in a miss, so $I05 = I06\;(\text{total misses}) - I04\;(\text{compulsory misses})$
* Total(I06) = $2^8 \times 2^6 \times 2 = 32768$
* I05 = 32768 - 4096 = 28672
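The same bookkeeping as a tiny C calculation. It only redoes the arithmetic above (it is not a cache simulator), and the variable names are mine.
```c
#include <stdio.h>

int main(void)
{
    /* L1 parameters from the problem statement. */
    const int cache_bytes = 32 * 1024, line_bytes = 32, ways = 4;
    const int lines = cache_bytes / line_bytes;      /* 1024 */
    const int sets = lines / ways;                   /* 256  */

    /* Matrices: 256 x 64 ints, two of them; one row = 256 bytes = 8 lines. */
    const int rows = 256, cols = 64, matrices = 2;
    const int lines_per_row = cols * 4 / line_bytes; /* 8 */

    const int accesses = rows * cols * matrices;            /* 32768 */
    const int compulsory = rows * lines_per_row * matrices; /* 4096  */

    printf("sets = %d\n", sets);
    printf("Loop A: I01 = %d, I02 = 0, I03 = %d\n", compulsory, compulsory);
    /* Loop B: every access misses, so the extra misses are total - compulsory. */
    printf("Loop B: I04 = %d, I05 = %d, I06 = %d\n",
           compulsory, accesses - compulsory, accesses);
    return 0;
}
```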
:::
I07 = ?
> ==4096== (L2 Cache Misses for Loop A)
I08 = ?
> ==4096== (L2 Cache Misses for Loop B)
#### 📝 Annotate
:::info
L2 cache:
* Cache size = 256 KiB = $2^{18}$ bytes
* Line size = 8 words = 32 bytes = $2^{5}$ bytes
* Number of lines = $2^{18}/2^{5}$ = $2^{13}$ lines
* Sets = $2^{13}/8$ = $2^{10}$ = 1024 sets (8-way set-associative)
I07:
* Compulsory misses: each cache line is loaded once on its first access -> 4096 misses
* Conflict and capacity misses: good spatial locality -> no additional misses
* Total = 4096 misses
I08:
* Compulsory misses: 4096 misses
* Conflict and capacity misses:
* $1024\;\text{sets} \div 8\;\text{lines} \times 8\;(\text{8-way}) \div 2\;(\text{2 matrices}) = 512$
* An 8-way set-associative cache with FIFO replacement can hold all the rows in the working set -> no additional misses
* Total = 4096 misses
:::
I09 = ?
> ==11.75== cycles (AMAT for Loop A)
I10 = ?
> ==22.25== cycles (AMAT for Loop B)
#### 📝 Annotate
:::info
AMAT = $L1\;hit\;time + L1\;miss\;rate \times (L2\;hit\;time + L2\;miss\;rate \times DRAM\;hit\;time)$
I09: $4 + \cfrac{4096}{2^{8}\times2^{6}\times2} \times \left(12 + \cfrac{4096}{4096}\times50\right) = 11.75$
I10: $4 + \cfrac{32768}{2^{8}\times2^{6}\times2} \times \left(12 + \cfrac{4096}{32768}\times50\right) = 22.25$
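The same plug-in as a few lines of C, using the miss counts derived above; the `amat` helper and its parameter names are mine.
```c
#include <stdio.h>

/* AMAT = L1 hit + L1 miss rate * (L2 hit + L2 miss rate * DRAM time) */
static double amat(double l1_hit, double l1_miss_rate,
                   double l2_hit, double l2_miss_rate, double dram)
{
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * dram);
}

int main(void)
{
    const double accesses = 256 * 64 * 2; /* 32768 total accesses */
    /* Loop A: 4096 L1 misses, all of which also miss in L2 (4096/4096). */
    printf("I09 = %.2f\n", amat(4, 4096 / accesses, 12, 4096.0 / 4096.0, 50));
    /* Loop B: 32768 L1 misses, 4096 of which miss in L2. */
    printf("I10 = %.2f\n", amat(4, 32768 / accesses, 12, 4096.0 / 32768.0, 50));
    return 0;
}
```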
:::
I11 = ?
> ==Yes== (Virtual Address Aliasing Possible)
I12 = ?
> ==8== (Minimum Set-Associativity Required)
#### 📝 Annotate
:::info
* Page size = 4 KiB
* Cache size = 32 KiB
Virtual address aliasing problem:
When the bits used to index the cache extend beyond the page offset, two different virtual addresses that map to the same physical address can be placed in different cache sets, causing an aliasing problem. This can happen whenever the cache is larger than a page, unless the associativity is high enough.
Here, 32 KiB > 4 KiB, so aliasing issues may occur.
To avoid the aliasing problem, sufficient set associativity is required.
The associativity must be at least the cache size divided by the page size, so that the index bits stay within the page offset and the same physical address cannot land in two different sets.
32 KiB / 4 KiB = 8, meaning the cache needs to be at least 8-way set associative to prevent aliasing.
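A quick check of that rule in C, under the assumption that aliasing is avoided exactly when the index-plus-offset span (cache size ÷ ways) fits inside one page; the variable names are mine.
```c
#include <stdio.h>

int main(void)
{
    const int cache = 32 * 1024, page = 4 * 1024;
    const int ways = 8; /* candidate associativity */

    /* Bytes spanned by the set index + block offset; aliasing is possible
     * whenever this exceeds the page size. */
    const int index_span = cache / ways;
    printf("index+offset span = %d B vs page = %d B -> %s\n", index_span, page,
           index_span <= page ? "no aliasing" : "aliasing possible");

    printf("minimum associativity = %d ways\n", cache / page); /* 8 */
    return 0;
}
```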
:::