曹為廷
You are required to write C programs that convert between different floating-point formats. One of these formats is the half-precision floating-point, also known as FP16
or float16
, which is a compact representation designed to save memory. Half-precision numbers are increasingly used in fields like deep learning and image processing. Unlike the single-precision floating-point format (float32
), defined by IEEE 754, FP16 has fewer bits allocated for the exponent.
┌ sign
│
│ ┌ exponent
│ │
│ │ ┌ mantissa
│ │ │
│┌─┴─┐┌─┴──────┐
0b0000000000000000 IEEE binary16
The functions for converting between FP16 and float32 (FP32) are provided below.
static inline float bits_to_fp32(uint32_t w)
{
    union {
        uint32_t as_bits;
        float as_value;
    } fp32 = {.as_bits = w};
    return fp32.as_value;
}

static inline uint32_t fp32_to_bits(float f)
{
    union {
        float as_value;
        uint32_t as_bits;
    } fp32 = {.as_value = f};
    return fp32.as_bits;
}
Consider the fp16_to_fp32
function, which converts an FP16 value to a single-precision IEEE floating-point (float32) format.
static inline float fp16_to_fp32(fp16_t h)
{
    const uint32_t w = (uint32_t) h << 16;
    const uint32_t sign = w & UINT32_C(0x80000000);
    const uint32_t two_w = w + w;

    const uint32_t exp_offset = UINT32_C(0xE0) << 23;
    const float exp_scale = 0x1.0p-112f;
    const float normalized_value =
        bits_to_fp32((two_w >> 4) + exp_offset) * exp_scale;

    const uint32_t mask = UINT32_C(126) << 23;
    const float magic_bias = 0.5f;
    const float denormalized_value =
        bits_to_fp32((two_w >> 17) | mask) - magic_bias;

    const uint32_t denormalized_cutoff = UINT32_C(1) << 27;
    const uint32_t result =
        sign | (two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value)
                                            : fp32_to_bits(normalized_value));
    return bits_to_fp32(result);
}
Next, complete the fp32_to_fp16
function, which is responsible for converting a float32 value to FP16. The identifiers starting with #A
indicate the sections where you need to provide the appropriate answers.
static inline fp16_t fp32_to_fp16(float f)
{
    const float scale_to_inf = 0x1.0p+112f;
    const float scale_to_zero = 0x1.0p-110f;
    float base = (fabsf(f) * scale_to_inf) * scale_to_zero;

    const uint32_t w = fp32_to_bits(f);
    const uint32_t shl1_w = w + w;
    const uint32_t sign = w & UINT32_C(0x80000000);
    uint32_t bias = shl1_w & UINT32_C(0xFF000000);
    if (bias < UINT32_C(0x71000000))
        bias = UINT32_C(0x71000000);

    base = bits_to_fp32((bias >> 1) + UINT32_C(0x07800000/*A01*/)) + base;
    const uint32_t bits = fp32_to_bits(base);
    const uint32_t exp_bits = (bits >> 13) & UINT32_C(0x00007C00);
    const uint32_t mantissa_bits = bits & UINT32_C(0x00000FFF/*A02*/);
    const uint32_t nonsign = exp_bits + mantissa_bits;
    return (sign >> 16) |
           (shl1_w > UINT32_C(0xFF000000) ? UINT16_C(0x7E00/*A03*/) : nonsign);
}
The values for #A01
, #A02
, and #A03
must be expressed as hexadecimal integer literals in their shortest form. For example, use 0x7C00
rather than 0x007C00
.
A01 = 0x7800000
A02 = 0xFFF; 0x7FF or ~0x7FFFFF also work.
As long as the expression extracts the rightmost mantissa bits without including the exponent and sign parts, it will still work correctly.
A03 = 0x7E00
First, use scale_to_inf and scale_to_zero to detect whether the value f is outside the representable range of FP16. For example:
- Values that are too small (exponent field clamped at 0x71000000) will ultimately be converted to 0 (underflow).
- Values that are too large become INF (overflow). Finally, through the condition (shl1_w > UINT32_C(0xFF000000)), NaN inputs return the FP16 quiet NaN 0x7E00.
Next, let w
be the 32-bit integer representation of the float f
, and let shl1_w
be the value of w
left-shifted by one bit.
Then, extract the most significant bit of w
as the sign
.
The bias
is obtained by extracting the leftmost 8 bits of shl1_w
, which corresponds to the exponent part of f
.
The reason why the minimum value of bias is 0x71000000
is as follows:
In FP32, the exponent bias is 127
, while in FP16, the exponent bias is 15
. When converting a 32-bit floating-point number to a 16-bit floating-point number, we need to adjust the exponent.
Specifically, we subtract (127 - 15), which is 112, from the FP32 exponent. This is why the bias is set to no less than 0x71000000
(which equals 113 << 24). This value represents the smallest non-zero exponent that can be represented in FP16.
The value 0x07800000
corresponds to the exponent bias of 15 (the FP16 bias) shifted left by 23 bits.
Thus, the expression (bias >> 1) + UINT32_C(0x07800000)
can be understood as:
The operation base_new = bits_to_fp32((bias >> 1) + UINT32_C(0x07800000)) + base_old
effectively calculates:
According to the calculation above, the exponent of base_new
is equal to the exponent of f
+ 15.
Then, based on the rules of floating-point addition, the mantissa of base_new
becomes the mantissa of f
right-shifted by (15 - 2) bits (including rounding).
(15-2 is because during scaling, base
is 4 times larger than f
, so it is right-shifted 2 bits less .)
exp_bits
takes bits 28 to 24 of base_new as the exponent for the FP16 format and is shifted right to align with bits 15 to 11.
Next, mantissa_bits
takes the rightmost 10 bits of base_new
as the mantissa for the FP16 format.
The reason for taking 1 extra bit for mantissa_bits
is that the exponent has a bias, which compensates for this extra bit, ensuring proper alignment with exp_bits.
Finally, combine sign and nonsign with a bitwise OR to obtain the final FP16 representation.
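A quick way to sanity-check the two conversion functions together is a round trip through FP16 and back. The sketch below assumes fp16_t is a 16-bit unsigned integer type (the bit manipulations above suggest this, but the typedef is not shown in this excerpt), and the sample values are arbitrary:

```c
#include <stdio.h>

int main(void)
{
    /* Representative values: exact, large (65504 is the FP16 maximum),
     * and one small value that underflows to 0 in FP16. */
    const float samples[] = {0.0f, 1.0f, -2.5f, 65504.0f, 1e-8f};
    for (int i = 0; i < 5; i++) {
        fp16_t h = fp32_to_fp16(samples[i]);
        printf("%g -> 0x%04x -> %g\n", samples[i], (unsigned) h, fp16_to_fp32(h));
    }
    return 0;
}
```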
The bfloat16 format is a 16-bit floating-point representation, designed to provide a wide dynamic range by using a floating radix point. It is a shortened version of the 32-bit IEEE 754 single-precision format (binary32), aimed at accelerating machine learning.
The structure of the bfloat16 floating-point format is as follows.
┌ sign
│
│ ┌ exponent
│ │
│ │ ┌ mantissa
│ │ │
│┌──┴───┐┌─┴───┐
0b0000000000000000 bfloat16
Since bfloat16 (bf16) shares the same number of exponent bits as a 32-bit float, encoding and decoding values is relatively simple.
static inline float bf16_to_fp32(bf16_t h)
{
    union {
        float f;
        uint32_t i;
    } u = {.i = (uint32_t)h.bits << 16};
    return u.f;
}
You are required to implement the fp32_to_bf16
function, which converts a float32 to bfloat16. The conversion must be bit-exact with the bfloat16 format: floats must round to the nearest even value, NaNs should be made quiet, and subnormal values should not be flushed to zero unless explicitly needed.
static inline bf16_t fp32_to_bf16(float s)
{
    bf16_t h;
    union {
        float f;
        uint32_t i;
    } u = {.f = s};
    if ((u.i & 0x7fffffff) > 0x7f800000/*B01*/) { /* NaN */
        h.bits = (u.i >> 16) | 64; /* force to quiet */
        return h;
    }
    h.bits = (u.i + (0x7fff/*B02*/ + ((u.i >> 0x10/*B03*/) & 1))) >> 0x10/*B04*/;
    return h;
}
B01 = 0x7f800000
B02 = 0x7fff
B03 = 0x10
Why is it 16 instead of 15?
B04 = 0x10
The largest representable number in FP32 is infinity, with a hexadecimal representation of 0x7f800000
.
If the binary representation exceeds 0x7f800000, it indicates that the exponent of s
is 11111111
, and the mantissa is not all zeros. In this case, the value is forced to be a quiet NaN (qNaN) and returned.
(A quiet NaN should have the highest bit of the mantissa set to 1, so 64 is forcibly OR-ed with the mantissa to ensure this.)
((u.i >> 16) & 1)
is used to check whether the 17th bit of s
is 1, in order to determine whether rounding is necessary.
0x7fff + ((u.i >> 16) & 1):
- If that bit is 1, the rounding constant becomes 0x8000.
- Otherwise, it stays 0x7fff.
Add this result to u.i, then right-shift the sum by 16 bits to extract the upper 16 bits as the final result. This bias implements round to nearest, ties to even.
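To see the tie-breaking in action, the following sketch (which assumes the bf16_t type used by bf16_to_fp32 above, with a 16-bit bits field) converts two values that fall exactly halfway between neighbouring bfloat16 encodings; both should land on the encoding whose last kept bit is 0:

```c
#include <stdio.h>

int main(void)
{
    /* 1 + 2^-8 is halfway between bf16 0x3F80 (1.0) and 0x3F81;
     * round-to-nearest-even keeps the even encoding 0x3F80. */
    bf16_t a = fp32_to_bf16(1.0f + 0x1.0p-8f);
    /* 1 + 3*2^-8 is halfway between 0x3F81 and 0x3F82;
     * the tie goes up to the even encoding 0x3F82. */
    bf16_t b = fp32_to_bf16(1.0f + 3 * 0x1.0p-8f);
    printf("0x%04x 0x%04x\n", (unsigned) a.bits, (unsigned) b.bits);
    return 0;
}
```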
In Problem A, the fabsf
function was used within the fp32_to_fp16
function. Now, you are required to implement your own version of fabsf
.
static inline float fabsf(float x) {
    uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
    i &= 0x7FFFFFFF; /*C01*/      // Clear the sign bit to get the absolute value
    x = *(float *)&i;             // Write the modified bits back into the float
    return x;
}
C01 = 0x7FFFFFFF
Mask the highest bit to obtain the absolute value.
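The pointer casts above rely on type punning; an equivalent formulation in the same union style as bits_to_fp32()/fp32_to_bits() earlier in this document would be the following sketch (shown only as an alternative, with a different name so it does not clash with the library fabsf):

```c
#include <stdint.h>

static inline float my_fabsf(float x)
{
    union {
        float as_value;
        uint32_t as_bits;
    } u = {.as_value = x};
    u.as_bits &= UINT32_C(0x7FFFFFFF); /* clear the sign bit */
    return u.as_value;
}
```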
Consider the C implementation of __builtin_clz
:
static inline int my_clz(uint32_t x) {
    int count = 0;
    for (int i = 31; i >= 0; --i) {
        if (x & (1U << i))
            break;
        count++;
    }
    return count;
}
The __builtin_clz
function is a GCC built-in function that counts the number of leading zero bits in the binary representation of an unsigned integer, starting from the most significant bit (MSB). It returns an integer representing how many zero bits precede the first 1
in the binary representation of the number.
For example, given the number x
, __builtin_clz(x)
counts how many 0s are at the beginning of the binary form of x
.
Examples of __builtin_clz:
- x = 16: 00000000 00000000 00000000 00010000. There are 27 zeros before the first 1, so __builtin_clz(16) returns 27.
- x = 1: 00000000 00000000 00000000 00000001. __builtin_clz(1) returns 31.
- x = 255: 00000000 00000000 00000000 11111111. __builtin_clz(255) returns 24.
- x = 0xFFFFFFFF (which is 4294967295 in decimal): 11111111 11111111 11111111 11111111. __builtin_clz(0xFFFFFFFF) returns 0.
- x = 0 (undefined behavior): 00000000 00000000 00000000 00000000. For x = 0, the result is undefined according to the GCC manual.
Next, you will implement the fp16_to_fp32
function, which converts a 16-bit floating-point number in IEEE half-precision format (bit representation) to a 32-bit floating-point number in IEEE single-precision format (bit representation). This implementation will avoid using any floating-point arithmetic and will utilize the clz
function discussed earlier.
static inline uint32_t fp16_to_fp32(uint16_t h) {
/*
* Extends the 16-bit half-precision floating-point number to 32 bits
* by shifting it to the upper half of a 32-bit word:
* +---+-----+------------+-------------------+
* | S |EEEEE|MM MMMM MMMM|0000 0000 0000 0000|
* +---+-----+------------+-------------------+
* Bits 31 26-30 16-25 0-15
*
* S - sign bit, E - exponent bits, M - mantissa bits, 0 - zero bits.
*/
const uint32_t w = (uint32_t) h << 16;
/*
* Isolates the sign bit from the input number, placing it in the most
* significant bit of a 32-bit word:
*
* +---+----------------------------------+
* | S |0000000 00000000 00000000 00000000|
* +---+----------------------------------+
* Bits 31 0-31
*/
const uint32_t sign = w & UINT32_C(0x80000000);
/*
* Extracts the mantissa and exponent from the input number, placing
* them in bits 0-30 of the 32-bit word:
*
* +---+-----+------------+-------------------+
* | 0 |EEEEE|MM MMMM MMMM|0000 0000 0000 0000|
* +---+-----+------------+-------------------+
* Bits 30 27-31 17-26 0-16
*/
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
/*
* The renorm_shift variable indicates how many bits the mantissa
* needs to be shifted to normalize the half-precision number.
* For normalized numbers, renorm_shift will be 0. For denormalized
* numbers, renorm_shift will be greater than 0. Shifting a
* denormalized number will move the mantissa into the exponent,
* normalizing it.
*/
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
/*
* If the half-precision number has an exponent of 15, adding a
* specific value will cause overflow into bit 31, which converts
* the upper 9 bits into ones. Thus:
* inf_nan_mask ==
* 0x7F800000 if the half-precision number is
* NaN or infinity (exponent of 15)
* 0x00000000 otherwise
*/
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
INT32_C(0x7F800000);
/*
* If nonsign equals 0, subtracting 1 will cause overflow, setting
* bit 31 to 1. Otherwise, bit 31 will be 0. Shifting this result
* propagates bit 31 across all bits in zero_mask. Thus:
* zero_mask ==
* 0xFFFFFFFF if the half-precision number is
* zero (+0.0h or -0.0h)
* 0x00000000 otherwise
*/
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
/*
* 1. Shifts nonsign left by renorm_shift to normalize it (for denormal
* inputs).
* 2. Shifts nonsign right by 3, adjusting the exponent to fit in the
* 8-bit exponent field and moving the mantissa into the correct
* position within the 23-bit mantissa field of the single-precision
* format.
* 3. Adds 0x70 to the exponent to account for the difference in bias
* between half-precision and single-precision.
* 4. Subtracts renorm_shift from the exponent to account for any
* renormalization that occurred.
* 5. ORs with inf_nan_mask to set the exponent to 0xFF if the input
* was NaN or infinity.
* 6. ANDs with the inverted zero_mask to set the mantissa and exponent
* to zero if the input was zero.
* 7. Combines everything with the sign bit of the input number.
*/
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70/*C02*/ - renorm_shift) << 23/*C03*/)) | inf_nan_mask) & ~zero_mask);
}
C02 = 0x70
C03 = 0x17
The variables w
, sign
, and nonsign
are explained in detail in the comments.
renorm_shift
calculates the number of leading zeros. If this number exceeds 5
, it indicates that h
is denormalized. In such cases, converting to FP32 format increases the representable range, so re-normalization is required.
If the exponent part of nonsign is all ones (i.e., nonsign >= 0x7C000000, representing NaN or infinity), the addition (nonsign + 0x04000000) overflows into bit 31, and the arithmetic shift >> 8 yields a value whose upper 9 bits are ones (0xFF800000 for infinity).
The inf_nan_mask
will ultimately be 0x7F800000
for these cases (NaN or infinity). Otherwise, it will be 0x00000000
.
If nonsign is 0, subtracting 1 wraps around to 0xFFFFFFFF; the arithmetic shift right by 31 then propagates bit 31, making zero_mask equal to 0xFFFFFFFF.
Final steps:
1. nonsign << renorm_shift >> 3: adjusts the mantissa of an FP16 denormalized number by left-shifting it to match the position of the mantissa in an FP32 normalized number, then shifts it into the 23-bit FP32 mantissa field.
2. (0x70 - renorm_shift) << 23: adds the exponent bias difference 127 - 15 = 112 = 0x70; subtracting renorm_shift compensates for the left shift performed in step 1.
3. Adding the results of steps 1 and 2, ORing with inf_nan_mask, ANDing with ~zero_mask, and ORing in the sign bit produces the final FP32 bit pattern.
The C function is_aligned determines whether a given pointer ptr is aligned to the specified alignment. It returns 1 if the pointer is properly aligned and 0 otherwise.
bool is_aligned(void *ptr, size_t alignment) {
    return ((uintptr_t) ptr % alignment) == 0;
}
If the alignment is guaranteed to be a power of 2, the modulo operation can be optimized using a bitwise-AND operation.
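A sketch of that optimization (valid only under the stated power-of-2 assumption, since only then is alignment - 1 a mask of the low-order bits):

```c
#include <stdbool.h>
#include <stdint.h>

static inline bool is_aligned_pow2(const void *ptr, uintptr_t alignment)
{
    /* A power-of-2 alignment has a single bit set, so the low bits
     * selected by (alignment - 1) must all be zero for an aligned address. */
    return ((uintptr_t) ptr & (alignment - 1)) == 0;
}
```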
You are expected to write a function align_ptr
that aligns a pointer to the next address boundary based on the specified alignment. This function adjusts the given pointer to the next memory address that is a multiple of the provided alignment (which must be a power of 2, such as 1, 2, 4, or 8). If the pointer is already aligned, the original pointer will be returned. The alignment operation is performed using bitwise arithmetic without any conditional checks, making the function efficient.
static inline uint8_t *align_ptr(const uint8_t *ptr, uintptr_t alignment) {
    alignment--; /* Adjust alignment to create a mask for the bitwise operation */
    return (uint8_t *) (((uintptr_t) ptr + alignment) & ~alignment /*D01*/);
}
D01 = ~alignment
Assuming alignment = 8, decrementing it with alignment-- changes it to 7 (0b0111).
- Adding 7 to ptr ensures there is enough room to reach the next alignment boundary.
- ANDing with ~7 (0b1111...1000) clears the lower bits of the pointer, effectively aligning the address to the nearest multiple of 8.
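A small usage check (the buffer and offsets are arbitrary; only the arithmetic matters):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t buf[32];
    uint8_t *p = align_ptr(buf + 1, 8); /* bumped up to the next 8-byte boundary */
    uint8_t *q = align_ptr(p, 8);       /* already aligned, so returned unchanged */
    printf("offset %td, unchanged: %d\n", p - buf, p == q);
    return 0;
}
```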
Assuming that uintptr_t
represents the native word size of the platform (which is typically 4 or 8 bytes), you are required to implement your own version of memcpy
that prevents word-sized memory accesses at unaligned addresses, thus avoiding potential performance penalties or misaligned access problems. This implementation is more efficient than byte-by-byte copying, especially for larger memory blocks, as it utilizes word-sized memory transfers where possible.
#include <stddef.h>
#include <stdint.h>
void *my_memcpy(void *dest, const void *src, size_t n) {
    uint8_t *d = (uint8_t *)dest;
    const uint8_t *s = (const uint8_t *)src;

    /* Copy byte-by-byte until destination is word-aligned */
    while (n > 0 && ((uintptr_t)d % sizeof(uintptr_t)) != 0) {
        *d++ = *s++; /*D02*/
        n--;
    }

    /* Copy word-by-word */
    uintptr_t *d_word = (uintptr_t *)d;
    const uintptr_t *s_word = (const uintptr_t *)s;
    while (n >= sizeof(uintptr_t)) {
        *d_word++ = *s_word++; /*D03*/
        n -= sizeof(uintptr_t);
    }

    /* Copy any remaining bytes */
    d = (uint8_t *)d_word;
    s = (const uint8_t *)s_word;
    while (n > 0 /*D04*/ ) {
        *d++ = *s++; /*D05*/
        n--;
    }
    return dest;
}
D02 = *d++ = *s++
D03 = *d_word++ = *s_word++
D04 = n > 0
OR n
OR n != 0
D05 = *d++ = *s++
First, copy byte by byte (*d++ = *s++) until the destination address is aligned to the word boundary (4- or 8-byte alignment).
Second, copy in word-sized chunks (4 or 8 bytes): *d_word++ = *s_word++.
Finally, handle any leftover bytes one at a time.
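A quick sanity check of the three phases (the sizes and offsets are arbitrary; the deliberately unaligned destination exercises the byte-wise prologue, the middle of the copy uses word transfers, and the tail handles the leftover bytes):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint8_t src[64], dst[64];
    for (int i = 0; i < 64; i++)
        src[i] = (uint8_t) i;
    my_memcpy(dst + 1, src, 37); /* unaligned destination, non-multiple length */
    assert(memcmp(dst + 1, src, 37) == 0);
    return 0;
}
```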
Examine a C program that includes multiple memory bugs, categorized as shown in the following table.
| ID | Memory bug |
|---|---|
| A | Dangling Pointer |
| B | Mismatched Memory Management Functions |
| C | Use-After-Free |
| D | Invalid Free |
| E | Memory Leak |
| F | Out-of-Bounds Access |
| G | Double Free |
#include <stdio.h>
#include <stdlib.h>
/* Bug 1: Memory Leak */
void func1()
{
int *data = malloc(100 * sizeof(int));
if (data == NULL) {
printf("Memory allocation failed\n");
return;
}
/*E01*/
/* Memory leak: data is never freed */
}
/* Bug 2: Invalid Memory Access (Out-of-Bounds Access) */
void func2()
{
int *arr = malloc(5 * sizeof(int));
if (!arr) {
printf("Memory allocation failed\n");
return;
}
/*E02*/
/* Invalid memory access: writing beyond allocated memory */
for (int i = 0; i <= 5; i++)
arr[i] = i;
free(arr);
}
/* Bug 3: Double Free */
void func3()
{
int *ptr = malloc(10 * sizeof(int));
if (!ptr) {
printf("Memory allocation failed\n");
return;
}
free(ptr);
/*E03*/
/* Double free: freeing the same memory again */
free(ptr);
}
/* Problem 4: Dangling Pointer */
int *func4()
{
int *arr = malloc(5 * sizeof(int));
if (!arr)
return NULL;
/*E04*/
/* Memory is freed, but pointer is returned */
free(arr);
return arr;
}
/* Problem 5: Invalid Free (Freeing Stack Memory) */
void func5()
{
int local_var = 42;
/*E05*/
/* Invalid free: trying to free stack-allocated memory */
free(&local_var);
}
/* Problem 6: Mismatched Memory Management Functions */
void func6()
{
/* Memory allocated with malloc */
int *data = malloc(10 * sizeof(int));
if (!data)
return;
/*E06*/
/* Incorrect deallocation using delete (should use free) */
delete data;
}
int main()
{
/* Call each function to demonstrate the memory issues */
func1();
func2();
func3();
int *dangling_ptr = func4();
if (dangling_ptr)
printf("Dangling pointer value: %d\n", dangling_ptr[0]);
func5();
func6();
return 0;
}
Now, you need to identify the causes of the memory bugs by selecting from the provided table and proposing fixes. Fields #E01
, #E02
, #E03
, #E04
, #E05
, and #E06
must be filled in the format "ID:code", where ID corresponds to the item listed in the table. For example, 'A' stands for 'Dangling pointer,' and 'code' represents your proposed change. If the code is empty, it means the suggested action is to remove the indicated line. For instance, 'D:' indicates the memory bug is classified as 'invalid free,' and the corresponding line should be removed. Note that no spaces are allowed in #E01
, #E02
, #E03
, #E04
, #E05
, and #E06
.
E01 = E:free(data);
E02 = F:for(i=0;i<5;i++)
E03 = G:
E04 = A:
E05 = D:
E06 = B:free(data);
The code section has already been thoroughly explained, so no additional annotations are provided here.
You are required to write a C function that allocates nodes for a singly linked list with a specified number of nodes. The function dynamically allocates memory for the list, properly linking each node’s next pointer. The data field of each node remains uninitialized. It returns a pointer to the head of the newly created linked list, or NULL if the length is 0. The caller is responsible for freeing the allocated memory.
typedef struct node {
    int data;
    struct node *next;
} node_t;

node_t *list_create(const int length)
{
    node_t *head;
    node_t **ptr = &head;
    for (int i = 0; i < length; ++i) {
        node_t *node = malloc(sizeof(node_t));
        if (!node) {
            printf("Memory allocation failed\n");
            exit(1);
        }
        node->data = i; /* Initialize the data field for testing */
        *ptr = node;
        ptr = &node->next; /*F01*/
    }
    *ptr = NULL;
    return head;
}
F01 = &node->next
OR &(node->next)
To ensure that the newly added nodes in the next iteration of the loop do not break the chain, ptr
must pre-link node->next
.
Memory relationship: ptr (node_t **) → node pointer (node_t *) → node (data, next)
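Since the caller is responsible for releasing the list, a matching cleanup routine might look like this (a minimal sketch; the name list_destroy is not part of the original problem):

```c
#include <stdlib.h>

void list_destroy(node_t *head)
{
    while (head) {
        node_t *next = head->next; /* save the link before freeing this node */
        free(head);
        head = next;
    }
}
```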
A01 = addi t1, x0, 32
OR equivalent
A02 = bltz
A03 = bltz
A04 = shift_and_add_loop
A05 = shift_and_add_loop
A06 = andi
A07 = shift_and_add_loop
A08 = sw t0, 0(a4)
#A01 # Set bit counter
From the line addi t1, t1, -1 # Decrease bit counter
, it can be inferred that the counter is stored in the register t1
.
So A01
= addi t1, x0, 32
or an equivalent is used to initialize the counter t1
with the value 32.
#A02 a1, handle_negative1 # If multiplier negative
If multiplier negative ( multiplier < 0 ) jump to handle_negative1
So, #A02
= bltz
#A03 a3, handle_negative2 # If multiplicand negative
Same principle as #A02
j, #A04 # Continue to main loop
jump to shift_and_add_loop:
So, #A04
= shift_and_add_loop
j, #A05 # Skip to main loop
Same principle as #A04
#A06 t2, a1, 1 # Check least significant bit
a1 & 0x001
can check least significant bit
So, #A06
= andi
j #A07 # Repeat loop
Same principle as #A04
#A08 # Store final result
Store accumulator t0
to address a4
So, #A08
= sw t0, 0(a4)
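The blanks above complete a classic shift-and-add multiplier. A C sketch of the same core loop (unsigned only, since the sign handling done by the handle_negative labels is performed separately in the assembly) is:

```c
#include <stdint.h>

static uint64_t shift_and_add(uint32_t multiplicand, uint32_t multiplier)
{
    uint64_t acc = 0;              /* running sum, like t0 */
    uint64_t addend = multiplicand;
    for (int i = 0; i < 32; i++) { /* bit counter, like t1 = 32 */
        if (multiplier & 1)        /* andi t2, a1, 1: test the low bit */
            acc += addend;
        addend <<= 1;              /* shift the multiplicand left */
        multiplier >>= 1;          /* move to the next multiplier bit */
    }
    return acc;
}
```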
B01 = -20
B02 = beq
B03 = jal
B04 = jal x1, hanoi
B05 = jalr
B06 = 20
B07 = jalr
B08 = 20
B09 = jalr
B10 = jal
B11 = addi a7, x0
or B11 = li a7,
addi sp, sp, #B01 #Allocate stack
Below, 5 registers are stored: x1, a1, a2, a3, and a4. Each register takes 4 bytes, so the required space is 5 × 4 = 20 bytes.
So #B01
= -20
#B02 a1, t0, return # If there's only 1 disk, jump to return
If a1 equals t0 (t0 = 1), jump to return. So #B02 = beq.
#B03 x1, hanoi # Recursive call to hanoi
Since the function needs to return after execution and the return address must be saved to x1
, B03
is the jal
(Jump and Link) instruction.
#B04 # Recursive call to hanoi
Same principle as #B03
#B05 x0, x1, 0 # Return to the caller
jal x1, hanoi
# Call the function: save the return address in x1
jalr x0, x1, 0
# Return: read the address from x1 and jump back
Equivalent to the following in C:
void hanoi() {
// ...
return; // Uses jalr to return
}
Thus, #B05
= jalr
.
addi sp, sp, #B06 # Deallocate space on the stack
Based on the instruction lbu a3, 16(sp)
, it can be determined that #B06
= 20
#B07 x0, x1, 0 # Return to the caller
Same principle as #B05
addi sp, sp, #B08 # Deallocate space on the stack
Since #B01
and #B08
correspond to each other, it follows that #B08
= 20
.
#B09 x0, x1, 0 # Return to the caller
Same principle as #B05
#B10 x1, hanoi # Call the hanoi function
Same principle as #B03
#B11, 10 #Exit syscall
RISC-V system call (syscall):
- The a7 register is used to store the system call number.
- The li (Load Immediate) pseudo-instruction is used to load an immediate value: li a7, 10.
- ecall then performs the system call (call number 10 is exit).
This is equivalent to the following in C:
exit(0);
/* Iterative Tower of Hanoi Using Gray Code */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
static void print_move(int disk, char from, char to)
{
printf("Move Disk %d from '%c' to '%c'\n", disk, from, to);
}
int main()
{
int n_disks = 3; /* Number of disks */
int total_moves = (1 << n_disks) - 1;
const char pegs[3] = {'A', 'B', 'C'}; /* Peg labels */
int pos[n_disks]; /* Disk positions: 0-A, 1-B, 2-C */
/* Initialize all disks to peg A */
for (int i = 0; i < n_disks; i++)
pos[i] = 0;
/* Set direction based on parity of number of disks */
int direction = (n_disks & 1) ? -1 /* counter-clockwise */
: 1; /* clockwise */
/* Predefined packed mapping arrays */
const uint8_t peg_map[3] = {
0x9, /* Peg A: {CCW -> C (2), CW -> B (1)} */ // #C01
0x2, /* Peg B: {CCW -> A (0), CW -> C (2)} */ // #C02
0x4 /* Peg C: {CCW -> B (1), CW -> A (0)} */ // #C03
};
/* Calculate direction index: -1 -> 0 (CCW), 1 ->1 (CW) */
int direction_index = (direction + 1) / 2;
for (int n_moves = 1; n_moves <= total_moves; n_moves++) {
int curr_gray = n_moves ^ (n_moves >> 1);
int prev_gray = (n_moves - 1) ^ ((n_moves - 1) >> 1);
int changed_bit = curr_gray ^ prev_gray;
/* Identify the disk to move (0-indexed) */
int disk = __builtin_popcount(changed_bit - 1); // #C04
/* Current peg of the disk */
int curr_peg = pos[disk];
int new_peg;
if (disk == 0) {
/* Calculate shift: direction_index=0 (CCW) -> shift2,
* direction_index=1 (CW) -> shift0
*/
int shift = (1 - direction_index) << 1;
new_peg = (peg_map[curr_peg] >> shift) & 0x3; // #C05
} else {
/* Find the only peg not occupied by any smaller disk */
bool found_new_peg = false;
for (int p = 0; p < 3 && !found_new_peg; p++) {
if (p == curr_peg)
continue;
/* Check if any smaller disk is on peg p */
bool has_smaller = false;
for (int d = 0; d < disk; d++) {
if (pos[d] == p) {
has_smaller = true;
break;
}
}
if (!has_smaller) {
new_peg = p;
found_new_peg = true;
}
}
}
/* Execute the move */
print_move(disk + 1, pegs[curr_peg], pegs[new_peg]);
pos[disk] = new_peg;
}
return 0;
}
C01 = 0x9
C02 = 0x2
C03 = 0x4
C04 = changed_bit-1
C05 = 0x3
Peg A movement options (0b1001): CCW to C (2) is encoded as 0b10, CW to B (1) as 0b01.
Peg B movement options (0b0010): CCW to A (0) is encoded as 0b00, CW to C (2) as 0b10.
Peg C movement options (0b0100): CCW to B (1) is encoded as 0b01, CW to A (0) as 0b00.
int disk = __builtin_popcount(#C04);
Decide which disk to move
First observing the movement pattern of the Tower of Hanoi:
Now, observe the pattern of Gray codes:
This matches the movement pattern of the Tower of Hanoi, where each disk's movement frequency follows the same rule as the corresponding Gray code bit's change frequency.
move changed_bit changed_bit-1 popcount disk
1 1 (0001) 0 (0000) 0 0
2 2 (0010) 1 (0001) 1 1
3 1 (0001) 0 (0000) 0 0
4 4 (0100) 3 (0011) 2 2
5 1 (0001) 0 (0000) 0 0
6 2 (0010) 1 (0001) 1 1
7 1 (0001) 0 (0000) 0 0
8 8 (1000) 7 (0111) 3 3
changed_bit always has exactly one bit set (it is a power of 2), so #C04 = changed_bit-1.
new_peg = (peg_map[curr_peg] >> shift) & #C05
#C05
= 0b11
= 0x3
D01 = add t2, t2, t1
D02 = srli
D03 = slli
D04 = sw t2, 0(x1)
OR equivalent
The code section has already been thoroughly explained, so no additional annotations are provided here.
A01 = 0x37
A02 = 0x17
A03 = 7
A04 = 15
A05 = 12
A06 = 20
A07 = -4
A08 = 4
A01, A02: The opcodes of lui and auipc are as follows.
A03~A05:
A06: According to /* imm[11:0] = inst[31:20] */, shift right by 20 bits.
A07: The function of auipc
is as shown in the figure
In the main
function, the program counter (PC
) was pre-incremented by PC + 4
. To correctly execute the auipc
instruction, the value of the PC
must be restored to its original value. Therefore, it is necessary to subtract 4 from the current PC
.
A08: In general, when a jump does not occur, PC = PC + 4
B01 = 0
B02 = 32
B03 = 33
B04 = 34
B05 = perform_one_cycle_of_booths_mutliplication(csa_out, multiplicand, multiplier)
B06 = 23
B07 = 15
B08 = 7
B09 = 31
B10 = sign_extend(partial_carry.lo, shift_amount, 64)
B11 = adder(partial_sum.hi, partial_carry.hi, alu_carry_in)
C01 = srli a1, a1, 8
C02 = beq a2, a4, sq_1
C03 = srai a2, a2, 1
C04 = srli a3, a1, 21
C05 = srai a0, a0, 12
C06 = mul a0, a0, a0
C07 = srli a0, a0, 15
C08 = srai a0, a0, 21
C09 = slli a1, a1, 9
C10 = srai a0, a0, 5
C11 = addi a0, a0, 16
C12 = sltu a4, a3, a0
C13 = slli a1, a1, 16
C14 = slli a2, a2, 23
Our goal is to find an approximation for
The Newton-Raphson formula is:
First, find the derivative of
Substitute
Additionally, the basic formula converges faster but has lower precision.
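A standard way to set this up (assuming the target of the iteration is $y \approx 1/\sqrt{x}$, as the rsqrtapp seed below suggests) is:

$$f(y) = \frac{1}{y^2} - x, \qquad f'(y) = -\frac{2}{y^3}, \qquad y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)} = \frac{y_n\left(3 - x\,y_n^2\right)}{2}.$$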
At this point, a correction formula is available, which converges more slowly but is effective for high-precision calculations. It is typically used as the final step for correction. The formula is as follows:
The explanation of implementing sqrtf
in RISC-V assembly code is as follows:
- Check whether a0 is negative. If it is, jump to sq_0.
- Extract the mantissa into a1. Left-shift the mantissa by 8 bits, add the hidden 1 from the floating-point representation, then right-shift it by 8 bits (C01 = srli a1, a1, 8) to restore it to its original position.
0 11010110 11010110000000000000000 -> shift left
0 11010110 00000000000000000000000 -> Set bit
1 11010110 00000000000000000000000 -> Restore
0 00000001 11010110000000000000000
- Extract the exponent into a2. If the exponent is 0, jump to sq_2 (which returns 0). If the exponent is 255 (INF), jump to sq_1 (which returns INF). The instruction for this is C02 = beq a2, a4, sq_1.
- Obtain the initial approximation (rsqrtapp) from the top bits of the mantissa (srli a3, a1, 21).
- Since the a4 value occupies 8 bits and the mantissa will be multiplied by a4 twice, the mantissa is right-shifted by 7 bits (32 - 8 - 8 - 23). This step ensures that the mantissa is correctly aligned for subsequent multiplications.
- mul a0, a0, a4 corresponds to srai a0, a0, 12; mul a0, a0, a0 is followed by srli a0, a0, 15 and srli a0, a0, 21.
D01 = Out = (¬A + ¬B)¬C
Alternative derivation:
1. ¬C.
2. (¬A + ¬B).
1 and 2 are connected in series, resulting in: ¬C(¬A + ¬B).
When these conditions are satisfied, Out will connect to High and output 1.
Conversely, Out will connect to Low and output 0.
D02 = 00101000
D03 = A
Based on the input sequence, the state transition process is as follows: ADABCDADA.
The FSM end state is A.
D04 = 3
The propagation delay is determined by the longest path's delay.
One of the longest paths is as follows:
D05 = 9
the critical path is
B0 -(1 gate)- P0
P0 -(3 gate)- Cout0
Cout0 -(2 gate)- Cout1
Cout1 -(2 gate)- Cout2
Cout2 -(1 gate)- Σ3
E01 = 42
Critical path = clk-to-q + longest CL + setup = 30ns for x_input to change (includes clk-to-q) + 9 AND + 3 setup = 42 ns
E02 = 18
Shortest CL: NOT → AND = 5 + 9 = 14 ns. clk-to-q + shortest CL = 4 ns + 14 ns = 18 ns.
E03 = 102
25 + 50 (clock period) + 4 (clk-to-q) + 14 (OR) + 9 (AND) = 102 ns
The clock period must ensure that data stabilizes and reaches the next register before the next clock edge. Additionally, we must ensure the data arrives before the setup time of the next clock edge. Thus, the clock period must be:
The maximum hold time prevents the current data from being overwritten by new data in the next clock cycle. It ensures that the current data can be correctly read. To ensure this, the time for old data to propagate through the shortest path must be greater than the clk-to-q time of the new data.
Thus,
When waiting for the next clock period, the wait time already accounts for the delay of the x_input change. The change in x_input occurs before the next clock edge (at 75 ns), specifically at 55 ns.
Thus, the actual delay affecting the output happens after the next clock edge.
Timeline:
25ns: First rising edge.
55ns: x_input changes.
75ns: Next rising edge (25 + 50).
79ns: clk-to-q delay (75 + 4).
93ns: OR gate delay (79 + 14).
102ns: AND gate delay (93 + 9).
F01 = 4
We have 9 instructions, so we need 4 bits to represent them.
F02 = 16
Recall that, in RISC-V, the immediates encoded in instructions work in units of half-instructions, so jumping up to 64 KiB in either direction is the same as saying we want to jump up to 2^15 = 32,768 half-instructions in both directions. Since immediates are signed, we need 16 bits to represent this range of values.
F01:
F02: Because the least significant bit (LSB) is implicitly 0:
Each value is effectively a multiple of 2, meaning the actual range that needs to be represented is ±32,768 (65,536 / 2).
Calculating the number of bits needed: log2(32,768) = 15, so this requires 15 bits to represent the range.
Add 1 bit for the sign bit. Thus, a total of 16 bits is needed.
A01 = blt x0, a0, recurse_case
A02 = jalr x0, ra, 0
A03 = addi sp, sp, -12
A04 = sw ra, 8(sp)
A05 = jal ra, square
A06 = lw ra, 8(sp)
A07 = addi sp, sp, 12
A08 = addi sp, sp, -4
A09 = jal ra, sum_of_squares
A10 = addi sp, sp, 4
A11 = jalr x0, ra, 0
A01: if n (a0) is less than or equal to zero, do the zero_case. Otherwise, do the recurse_case.
A02: return -> jr ra
(pseudo-instructions) = jalr x0, ra, 0
A03: Save 3 registers, so move sp down by 12 bytes.
A04: Save the return address (ra) on the stack.
A05: Call square and store the return address (PC + 4) in ra.
A06: Restore the return address from the stack into ra.
A07: Release the 12 bytes reserved for the 3 registers.
A08: Save 1 register, so move sp down by 4 bytes.
A09: Call sum_of_squares and store the return address (PC + 4) in ra.
A10: Release the 4 bytes reserved for the 1 register.
A11: jr ra
(pseudo-instructions) = jalr x0, ra, 0
From the perspective of the callee, is it necessary to save any registers in the prologue and epilogue of the function? If so, which registers need to be saved, and where should the prologue and epilogue be placed? If not, provide a brief explanation as to why. __ A12 __
A12 = (the answer must contain "No" or "Not"; wording with a similar meaning receives credit)
No, we do not need to consider callee-saved registers because we are not using any in this function. However, since the function makes two function calls, it may be necessary to save the ra register in the prologue and restore it in the epilogue, right before the jalr x0, ra, 0 instruction, just before the zero_case label.
A13 = sw ra, 0(sp)
A14 = and t3, t2, x1
x1
is ra
andi t3, t2, 1
A15 = slli t1, t1, 1
A16 = srli t2, t2, 1
A17 = bne t2, x0, square_loop
A18 = lw ra, 0(sp)
A19 = jalr x0, ra, 0
A13: return address save in ra
A14: To check whether the least significant bit (LSB) of t2
is 1, perform a bitwise AND operation between t2 and 1. The result will indicate the LSB and store in t3
.
A15: Shifting t1
left by 1 bit is equivalent to multiplying by 2
A16: Shifting t2 right by 1 bit is equivalent to dividing by 2.
A17: If t2 is not zero, do the square_loop. Otherwise, continue to the next instruction.
A18: Restore return address
A19: jr ra
(pseudo-instructions) = jalr x0, ra, 0
B01 = 14
B02 = 25
B03 = 40M
The maximum allowable hold time for RegC is determined by the time it takes for the input of RegC to change. This is calculated as the clk-to-q delay of RegA or RegB plus the shortest combinational logic delay:
The minimum clock cycle time must account for the clk-to-q delay, the longest combinational logic delay, and the setup time of RegC:
A clock cycle time of 25 ns corresponds to a clock frequency of:
B01: shortest combinational logic delay
B02: One of the critical paths
B03:
C01 = ~(k - 1) (or an equivalent form)
k & (~(k - 1)): isolates the least significant set bit of k, which is the largest power-of-2 divisor of k.
C02 = (answers with a similar meaning receive credit; they must mention "power of 2")
Since the division operation is performed by a power of 2, the compiler optimizes it into a bit shift. This results in an efficient 𝑂(𝑏) solution for finding the 𝑘-th element in a left-complete, array-based binary tree.
The purpose of this function is to find the position of the K-th smallest element in a subtree of height H, rooted at P, after sorting the elements in ascending order.
Given p = 3
, h = 2
, k = 3
, let's break it down step by step:
1. p << h = 3 << 2: calculate the starting position of the layer (the leftmost node position). Binary representation of 3: 11; left-shifting by 2 bits gives 1100 (= 12).
2. k >> 1 = 3 >> 1: locate the element's offset within the layer. Binary representation of 3: 11; right-shifting by 1 bit gives 1.
3. | merges the location information. At this point, the position of a leaf node of the subtree is identified.
4. k & ~(k-1): this step finds the position of the rightmost 1 in k.
k = 3: 0011
k - 1 = 2: 0010
~(k-1): 1101
k & ~(k-1): 0001 (= 1)
5. Divide by k & ~(k-1) to move up from the leaf-node position found in step 3 by the appropriate number of levels.
k_child(3, 2, 3) = ((3 << 2) | (3 >> 1)) / (3 & ~2) = (12 | 1) / 1 = 13
For the subtree at position 3 and height 2, the element with k = 3 is at position 13.
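A direct C transcription of the expression in the walkthrough (a sketch; the exact signature of the original k_child is assumed, not given here):

```c
#include <stdio.h>

static unsigned k_child(unsigned p, unsigned h, unsigned k)
{
    /* Position of the k-th element of the subtree of height h rooted at p,
     * in a left-complete array-based binary tree (matches the steps above). */
    return ((p << h) | (k >> 1)) / (k & ~(k - 1));
}

int main(void)
{
    printf("%u\n", k_child(3, 2, 3)); /* prints 13, as in the example */
    return 0;
}
```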
D01 =
D02 =
The immediate field for the jal instruction is 20 bits, while for the jalr instruction it is only 12 bits. This means jal can cover a wider range of addresses. Similar to the previous addressing mode, the 20-bit immediate value is multiplied by 2 and added to the program counter (PC) to determine the target address. Since the immediate value is signed, it spans a range of ±2^20 bytes, or ±2^19 2-byte instructions. However, because we are dealing with 4-byte instructions, the effective range is ±2^18 instructions relative to the current PC.
D03 = "jalr"
D04 = "lbu"
D05 = "lhu"
D06 = "slli"
D07 = "slti"
D08 = "sltiu"
D09 = "andi"
D10 = "ecall"
D11 = "csrrs"
D12 = "bltu"
D13 = "bgeu"
D14 = "lui"
D15 = "auipc"
D16 = "jal"
D17 = e.u.i31_12
D18 = e.j.i10_1 << 1
D19 = e.s.i11_5 << 5
D20 = e.b.i4_1 << 1
Please see the link for RV32 instruction set architecture:
https://cs61c.org/fa24/pdfs/resources/reference-card.pdf
https://five-embeddev.com/riscv-user-isa-manual/Priv-v1.12/instr-table.html
D03:
D04:
D05:
D06:
D07:
D08:
D09:
D10:
D11:
D12:
D13:
D14:
D15:
D16:
D17:
D18:The purpose of the left shift is to recombine the immediate bits scattered at different locations in the instruction into a complete offset value
D19:
D20:
E01 = 665
sw = clk-to-Q + Mem-Read + max(RegFileRead + Mux(ASel), Imm-Gen +
Mux(BSel)) + ALU + DMEM Setup
= 5ns + 300 ns + 60ns + 100ns + 200ns
= 665ns
E02 = 800
lw = clk-to-Q + Mem-Read + max(RegFileRead + Mux(ASel), Imm-Gen +
Mux(BSel)) + ALU + Mem-Read + Mux(WBSel) + RegFileSetup
= 5ns + 300ns + 60ns + 100ns + 300ns + 15ns + 20ns
= 800ns
E03 = 500
jal= clk-to-Q + Mem-Read + Imm-Gen + Mux(BSel) + ALU + Mux(PCSel) + PCSetup
= 5ns + 300ns + 45ns + 15ns + 100ns + 15ns + 20ns
= 500ns
E04 = PC + 4 (Note: the order of E04~E06 can be interchanged)
E05 = ALUOut
E06 = MEM
E07 = Inst
E08 = PC
E09 = ALUOut (Note: E09 and E10 can be interchanged)
E10 = RegReadData2
E11 = Inst
E01: The red line is the critical path
E02:
E03:
E04~E06:
E07:
INST represents the instruction itself.
The control logic needs to determine the next operation based on the content of the instruction.
Necessity of Passing INST in a Pipeline:
E08~E10:
E11: Same as E07.
F01 = 1
F02 = 2
F03 = 1
F04 = MEM
F05 = EX
F06 = WB
F07 = EX
F01~F03: addi rd
= and rs1
= sltiu rs1
F04~F07: forwarding path
F08 = 2
F09 = 3
F10 = 3
F11 = 4
F12 = 2
F13 = 3
F14 = 3
F15 = 4
F16 = 4
F17 = 3
F18 = 3
F19 = 4
F20 = 2-3-1-4
F08~F09: addi rd
= lw rs1
addi t0
, t0, 4, lw t1, 0(t0
)
F10~F11: lw rd
= add rs1
lw t1
, 0(t0), add t2, t1
, x0
F12~F13: This hazard can be resolved by forwarding from MEM to EX.
F14~F19: Between instructions 3 and 4 there is a load-use data hazard. Since the lw instruction retrieves its data in the MEM stage, but the subsequent add instruction requires that data in the EX stage, the pipeline must stall for 1 cycle. In the next cycle, the data becomes available in the WB stage and can then be forwarded to the EX stage of the add instruction. This ensures correct execution while addressing the data hazard.
F20:
The original instruction order is 1-2-3-NOP-4
. Since instruction 1 has no dependencies, it can be inserted between instruction 3 and instruction 4, replacing the NOP.
The new instruction order becomes 2-3-1-4
.
F21 = 4
There are four hazards: a data hazard from t1 between instructions 1 and 2, a data hazard from s0 between instructions 2 and 3, another data hazard from s0 between instructions 2 and 4, and a control hazard between instructions 4 and 5.
F22 = 2
Assuming the RegFile allows simultaneous read and write operations in the same cycle, two stalls are needed between instructions 1 and 2 (WB → ID), and another two stalls are needed between instructions 2 and 3 (WB → ID).
G01 = reg.io.q
G02 = Mux(count === 9.U, 0.U, result)
G03 = reg.io.d := next
G04 = val y = Output(UInt (16.W))
G05 = io.y := io.a & io.b
G01:val count = reg.io.q
?
G02:val next = Mux(count === 9.U, 0.U, result)
?
G05:is (3.U) { io.y := io.a & io.b }
?
G01: Connects the output of the register to the corresponding line for count.
G02: When count
equals 9, select 0 as the output.
Otherwise, select result
(the output of the adder) as the output.
This implements the functionality of resetting the counter to 0 after reaching 9.
G03: Connects the output of the Mux(next
) to the input of the register.
G04: The form of declaring the output y.
G05: Implement the remaining functionality and(&).
H01 = srli t2, a0, 31
H02 = srli t2, a1, 31
H03 = bgeu a2, t0, overflow_xe_gte_ye
H04 = srli a4, a4, 9
H05 = slli t2, t2, 31
H06 = 0x7F800000
H07 = slli t2, t2, 31
H01, H02: The sign bit is stored in the most significant bit, so performing a logical right shift of 31 bits will retrieve the sign bit.
H03: if a2 >= t0 (255), jump to overflow_xe_gte_ye.
H04: In the previous operation, the significand was left-shifted to the highest bit. After completing the computation, it needs to be right-shifted by 9 bits to align properly.
H05: The sign is in the least significant bit. Left-shifting it by 31 bits moves it back to its correct position.
H06:
H07: Same as H05
H08 = sll a1, a1, t1
H09 = slli a1, a1, 10
H10 = slli a0, a0, 23
H08: t1 = -A0 = -(leading zeros)
A1 << (32 - leading_zeros)
is equivalent to A1 << (-leading_zeros)
.
Characteristics of shifting by a "negative" amount: the RV32 sll instruction uses only the low 5 bits of the shift-amount register, so a shift amount of -n behaves like a shift by 32 + (-n). Thus, shifting left by -leading_zeros is the same as shifting by (32 - leading_zeros).
H09: The previous instruction shifts right by 9 bits.
Adding the step to remove the hidden '1' results in a total left shift of 10 bits.
H10: Left-shift the exponent by 23 bits.
This aligns it to the position where the exponent resides in the IEEE 754 format.
A01
40%
Since the L1 cache has a local miss rate of 50%, the L2 cache is accessed 50 times out of the 100 total memory accesses. Given that the L2 cache experienced 20 misses, the local miss rate of the L2 cache is 20 / 50 = 40%.
A02
25%
The number of accesses to the L2 cache equals the number of misses in the L1 cache. To find the local miss rate of the L2 cache, we divide its global miss rate by the miss rate of the L1 cache:
Substituting the values:
A03
30
Let ( H ) represent the hit time of the L3 cache. By extending the AMAT equation, where the miss penalty of the L2 cache becomes the local AMAT of the L3 cache, we can express the system's AMAT as:
To ensure the AMAT does not exceed 8 cycles, we solve the inequality:
Simplifying this yields:
Thus, the maximum hit time for the L3 cache is 30 cycles.
A04~A06
6, 4, 10
The breakdown of the Tag:Index:Offset (T/I/O) is calculated as follows:
A07
50%
Since consecutive integer accesses are spaced such that each cache block holds two of them, the first access to each block results in a compulsory miss, while the second access hits because both A[i] and A[i + 128] reside in the same cache block. This results in a cache hit rate of 50%.
A08
50%
The size of array A is exactly twice the size of the cache. After executing Line 7, the second half of A resides in the cache, but Line 11 starts processing from the first half of A, meaning none of the cached data from Line 7 can be reused. Since memory accesses in Line 11 follow the same pattern as Line 7, the cache hit rate remains the same at 50%. The variable total is likely kept in a register by the compiler, so it does not impact the cache hit calculation.
A09 = ? A10 = ? A11 = ? A12 = ?
1100, 1, 000, Miss or M
A13 = ? A14 = ? A15 = ? A16 = ?
1101, 1, 101, Replace or R
A17 = ? A18 = ? A19 = ? A20 = ?
0011, 0, 000, Replace or R
A21
27.3%, or 3/11
There are 3 hits out of 11 total accesses, resulting in a hit rate of approximately 27.3%.
B01
4
There are four hazards present:
- A data hazard between instructions 1 and 2 due to t1.
- A data hazard between instructions 2 and 3 caused by s0.
- A data hazard between instructions 2 and 4 also involving s0.
- A control hazard between instructions 4 and 5 caused by the branch instruction.
B02~B11
B02 = ?
2
B03 = ?
1
B04 = ?
2
B05 = ?
2
B06 = ?
2
B07 = ?
3
B08 = ?
2
B09 = ?
4
B10 = ?
no (or zero)
B11 = ?
3
| Instruction | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
|---|---|---|---|---|---|---|---|---|---|
| 1.sub t1, s0, s1 | IF | ID | EX | MEM | WB | | | | |
| 2.or -> nop | | IF | ID | | | | | | |
| 2.or s0, t0, t1 | | | IF | ID | EX | MEM | WB | | |
| 3.sw s1, 100(s0) | | | | IF | ID | EX | MEM | WB | |
| 4.bgeu s0, s2, loop | | | | | IF | ID | EX | MEM | WB |
| 5.add t2, x0, x0 | | | | | | IF | ID | EX | MEM |
C01: Using Branch Comp to compare the rs1
and rs2
of the beq
instruction. If they are equal, Branch Comp sets BrEq
to 1; otherwise, it sets to 0.
C02: The jal
instruction does not compare rs1 and rs2; therefore, BrLt
is a "*(don't care)" value.
C03: Using Branch Comp to compare the rs1
and rs2
of the beq
instruction. If rs1
less than rs2
, Branch Comp sets BrLt
to 1; otherwise, it sets to 0.
C04~C08: Only branch or jump instructions cause a jump.
- Ordinary (non-branch, non-jump) instructions set PCSel to 0.
- jal sets PCSel to 1.
- Branch instructions set PCSel based on BrEq and BrLt, setting it to 0 or 1 accordingly.
C09~C10: BrUn means Branch Unsigned: 1 selects an unsigned comparison, and 0 a signed comparison.
- For the beq instruction, the result is the same regardless of signed or unsigned comparison, so BrUn is "*(don't care)".
- The blt instruction requires a signed comparison, so BrUn should be set to 0.
- The bltu instruction requires an unsigned comparison, so BrUn should be set to 1.
C11~C15:
- lw (load word): loads data from memory into a register, so set RegWEn to 1.
- sw (store word): stores data from a register into memory but does not write to a register, so set RegWEn to 0.
- Branch instructions do not write to a register, so set RegWEn to 0.
- jal (Jump and Link): the instruction format is jal rd, label, where it performs rd = PC + 4 and PC = PC + imm. Since it writes to the register rd and updates the PC, set RegWEn to 1.
C16: WBSel (Write Back Select) determines the source of the data to be written into the register:
- For the jal instruction, which writes PC + 4 into rd, WBSel is set to 2.
sll s3, s3, s2
D02 = ?
and s3, a1, s3
D03 = ?
sll s5, s5, s4
D04 = ?
or s1, s1, s5
D05 = ?
addi s0, s0, 1
================
D06 = ?
sub t1, t1, t0
E01 = ? E02 = ? E03 = ?
0x13, 0x17, 1
E04 = ? E05 = ? E06 = ?
0x10, 0x13, 1
E07 = ? E08 = ? E09 = ?
0x20, 0x12, 1
E10 = ? E11 = ? E12 = ?
0x23, 0x18, 1
E13 = ? E14 = ? E15 = ?
0x11, 0x14, 0
E16 = ? E17 = ? E18 = ?
0xac, 0x15, 1
E19 = ? E20 = ? E21 = ?
0x34, 0x19, 1
- Access 0x11F0 (Read): Hit. Physical Address: 0x14F0; Updated LRUs: 1, 7, 2, 5, 7, 0, 3, 4. Corresponding Virtual Page: 0x11
- Access 0x1301 (Write): Miss. Map VPN 0x13 to PPN 0x17, set valid and dirty bits. Physical Address: 0x1701; Updated LRUs: 2, 0, 3, 6, 7, 1, 4, 5
- Access 0x20AE (Write): Hit, set dirty bit. Physical Address: 0x12AE; Updated LRUs: 3, 1, 4, 0, 7, 2, 5, 6
- Access 0x2332 (Write): Miss. Map VPN 0x23 to PPN 0x18, set valid and dirty bits. Physical Address: 0x1832; Updated LRUs: 4, 2, 5, 1, 0, 3, 6, 7
- Access 0x20FF (Read): Hit. Physical Address: 0x12FF; Updated LRUs: 4, 2, 5, 0, 1, 3, 6, 7
- Access 0x3415 (Write): Miss (replacement needed). Replace the last entry, map VPN 0x34 to PPN 0x19, set dirty bit. Physical Address: 0x1915; Updated LRUs: 5, 3, 6, 1, 2, 4, 7, 0
From the statement:
16-bit addresses: This indicates the total virtual address space is 16 bits.
256-byte pages: Since a page size is 256 bytes, the Page Offset requires log2(256) = 8 bits.
The remaining 16 - 8 = 8 bits correspond to the Virtual Page Number (VPN).
F01 = ?
4
With a memory size of 4KiB divided into 64B chunks, there are 64 total chunks. This requires 4 exponent bits to cover values from 0 to 63. The highest exponent value before applying the standard bias (excluding infinity/NaN values) is 0b1110 = 14.
Applying the standard bias of -7 results in a maximum exponent value of 7, sufficient to cover values up to 63.
Conversely, with only 3 exponent bits, the maximum exponent value after applying a bias of -3 is 3, insufficient to represent the value 63.
F02 = ?
11
The largest memory address to represent is, corresponding to . To encode this value, we need 11 mantissa bits, resulting in the representation $0b1.11111111111 \times 2^5$. Thus, 11 bits of mantissa are necessary.
F01: Focusing only on the EXPONENT part
4 bits:
The maximum value that can be represented is 0b1110
(since 0b1111
is reserved for INF).
Subtracting the bias (0b0111), the result is 14 - 7 = 7.
3 bits:
The maximum value that can be represented is 0b110
.
Subtracting the bias (0b011), the result is 6 - 3 = 3.
G01 = ?
maskeq_epi32(tmp, curr)
G02 = ?
firstv_epi32(tmp)
G03 = ?
arr[i] > running_max (or equivalent)
In the first for loop, the steps are as follows:
1. Load four consecutive elements of the array into tmp.
2. The largest value of tmp is loaded into curr.
3. Compare curr with running_max: only if curr is greater than running_max, proceed to the subsequent steps.
4. Build a mask of the lanes of tmp that equal curr: maskeq_epi32([2, 4, 8, 6], 8) → [0, 0, 0xFFFFFFFF, 0].
5. Use index = i + firstv_epi32(tmp) to determine the position of the first 0xFFFFFFFF lane in the mask. Since 0xFFFFFFFF has its last bit set to 1, this gives the correct index.
6. Update running_max and compare the remaining elements: in the second loop, the remaining numbers (those fewer than 4 elements) are handled using a normal element-by-element comparison.
Reason for This Design:
SIMD (Single Instruction, Multiple Data) operations work on fixed-size groups of elements, which in this case is 4. If the total number of elements is not divisible by 4, the leftover elements cannot be processed using SIMD instructions.
H01 = ?
4
There are 9 instructions in total, so we need 4 bits to represent them, as ⌈log2(9)⌉ = 4.
H02 = ?
16
Jumping 64 KiB in either direction equals 2^16 bytes. Since immediate values are counted in half-instructions, we need 15 bits for the offset magnitude, covering up to 2^15 = 32,768 half-instructions. An extra bit is required for the sign, making the immediate field 16 bits.
H03 = ?
128
Total bits used by immediates and the opcode:
Remaining bits for registers:
There are 5 registers (rs1, rs2, rs3, rd1, rd2). Dividing the remaining bits equally among the registers:
This allows 128 registers.
I01 = ?
4096 (Compulsory Misses for Loop A)
I02 = ?
0 (No Conflict or Capacity Misses for Loop A)
I03 = ?
4096 (Total Misses for Loop A)
I04 = ?
4096 (Compulsory Misses for Loop B)
I05 = ?
28672 (Conflict and Capacity Misses for Loop B)
I06 = ?
32768 (Total Misses for Loop B)
For Loop A (i,j):
Compulsory misses (I01): Each cache line is loaded once on its first access -> 4096 misses
Conflict/Capacity misses (I02): 0
Total (I03) = 4096 + 0 = 4096
For Loop B (j,i):
Compulsory misses (I04): Same as Loop A, I04 = 4096
Conflict/Capacity misses (I05): 28672
Total (I06) = 4096 + 28672 = 32768
I07 = ?
4096 (L2 Cache Misses for Loop A)
I08 = ?
4096 (L2 Cache Misses for Loop B)
L2 cache:
I07:
Compulsory misses:
Each cache line is loaded once for first access -> 4096miss
Conflict and Capacity Misses: good spatial locality -> No additional misses
Total = 4096 miss
I08:
Compulsory misses: 4096 miss
Conflict and Capacity Misses:
8-way set associative with FIFO can hold all rows -> No additional misses
Total = 4096 miss
I09 = ?
11.75 cycles (AMAT for Loop A)
I10 = ?
22.25 cycles (AMAT for Loop B)
AMAT =
I09:
I10:
I11 = ?
Yes (Virtual Address Aliasing Possible)
I12 = ?
8 (Minimum Set-Associativity Required)
Virtual address aliasing problem:
When the cache size exceeds(>) the page size, different virtual addresses may map to the same physical address, causing an aliasing problem.
Here, 32 KiB > 4 KiB, so aliasing issues may occur.
To avoid the aliasing problem, a sufficient set associativity is required.
The set associativity must be at least equal to the cache size divided by the page size to ensure that the same physical address does not cause conflicts.
32 KiB / 4 KiB = 8, meaning the cache needs to be at least 8-way set associative to prevent aliasing.