Try   HackMD

Quiz4 of Computer Architecture (2024 Fall)

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
General Information

  • You are allowed to read lecture materials.
    • That is, an open book exam.
  • We are using the honor system during this quiz, and would like you to accept the following:
    1. You will not share the quiz with anyone.
    2. You will not discuss the material on the quiz with anyone until after solutions are released.
  • Each answer has 2 points.
  • You must answer in valid numeric representation and/or English alphabet except your formal name.
  • Always provide your answer in the shortest form. For example, use ptr->member instead of (*ptr).member.
  • Assume that each C program already includes the headers <stdint.h>, <stddef.h>, <stdlib.h>, <stdio.h>, and <string.h>.
  • The C standard referenced in this quiz is C99, officially known as ISO/IEC 9899:2018.
  • Bonus Points: After the class resumes at 10:35 AM, if you voluntarily participate in the class discussions, please email the instructor afterward, including your responses to the instructor's questions and any follow-up contributions. You will receive an additional 15 points for this quiz.
  • Message the instructor via Facebook if you found something wrong.
  • Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More →
    09:10 ~ 10:20AM on Nov 19, 2024

Problem A

Implement a RISC-V function called sum_of_squares that, given an integer n, calculates the following sum:

n2+(n1)2+(n2)2++12

If

n is not a positive integer (n ≤ 0), the function should return the value of
m
.

This task requires a recursive algorithm that uses the a1 register to keep track of the accumulated sum. You will implement the following function:

  • Function: sum_of_squares
  • Input:
  • a0: A 32-bit integer representing n. You can assume that n ≤ 10000.
  • a1: A 32-bit integer representing m.
  • Output:
  • The function should return in a0 the result of:
    m+n2+(n1)2+(n2)2++12
  • If n ≤ 0, return the value of m.

You have access to a helper function square that computes the square of an integer.

  • Function: square
  • Input:
  • a0: An integer n.
  • Output:
  • a0 will hold the value of
    n2
    .

To implement the recursive logic, consider updating the value of m at each step. Specifically, let:

m=m+n2

Then, the original sum:

m+n2+(n1)2++12

can be rewritten as:

m+(n1)2++12

This transformation allows you to reduce the problem size by decrementing

n and updating
m
with the new accumulated sum (
m
). Use this approach to implement the recursive function and follow the RISC-V calling convention.

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

You are required to complete the RISC-V assembly code above, ensuring that no pseudo-instructions are used.

Fill in A01, A02, , A11.

From the perspective of the callee, is it necessary to save any registers in the prologue and epilogue of the function? If so, which registers need to be saved, and where should the prologue and epilogue be placed? If not, provide a brief explanation as to why. __ A12 __

Answer A12 with appropriate explanation.

Next, we will implement the square function, which calculates the square of an integer (a0 = n) and returns the result in a0. This function adheres to the RISC-V calling convention and avoids using RV32M instructions, ensuring compatibility with the RV32I base ISA.

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

You are required to complete the RISC-V assembly code above, ensuring that no pseudo-instructions are used.

Fill in A13, A02, , A19.


Problem B

Similar to logic gates, registers also experience a delay before their output reflects the sampled input. This delay is called the clock-to-Q (clk-to-q) delay, where "Q" typically denotes the output. The clk-to-q delay is the time between the rising edge of the clock signal and when the register's output updates to match the input.

In the circuit below:

  • Registers RegA and RegB have setup, hold, and clk-to-q times of 4 ns each.
  • Each logic gate has a delay of 5 ns.
  • Register RegC has a setup time of 6 ns.

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

Determine:

  1. What is the maximum allowable hold time for RegC? __ B01 __ ns
  2. What is the minimum acceptable clock cycle time for this circuit? __ B02 __ ns
  3. What clock frequency does this correspond to? __ B03 __ HZ

Answer B01, B02, and B03


Problem C

Consider a left-complete binary search tree represented as an array. For example:

nodes = [8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15]

The corresponding binary tree structure is:

                8  (index 0)
        4  (index 1)      12 (index 2)
    2 (index 3)   6 (index 4)   10 (index 5)   14 (index 6)
  1 (index 7) 3 (index 8) 5 (index 9) 7 (index 10) 9 (index 11) 11 (index 12) 13 (index 13) 15 (index 14)

To find the position of elements in sorted order, we perform an in-order traversal (left-root-right) on this tree. The sorted order is:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

Let's go through two specific examples:

  1. 3rd Element in Sorted Order:
    • The 3rd element in the sorted order is 3.
    • In the array representation (nodes), 3 is at index 8 (0-based).
    • In 1-based indexing, this corresponds to position 9.
  2. 7th Element in Sorted Order:
    • The 7th element in the sorted order is 7.
    • In the array representation (nodes), 7 is at index 10 (0-based).
    • In 1-based indexing, this corresponds to position 11.

This example demonstrates how elements in a left-complete, array-based binary search tree can be efficiently accessed and positioned based on their sorted order.

Sorted Order Element 0-Based Index 1-Based Position
3rd 3 8 9
7th 7 10 11

Now let's generalize the problem. Suppose we want to find the position of the

k-th element in sorted order for a subtree rooted at position
p
with a given height
h
.

Define the function:

k_child(p, h, k)
  • p
    : Position of the root of the subtree (1-based indexing).
  • h
    : Height of the subtree (a tree with a single node has height 0).
  • k
    : The rank of the desired element in sorted order.

For example:

  • For the subtree rooted at position
    p=3
    (element 12) with height
    h=2
    , the 3rd element in sorted order is at position 13 (element 11).
  • For
    p=3
    ,
    h=1
    , and
    k=3
    , the 3rd element is at position 7 (element 7).

The typical method involves traversing the tree using a loop, utilizing the fact that the left and right child positions can be calculated as

2×p and
2×p+1
, respectively. This approach has a time complexity of
O(b2)
, where
b
is the number of bits in the integer representation (commonly 32 bits).

We can achieve a more efficient solution with a time complexity of

O(b) using a direct bitwise calculation:

#define k_child(p, h, k) \
    (((p << h) | (k >> 1)) / (k & ( /* C01, Your answer here */ )))

C01 is a valid C expression; write it as concisely as possible.

  • ph
    : Shifts the position
    p
    left by
    h
    bits, effectively scaling the root position to the height of the subtree.
  • k1
    : Shifts
    k
    right by 1 bit, halving it and adjusting for 0-based indexing.

The proposed approach avoids iterative traversal, greatly speeding up the computation, and you should complete the above. Explain whether the division (/) operation could become a performance bottleneck in this

O(b) solution. __ C02 __

Answer C02 with appropriate explanation.


Problem D

We have several addressing modes for accessing memory (excluding the immediate mode):

  1. Base Displacement Addressing: Adds an immediate value to a register's value to form a memory address (used by instructions like lw, lb, sw, and sb).
  2. PC-Relative Addressing: Uses the current value of the program counter (PC) and adds the immediate value (multiplied by 2) from the instruction to form a target address (used by branch and jump instructions).
  3. Register Addressing: Uses the value in a register as the target address (e.g., jalr, jr, and ret, where jr and ret are pseudoinstructions translated into jalr).

What is the maximum range of 32-bit instructions that can be reached from the current PC using a jump instruction?
The range covers addresses within

[D01,D02] instructions from the current PC.
Answer D01 and D02.

Assume we are implementing a minimal RISC-V RV32I disassembler, which operates as follows:

Disassembling raw binary: jalr.bin
   0:  00000097          auipc     x1, 0x0    
   4:  00808093          addi      x1, x1, 8
   8:  06428293          addi      x5, x5, 100
   c:  00008067          jalr      x0, x1, 0

You need to complete the program by filling in the specified placeholders.

1
2

3

4

5

6

7

8

D03, D04, , D20 are valid C expressions; write them as concisely as possible.

Answer D03, D04, , D20.


Problem E

The critical path is the longest delay path between state elements in a circuit. The circuit’s clock speed cannot exceed this limit, as a faster clock would not allow sufficient time for the correct value to propagate to the state elements. By placing additional registers along the critical path, we can reduce the clock period by decreasing the amount of logic between registers.

The delay times for each circuit element are as follows:

latency

CPU

For a single-cycle CPU, how long does it take to execute each instruction? Disregard the clock cycle duration based on the critical path, and assume that the setup times for both the register file (RegFile) and the program counter (PC) are identical.

  1. sw : __ E01 __ ns
  2. lw : __ E02 __ ns
  3. jal : __ E03 __ ns

Answer E01, E02, and E03.

modified

The top diagram shows the standard single-cycle datapath, while the second diagram illustrates a modified version. Referring to the modified single-cycle datapath, consider the information that must be carried forward from one stage to the next. Which pipeline registers are necessary at the end of each stage?

  • MEM to WB:
  • __ E04 __ : input into WBSel mux.
  • __ E05 __ : input into WBSel mux.
  • __ E06 __ : input into WBSel mux.
  • __ E07 __ : input into next stage’s control logic.
  • EX to MEM:
  • __ E08 __ : input into the +4 block in the MEM stage.
  • __ E09 __ : is an input into DMEM.
  • __ E10 __ : is an input into DMEM,
  • __ E11 __ : input into next stage’s control logic.

Answer E04, E05, E06, , E11.


Problem F

For all questions, assume there is no branch prediction or double-pumping (i.e., no write-then-read in a single cycle for the RegFile). Most data hazards can be resolved through forwarding, where the result from the EX or MEM stage is sent directly to the EX stage for use by a subsequent instruction.

Note: How is forwarding (EX to EX or MEM to EX) implemented in hardware? We add two wires: one from the start of the MEM stage carrying the ALU output, and another from the start of the WB stage. Both wires connect to the A/B multiplexers in the EX stage.

Identify data hazards in the code below and determine how forwarding could be used to resolve them.

hazard

There are two data hazards: one between instructions __ F01 __ and __ F02 __ , and another between instructions __ F03 __ and 3. The first hazard can be resolved by forwarding the ALU output from the __ F04 __ stage in cycle C3 to the start of the __ F05 __ stage in cycle C4. The second hazard can be resolved by forwarding the ALU output from the __ F06 __ stage in cycle C4 to the start of the __ F07 __ stage in cycle C5.

Answer F01, F02, and F03; all are numbers.
Answer F04 F07; all are names of stages.

There are two data hazards present in the code:

  1. The first hazard is between instructions __ F08 __ and __ F09 __ , involving the register t0.
  2. The second hazard is between instructions __ F10 __ and __ F11 __ , involving the register t1.

The hazard between instructions __ F12 __ and __ F13 __ can be resolved using forwarding. However, the hazard between instructions __ F14 __ and __ F15 __ cannot be resolved with forwarding. This is because instruction __ F16 __ requires the result of instruction __ F17 __ at the beginning of cycle C6, but the result is not available until the end of cycle C6.

To resolve this, we can introduce a stall by inserting a nop (no-operation) instruction between instructions __ F18 __ and __ F19 __ .

Answer F08, F09, F19; all are numbers.

Imagine you are the compiler and can reorder instructions to minimize data hazards while still producing the same output. How would you modify the code? Reorder the instructions as __ F20 __ , since instruction 1 has no dependencies and can be moved without affecting the result.

Answer F20; It is the sequence of numbers.

For the RISC-V code provided below and a pipelined CPU without forwarding, how many data hazards would occur? __ F21 __

How many stalls would be required to resolve the data hazard(s) if the RegFile supports double-pumping (i.e., write-then-read capability)? __ F22 __

Answer F21 and F22; Both are numbers.


Problem G

In Chisel, hardware components are referred to as modules. Each module extends the Module class and includes a field named io for defining its interface. The interface is specified using a Bundle, which is wrapped with a call to IO(). The Bundle contains fields representing the input and output ports of the module. The direction of these ports is specified using Input() or Output(), based on the perspective of the component itself.

We will demonstrate this with a simple design where a counter is constructed using two components: an adder and a register. Next, we create a counter using these two components that counts from 0 to 9 and then resets. The schematic of the counter is depicted below.

chisel

An adder is used to increment the current counter value by 1. A multiplexer selects between the incremented value and 0. The result of this selection, referred to as next, is used as the input to the register component. The output of the register represents the current count value and also serves as the output of the Count10 module (dout).

The following shows the Chisel code for the Count10 module. The two components are instantiated using new, wrapped with Module(), and assigned the names add and reg. The register's output (reg.io.q) is given the name count.

Chisel

We connect 1.U and count to the inputs of the adder component. The adder's output is named result. The multiplexer selects between 0.U and result based on the current counter value (count). The output of the multiplexer, named next, is connected to the input of the register component. Finally, we connect the counter value (count) to the output of the Count10 module (io.dout).

Complete the above to make the design work.

Answer G01, G02, and G03.

A key component in circuits that perform computation, such as a microprocessor, is the Arithmetic Logic Unit (ALU). The figure below depicts the symbol of an ALU.

ALU

The ALU has two data inputs, labeled a and b, one function input labeled fn, and an output labeled y. The ALU processes inputs a and b and produces the result at the output y. The fn input determines which operation is applied to a and b. Typical operations include arithmetic functions like addition and subtraction, as well as logical functions like AND, OR, and XOR, hence the name ALU.

The ALU is typically a combinational circuit with no state elements. It may also include additional outputs to indicate specific properties of the result, such as whether the result is zero or its sign.

The following code defines an ALU with 16-bit inputs and outputs, supporting addition, subtraction, OR, and AND operations, selected by a 2-bit fn signal.

Chisel

Answer G04 and G05.


Problem H

Examine the following RISC-V assembly code for the floating-point addition function (fadd), designed for single-precision floating-point addition using custom bitwise operations. Your task is to complete the code so that it functions as intended. This version includes simplified rounding logic, which may need enhancements for full IEEE 754 compliance.

fadd:
    # Unpack exponent of X
    srl     a2, a0, 23          # A2 = A0 >> 23
    andi    a2, a2, 0xFF        # A2 = A2 & 0xFF (Exponent of X)

    # Unpack exponent of Y
    srl     a3, a1, 23          # A3 = A1 >> 23
    andi    a3, a3, 0xFF        # A3 = A3 & 0xFF (Exponent of Y)

    # Flush-to-zero: if X exponent is zero, return Y
    beqz    a2, __addsf_return_y_flushed

    # If Y exponent is zero, return X
    beqz    a3, __addsf_return_x

    # Unpack significands and prepare for alignment
    slli    a4, a0, 9           # A4 = A0 << 9
    slli    a5, a1, 9           # A5 = A1 << 9

    # Check for NaN or Infinity
    li      t0, 255
    beq     a2, t0, __addsf_x_nan_inf
    beq     a3, t0, __addsf_y_nan_inf

    # Finish unpacking significands
    srli    a4, a4, 6           # A4 = A4 >> 6
    srli    a5, a5, 6           # A5 = A5 >> 6

    # Restore the implicit '1' bit
    li      t1, 1 << (23 + 3)   # t1 = 1 << 26
    or      a4, a4, t1          # A4 |= t1
    or      a5, a5, t1          # A5 |= t1

    # Negate significand of X if sign bit is set
    __ H01 __
    andi    t2, t2, 1
    beq     t2, x0, skip_neg_x
    neg     a4, a4              # A4 = -A4
skip_neg_x:
    # Store 25 in t1
    li      t1, 25

    # Negate significand of Y if sign bit is set
    __ H02 __
    andi    t2, t2, 1
    beq     t2, x0, skip_neg_y
    neg     a5, a5              # A5 = -A5
skip_neg_y:
    # Compare exponents
    bltu    a2, a3, __addsf_ye_gt_xe  # If A2 < A3, swap operands

    # Proceed with addition where exponent of X >= exponent of Y
__addsf_xe_gte_ye:
    # Compute exponent difference
    sub     t3, a2, a3          # t3 = exp_lhs - exp_rhs

    # Check if exponent difference is too large
    bgeu    t3, t1, __addsf_return_x  # If difference >= 25, return X

    # Shift significand of Y (rhs) right by exponent difference
    sra     a5, a5, t3          # A5 = A5 >> t3

    # Set sticky bit if bits were shifted out
    sll     t4, a5, t3          # t4 = A5 << t3
    sltu    t4, t4, a5          # t4 = (t4 < A5) ? 1 : 0
    or      a5, a5, t4          # A5 |= t4

    # Add significands
    add     a4, a4, a5          # A4 = A4 + A5

    # Check for zero result
    beqz    a4, __addsf_return_0

    # Convert from two's complement to sign-magnitude
    srai    t2, a4, 31          # t2 = Sign of result
    xor     a4, a4, t2          # If negative, invert bits
    sub     a4, a4, t2          # If negative, add 1

    # Normalize the result
    li      t5, 0               # t5 will count leading zeros
norm_loop_xe_gte_ye:
    sll     t4, a4, 1           # t4 = A4 << 1
    bltz    t4, norm_done_xe_gte_ye  # If MSB is 1, done
    addi    t5, t5, 1           # Increment leading zero count
    sll     a4, a4, 1           # Shift A4 left
    j       norm_loop_xe_gte_ye
norm_done_xe_gte_ye:
    sub     a2, a2, t5          # Adjust exponent

    # Check for underflow or overflow
    __ H03 __.                          # If exponent >= 255, overflow
    bltz    a2, underflow_xe_gte_ye     # If exponent < 0, underflow

    # Pack the result
    slli    a2, a2, 23          # Shift exponent to position
    __ H04 __                   # Align significand
    or      a0, a2, a4          # Combine exponent and significand

    # Restore sign
    __ H05 __                   # Shift sign bit back
    or      a0, a0, t2          # Set sign bit

    # Return
    jalr    x0, ra, 0

# Handle underflow
underflow_xe_gte_ye:
    add     a0, x0, x0          # Return zero
    jalr    x0, ra, 0

# Handle overflow
overflow_xe_gte_ye:
    li      a0, __ H06 __       # Set to Infinity
    __ H07 __                   # Shift sign bit back
    or      a0, a0, t2          # Set sign bit
    jalr    x0, ra, 0

# Case where exponent of Y > exponent of X
__addsf_ye_gt_xe:
    # Swap operands and repeat the process
    mv      t6, a0              # Temporarily store A0
    mv      a0, a1              # A0 = A1
    mv      a1, t6              # A1 = original A0
    mv      t6, a2              # Swap exponents
    mv      a2, a3
    mv      a3, t6
    mv      t6, a4              # Swap significands
    mv      a4, a5
    mv      a5, t6
    j       __addsf_xe_gte_ye   # Jump to the same handling

# Handle NaN or Infinity for X
__addsf_x_nan_inf:
    add     a0, a0, x0          # Return X
    jalr    x0, ra, 0

# Handle NaN or Infinity for Y
__addsf_y_nan_inf:
    add     a0, a1, x0          # Return Y
    jalr    x0, ra, 0

# Return Y if X is zero
__addsf_return_y_flushed:
    add     a0, a1, x0          # A0 = Y
    jalr    x0, ra, 0

# Return X if Y is zero
__addsf_return_x:
    jalr    x0, ra, 0

# Return zero result
__addsf_return_0:
    add     a0, x0, x0          # A0 = 0
    jalr    x0, ra, 0

Answer H01, H02, H07.

Consider the following implementation of converting a signed 32-bit integer to a single-precision floating-point number (IEEE 754 format) using the RISC-V RV32I assembly. The focus is on understanding the low-level operations required to perform this conversion without relying on pseudo-instructions or extended instruction sets.

# Convert signed int to float
# float i2f(s32 num);
# INPUT:  A0 = integer number
# OUTPUT: A0 = float number (IEEE 754 single-precision)
i2f:
    # Prepare result sign in A1
    srli    a1, a0, 31          # A1 = (A0 >> 31), extract sign bit
    slli    a1, a1, 31          # A1 = A1 << 31, position sign bit at bit 31

    # Get absolute value of the number
    beq     a1, x0, positive    # If sign bit is zero, number is positive
    sub     a0, x0, a0          # A0 = -A0 (negate to get absolute value)

positive:
    # Check if number is zero
    beq     a0, x0, zero_result # If A0 == 0, result is zero

    # Store sign and absolute value
    add     a2, x0, a1          # A2 = A1 (store sign)
    add     a1, x0, a0          # A1 = A0 (store absolute value)

    # Count leading zeros in A0 (result stored in T0)
    # Initialize T0 (leading zero count) to 0
    add     t0, x0, x0          # T0 = 0

clz_loop:
    sll     t1, a1, 1           # Shift A1 left by 1 bit
    addi    t0, t0, 1           # T0 = T0 + 1 (increment count)
    mv      a1, t1              # Update A1 with shifted value
    bgez    a1, clz_loop        # Loop while A1 >= 0 (MSB is 0)

    # Adjust for possible overflow in shift
    li      t1, 32              # Load 32 into t1
    blt     t0, t1, clz_done    # If t0 < 32, proceed to clz_done
    li      a0, 32              # If t0 >= 32, leading zeros = 32
    j       normalize

clz_done:
    # A0 now contains the number of leading zeros (stored in T0)
    add     a0, x0, t0          # Move T0 to A0

normalize:
    # Normalize the number (shift left by leading zeros)
    sub     t1, x0, a0          # t1 = -A0
    __ H08 __                   # A1 = A1 << (32 - leading zeros)

    # Prepare result exponent
    # Exponent = 158 - leading_zeros
    li      t1, 158
    sub     a0, t1, a0          # A0 = 158 - leading zeros

    # Rounding: Add 0x80 to A1 (rounding bit at position 7)
    addi    a1, a1, 128         # A1 = A1 + 0x80

    # Check for overflow after rounding
    bgez    a1, rounding_ok     # If A1 >= 0, no overflow
    addi    a0, a0, 1           # Exponent increment due to rounding overflow
    j       adjust_mantissa

rounding_ok:
    # Check if lowest 8 bits are zero (for rounding to even)
    andi    a3, a1, 0xFF        # A3 = A1 & 0xFF (lowest 8 bits)
    beq     a3, x0, round_even  # If lowest 8 bits are zero, round to even

adjust_mantissa:
    # Remove leading hidden bit '1'
    slli    a1, a1, 1           # A1 = A1 << 1 (remove hidden '1')
    j       compose_result

round_even:
    # Ensure even mantissa (clear least significant bit)
    srli    a1, a1, 9           # A1 = A1 >> 9
    __ H09 __                   # remove hidden '1'
    j       compose_result

compose_result:
    # Align mantissa and exponent
    srli    a1, a1, 9           # align mantissa to bits 0..22
    __ H10 __                   # align exponent to bits 23..30
    or      a0, a0, a1          # A0 = exponent | mantissa
    or      a0, a0, a2          # A0 = A0 | sign bit (bit 31)
    jalr    x0, ra, 0           # Return from function

zero_result:
    # Return zero (signed zero with correct sign)
    or      a0, x0, a2          # A0 = sign bit (A2)
    jalr    x0, ra, 0           # Return from function

Answer H08, H09, and H10.