contributed by <dhiptmc>
RISC-V
Computer Architecture 2022
Decompress Run-Length Encoded List (LeetCode 1313)
We are given a list nums of integers representing a list compressed with run-length encoding.
Consider each adjacent pair of elements [freq, val] = [nums[2i], nums[2i+1]] (with i >= 0). For each such pair, there are freq elements with value val concatenated in a sublist. Concatenate all the sublists from left to right to generate the decompressed list.
Return the decompressed list.
The input array and numsSize are known when the program starts.
We first traverse the input array (int i = 0; i < numsSize; i += 2) to find the size of the output array.
Next, we read the element after each "freq" to get "val".
Note that each "val" needs to be inserted into the output array "freq" times.
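The two passes described above can be sketched in C with LeetCode's signature for this problem (a minimal sketch; error handling for a failed malloc is omitted):

```c
#include <stdlib.h>

/* First pass: sum the freq fields (nums[2i]) to size the output.
 * Second pass: write each val (nums[2i+1]) freq times. */
int *decompressRLElist(const int *nums, int numsSize, int *returnSize)
{
    int size = 0;
    for (int i = 0; i < numsSize; i += 2)
        size += nums[i];                    /* freq */

    int *result = malloc(size * sizeof(int));
    int count = 0;
    for (int i = 0; i < numsSize; i += 2)
        for (int j = 0; j < nums[i]; j++)   /* repeat val freq times */
            result[count++] = nums[i + 1];

    *returnSize = size;
    return result;
}
```

For input {1, 2, 3, 4} this produces {2, 4, 4, 4}: one 2, then three 4s.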
You can find the source code here. Feel free to fork and modify it.
:warning: Can you use fewer instructions?
:notes: jserv
Indeed, there are ways to use fewer instructions.
The modified version of the assembly code is shown below.
As a result, the clock cycle count decreases compared to the original work, as the figures below show.
Origin:
After modified:
We test our code using the Ripes simulator.
Fetch instruction from memory.
In this example, lw x12 0 x12 is in the IF stage.
The current PC (program counter) is 0x00000010, which fetches the instruction encoded as 0x00062603.
We can see that a simple ALU adds 4 to the current PC to produce the next PC.
Meanwhile, the current PC and PC+4 are delivered to IFID for later use.
Note that the PC unit may instead receive a target address from the EXE stage (e.g., on a taken branch).
Read registers while decoding the instruction.
In this example, addi x12 x12 64 is in the ID stage.
The Decode unit outputs the x12 address (0x0c) to the Registers, and the Registers output the corresponding value stored in x12 to IDEX. In this circumstance, Reg2 is unnecessary for the addi operation, so we can ignore it.
The Imm. unit (immediate generator) outputs the immediate value; in this case, 0x00000040 (32'd64) is delivered to IDEX for later use.
Note that the target register address (x12) for writing back to the Registers, PC, and PC+4 are also delivered to IDEX.
Execute the operation or calculate an address.
In this example, auipc x12 0x10000 is in the EXE stage.
The EXE stage is somewhat complicated: four MUXes (multiplexers) can be found.
The left two MUXes choose the sources of the input registers, which can come directly from IDEX (Reg1/Reg2) or from two different forwarding paths.
The right two MUXes choose whether each Op source comes from the PC, the Imm. unit, or a register. In this case, Op1 comes from the PC and Op2 comes from the Imm. unit.
The ALU unit eventually finishes the arithmetic operation and outputs the result to EXMEM; for some branch instructions it also outputs the target PC address to the IF stage.
In addition, a Branch unit takes the values of two registers to perform branch control. This splits the comparison off from the traditional ALU operation, alleviating the ALU's workload, and may improve performance for branch operations.
Note that the target register, one register input, and PC+4 are also delivered to EX/MEM.
Access an operand in data memory.
In this example, addi x11 x11 0 is in the MEM stage.
The Data memory unit receives the Addr. and Data in signals to perform memory reads/writes. In this circumstance, there is no need to use the Data memory unit, so we can simply ignore the Read out result.
In addition, the result from the previous EXE stage can be forwarded to the current EXE stage.
Write the result into a register.
In this example, auipc x11 0x10000 is in the WB stage.
A MUX chooses the input signal from PC+4, the Data memory unit, or the result from the EXE stage.
In this circumstance, we choose the result from the EXE stage (0x10000000) and store the value to the register whose target address is 0x0b.
In addition, the output signal from the MUX can be forwarded to the EXE stage.
Memory condition when program starts
From the Instruction memory list, we can see that the next few instructions have been loaded into the pipeline.
For instance, lw x12 0 x12 is in the IF stage.
Memory updates during main_1
a1 gets the base address of nums1.
a2 gets numsSize1.
a3 = returnSize, which is initialized to 0 for later use.
a4 = i = 0 for later branch decisions.
s2 = 2 records the number of test sets that need to be executed.
Memory updates during proc and inner_loop
t0 gets the address of nums1[i].
t1 first loads nums[i] and is used to decide whether to branch to proc.
Then t1 loads nums[i+1], t2 gets the address of result[count], and nums[i+1] is stored at that address.
Note that we shift the index left by 2 bits to turn it into a byte offset in memory.
Eventually, we make a5 += 1 (j++) and a6 += 1 (count++).
t1 reloads nums[i], and we compare the values in a5 and t1 to decide whether to go back to proc.
If we don't need to branch to proc again, we add t1 (nums[i]) to a3 (returnSize), set a5 (j) to 0, and add 2 to a4 (i).
Memory updates during outer_loop
We reset a4 (i), a5 (j), and a6 (count) to zero, preparing for later use.
Memory updates after print
s2 value is decreased by 1.
Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Ignoring potential data hazards can result in race conditions (also termed race hazards). There are three situations in which a data hazard can occur:
read after write (RAW), a true dependency
write after read (WAR), an anti-dependency
write after write (WAW), an output dependency
Read after read (RAR) is not a hazard case.
A control hazard occurs when the pipeline makes a wrong branch prediction and therefore brings instructions into the pipeline that must subsequently be discarded. The term branch hazard also refers to a control hazard.
One way to improve branch performance is to reduce the cost of a taken branch. Moving branch execution to the ID stage is a wise choice, since it reduces the penalty of a taken branch to only one instruction, namely the one currently being fetched. Although there are several difficulties to overcome (several new units and mechanisms are needed), it is still an improvement to performance.
David A. Patterson and John L. Hennessy, Computer Organization and Design, 4/e, p. 379, Figure 4.62.
Origin:
Modified: