bfloat16, also known as brain float 16, originated from the Google Brain team, which introduced this new floating-point standard to accelerate machine learning computations.
We know that floating-point numbers consist of three parts: Sign, Exponent, and Fraction (or Mantissa). The Exponent determines the range of representable values (the dynamic range), and the Fraction determines their precision. Within a fixed hardware budget, a trade-off between these two properties is inevitable. bfloat16 is only half the length of float32, so on the same CPU the ideal throughput of bfloat16 computation is double that of float32; equivalently, at the same throughput, processing speed ideally increases by 2x. In terms of storage, bfloat16 also saves roughly half the space compared to float32.
The existing IEEE 754 standard already defines a half-precision floating-point format, float16, which offers similar advantages over float32 in computation speed and storage space. Why, then, do we still need bfloat16?
The first reason: float16’s dynamic range is much narrower than bfloat16’s. With only 5 exponent bits, float16 tops out at about 65504, whereas bfloat16 keeps float32’s 8 exponent bits and therefore its range of roughly ±3.4×10^38. Google explains that “neural networks are more sensitive to the size of the exponent than the size of the mantissa.”
The second reason: a float16 multiplier occupies roughly twice the area of a bfloat16 multiplier. The physical size of a hardware multiplier grows with the square of the mantissa width. float16 has a 10-bit mantissa and bfloat16 a 7-bit mantissa; squaring these gives 100 and 49, and 100/49 ≈ 2, so the float16 multiplier is about twice as large.
The third reason: conversion between float16 and float32 is more difficult than between bfloat16 and float32. This is because the formats of float16 and float32 differ entirely in both exponent size and mantissa size, so during conversion, both the exponent and mantissa must be adjusted. In contrast, the only difference between bfloat16 and float32 is in the mantissa size, making bfloat16 much easier to convert.
bfloat16 format:
float16 format:
float32 format:
The process for converting between float32 and bfloat16 is as follows:
The first step in converting float32 to bfloat16 is to check whether the value is NaN. If it is NaN, the bfloat16 result must also be NaN. NaN can be further classified into QNaN (Quiet NaN) and SNaN (Signaling NaN): a QNaN represents an undefined result and propagates quietly through computations, while an SNaN raises an exception when operated on. Since we want to avoid exceptions during machine learning model computations, any NaN is forcibly converted to a QNaN. By definition, if the Most Significant Bit (MSB) of the mantissa is 1, the value is a QNaN; if the MSB is 0, it is an SNaN.
If the float32 value is not NaN, the lower 16 bits of the mantissa must be rounded away using the “Round to Nearest, ties to Even” mode: add 0x7FFF plus bit 16 of the float32 bit pattern (bit 16 is the future LSB of the bfloat16) to the 32-bit value. When the discarded bits are less than halfway, this leaves the kept bits unchanged; when they are more than halfway, it carries a 1 into the kept bits; and when they are exactly halfway, the extra LSB term makes the result round toward an even LSB.
Finally, shift the entire value to the right by 16 bits to convert it into bfloat16 format.
To convert bfloat16 back to float32, simply shift the value to the left by 16 bits.
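The three steps above can be sketched in C. This is a minimal sketch of the described algorithm, not the exact code used in this project; memcpy is used for the bit reinterpretation, which a union would accomplish equally well.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bytes as uint32_t and back. memcpy keeps the
 * conversion well-defined; a union achieves the same effect. */
static uint32_t f32_bits(float f)  { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float    bits_f32(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* float32 -> bfloat16, following the three steps described above. */
static uint16_t f32_to_bf16(float f)
{
    uint32_t u = f32_bits(f);
    if ((u & 0x7fffffff) > 0x7f800000)          /* NaN: exponent all ones, mantissa nonzero */
        return (uint16_t)((u >> 16) | 0x0040);  /* set mantissa MSB: force a quiet NaN */
    /* round to nearest, ties to even: add 0x7fff plus bit 16
     * (the future LSB of the bfloat16), then truncate */
    u += 0x7fff + ((u >> 16) & 1);
    return (uint16_t)(u >> 16);
}

/* bfloat16 -> float32: simply shift left by 16 bits. */
static float bf16_to_f32(uint16_t h) { return bits_f32((uint32_t)h << 16); }
```

For example, f32_to_bf16(1.0f) gives 0x3f80, which shifted left by 16 bits is 0x3f800000, the float32 pattern of 1.0.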
Why was a union used rather than pointer casts?
A union allows different data types to share the same memory space, so a float and a uint32_t can view the same four bytes, giving direct access to the floating-point number’s binary representation. In contrast, achieving the same effect by casting a float pointer to a uint32_t pointer and dereferencing it violates C’s strict-aliasing rule and is undefined behavior, whereas reading a different union member than the one last written is well-defined in C.
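As a small illustration (the helper name float_to_bits is mine, not from the original code):

```c
#include <stdint.h>

/* A union lets the same 4 bytes be read either as a float
 * or as its raw IEEE 754 bit pattern. */
typedef union {
    float    f;
    uint32_t u;
} f32_bits;

/* Return the bit pattern of a float. Writing .f and then reading .u
 * is well-defined type punning in C (C99 and later); a cast such as
 * *(uint32_t *)&x would break the strict-aliasing rule instead. */
static uint32_t float_to_bits(float f)
{
    f32_bits v;
    v.f = f;
    return v.u;   /* e.g. float_to_bits(1.0f) == 0x3f800000 */
}
```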
Given an integer array nums of size n, return the number with the value closest to 0 in nums. If there are multiple answers, return the number with the largest value.
Constraints:
Example 1:
Input: nums = [-4,-2,1,4,8]
Output: 1
Explanation:
The distance from -4 to 0 is |-4| = 4.
The distance from -2 to 0 is |-2| = 2.
The distance from 1 to 0 is |1| = 1.
The distance from 4 to 0 is |4| = 4.
The distance from 8 to 0 is |8| = 8.
Thus, the closest number to 0 in the array is 1.
Example 2:
Input: nums = [2,-1,1]
Output: 1
Explanation: 1 and -1 are both the closest numbers to 0,
so 1, being larger, is returned.
My intuitive idea is to keep a variable called closest_num for the number closest to 0, initialized to the first element of the array. A for loop then scans the array: if the absolute value of the current element is smaller than that of closest_num, the current element is closer to 0, so it replaces closest_num. If the absolute values are equal, I check whether the current element is greater than closest_num; if so, it is the larger of the two, so it replaces closest_num.
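This scan can be written directly in C as a sketch of the approach (the signature follows the usual LeetCode template; abs comes from stdlib.h):

```c
#include <stdlib.h>

/* Scan once, tracking the value closest to 0; on a tie in
 * absolute value, prefer the larger (positive) number. */
int findClosestNumber(const int *nums, int n)
{
    int closest_num = nums[0];
    for (int i = 1; i < n; i++) {
        int cur = abs(nums[i]), best = abs(closest_num);
        if (cur < best || (cur == best && nums[i] > closest_num))
            closest_num = nums[i];
    }
    return closest_num;
}
```

On the two examples above, findClosestNumber returns 1 in both cases.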
In the second version, I modified the first version’s design, where computing the absolute value jumped to a separate set_abs function, so that the absolute value is computed directly inside the findNonMinOrMax function. After this modification, the number of execution cycles decreased from 675 to 367, a very effective optimization.
After placing my second version of the assembly program into Ripes and successfully compiling it, I noticed that there were additional instructions that I hadn’t originally written. The reason for this is that my assembly program used some pseudo-instructions, which are created to make writing programs more convenient for programmers. In reality, the machine cannot execute pseudo-instructions. After being translated by the assembler, these pseudo-instructions are converted into equivalent assembly instructions. Additionally, I also noticed that the register names were converted from ABI names to the actual RISC-V register numbers.
I chose to use Ripes’ 5-stage processor, which includes a hazard detection unit and a forwarding unit. The architecture diagram is shown below:
The so-called 5-stage pipeline refers to the following stages:
1. IF (Instruction Fetch)
2. ID (Instruction Decode)
3. EX (Execution)
4. MEM (Memory Access)
5. WB (Write Back)
The key to allowing instructions to execute concurrently in a pipeline lies in the rectangular bars that separate each stage, known as pipeline registers. These registers hold each stage’s data and control signals, ensuring that the correct values are available to the next stage.
In the RISC-V ISA, machine code formats are divided into R, I, B, U, J, and S types. Below, I will introduce each instruction format in order and explain how they are executed in the pipeline.
In RISC-V assembly language, R-type instructions have two source registers and one destination register; examples include add, sub, and so on. The machine code format is as follows:
Taking add x29, x0, x7 as an example for conversion into machine code, we can refer to the 2024 RISC-V spec. (page 554) and know that for the add instruction, funct7 = 0000000, funct3 = 000, and opcode = 0110011. The source register rs1 is x0, which is converted into binary as 00000, and rs2 is x7, which is converted into binary as 00111. The destination register rd is x29, which is converted into binary as 11101. Finally, combining this information results in the 32-bit machine code:
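The field packing can be checked with a small C helper (my own sketch, not from the project code):

```c
#include <stdint.h>

/* Pack the R-type fields: funct7 | rs2 | rs1 | funct3 | rd | opcode. */
static uint32_t encode_rtype(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                             uint32_t funct3, uint32_t rd, uint32_t opcode)
{
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* add x29, x0, x7: encode_rtype(0, 7, 0, 0, 29, 0x33) == 0x00700eb3 */
```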
Finally, let’s use Ripes to see how the add instruction is executed in the pipeline. Using the code above as an example, here is an excerpt of the relevant portion of the code:
The Decoder receives the instruction 0x00700eb3 and performs the decoding.
The Immediate Generate Unit, based on the input type being R-type, finds that there is no available immediate field, so it outputs the default value 0xdeadbeef. For more on why the default is 0xdeadbeef, you can find discussions on the topic in the Reddit forum.
The Register Files, based on the input R1 idx, find the corresponding x0 register and output the content of the x0 register (0x00000000).
The Register Files, based on the input R2 idx, find the corresponding x7 register and output the content of the x7 register (0xfffffffc).
Finally, the relevant data is stored in the ID/EX register.
Although the value of rs2 is input to Data in and Res is input to Addr., under an R-type instruction, the control signal for Memory write is 0, indicating that no data is written to Memory.
Based on the address pointed to by Res, we can observe the state of the Memory. The address used is only up to 0x00000098, and addresses after that have not been used, meaning that 0xfffffffc has not yet been allocated any data. Therefore, the output at that address (Read Out) is 0x00000000.
Finally, the relevant data is stored in the MEM/WB register.
After completing all stages, the registers are updated as follows:
We can see that 0xfffffffc was successfully written to the t4 register.
In RISC-V assembly language, I-type instructions mainly consist of load instructions and immediate instructions (except for lui and auipc), such as lw, addi, andi, and so on. The machine code format is as follows:
Taking addi x9, x0, 3 as an example for conversion into machine code, we can refer to the 2024 RISC-V spec. (page 554) and find that for the addi instruction, funct3 = 000 and opcode = 0010011. The source register rs1 is x0, which converts to binary as 00000, and the destination register rd is x9, which converts to binary as 01001. The immediate value 3 converts to binary as 000000000011. Finally, combining this information results in the 32-bit machine code:
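As with the R-type, the I-type packing can be sketched in C (my own helper, not from the project code; the immediate is masked to its 12-bit field):

```c
#include <stdint.h>

/* Pack the I-type fields: imm[11:0] | rs1 | funct3 | rd | opcode. */
static uint32_t encode_itype(int32_t imm, uint32_t rs1, uint32_t funct3,
                             uint32_t rd, uint32_t opcode)
{
    return (((uint32_t)imm & 0xfff) << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* addi x9, x0, 3: encode_itype(3, 0, 0, 9, 0x13) == 0x00300493 */
```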
Finally, let’s use Ripes to see how the addi instruction is executed in the pipeline. Using the code above as an example, here is an excerpt of the relevant portion of the code:
Although the value of rs2 is input to Data in and Res is input to Addr., under an I-type instruction, the control signal for memory write is 0, indicating that no data is written to Memory.
Based on the address pointed to by Res, we can observe the memory’s storage status. The address 0x00000003 corresponds to the address of Byte 3 at 0x00000000, while the other 3 bytes are taken from the address 0x00000004, meaning Byte 0, Byte 1, and Byte 2. Since RISC-V uses little-endian format, Byte 3 at 0x00000000 holds the least significant byte (LSB), and Byte 2 at 0x00000004 holds the most significant byte (MSB). Finally, the address outputs Read Out as 0x00091700. (Typically, memory accesses must follow alignment rules, where word accesses must be at addresses that are multiples of 4. However, this does not affect the pipeline as it doesn’t rely on the value read from memory.)
Finally, the relevant data is stored in the MEM/WB register.
After completing all stages, the register updates are as follows:
We can see that 0x00000003 was successfully written to the s1 register.
In RISC-V assembly language, S-type instructions are primarily related to Store Memory operations, such as sw, sb, and so on. The machine code format is as follows:
Taking sw x5, 32(x7) as an example for conversion into machine code, we can refer to the 2024 RISC-V spec. (page 554) and find that for the sw instruction, funct3 = 010 and opcode = 0100011. The source register rs1 is x7, which converts to binary as 00111, and rs2 is x5, which converts to binary as 00101. The immediate value 32 converts to binary as 000000100000. Finally, combining this information results in the 32-bit machine code:
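The S-type splits its immediate into two fields, which the following C sketch (my own helper, not from the project code) makes explicit:

```c
#include <stdint.h>

/* Pack the S-type fields: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode. */
static uint32_t encode_stype(int32_t imm, uint32_t rs2, uint32_t rs1,
                             uint32_t funct3, uint32_t opcode)
{
    uint32_t u = (uint32_t)imm;
    return (((u >> 5) & 0x7f) << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | ((u & 0x1f) << 7) | opcode;
}

/* sw x5, 32(x7): encode_stype(32, 5, 7, 2, 0x23) == 0x0253a023 */
```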
Finally, let’s use Ripes to see how the sw instruction is executed in the pipeline. Since the previous code did not include the sw instruction, a small example is created as follows:
The value of rs2 (0x00000064) is input to Data in, and Res (0x00000020) is input to Addr.. In the case of an S-type instruction, the Memory write control signal is 1, indicating that data will be written to Memory.
Based on the address pointed to by Res, we can check the Memory’s current status. The address 0x00000020 has not yet been allocated data, and the write operation has not yet occurred at this point, so the address outputs Read Out as 0x00000000.
Finally, the relevant data is stored in the MEM/WB register.
After the instruction completes all stages, the Memory is updated as follows:
We can see that 0x00000064 was successfully written to the memory address 0x00000020.
In RISC-V assembly language, B-type instructions primarily involve branch operations, such as beq, bne, bge, and so on. The machine code format is as follows:
We can see that the machine code format has no position for immediate[0]. Branch targets are restricted to even addresses (2-byte alignment), so immediate[0] is always 0 and need not be stored: the actual immediate is 13 bits, and after reassembling the 12 stored bits, a 0 is appended on the right.
Therefore, the jump range for branch instructions is:
Taking bne x5, x0, -60 as an example for conversion into machine code, we can refer to the 2024 RISC-V spec. (page 554) and find that for the bne instruction, funct3 = 001 and opcode = 1100011. The source register rs1 is x5, which converts to binary as 00101, and rs2 is x0, which converts to binary as 00000. The immediate value -60 converts to binary as 1111111000100. Finally, combining this information results in the 32-bit machine code:
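The B-type scatters its immediate bits across the instruction word; the following C sketch (my own helper, not from the project code) shows exactly where each bit lands:

```c
#include <stdint.h>

/* Pack the B-type fields. The 13-bit immediate (imm[0] always 0) is
 * scattered as imm[12|10:5] rs2 rs1 funct3 imm[4:1|11] opcode. */
static uint32_t encode_btype(int32_t imm, uint32_t rs2, uint32_t rs1,
                             uint32_t funct3, uint32_t opcode)
{
    uint32_t u = (uint32_t)imm;
    return (((u >> 12) & 1)    << 31) |  /* imm[12]   */
           (((u >> 5)  & 0x3f) << 25) |  /* imm[10:5] */
           (rs2 << 20) | (rs1 << 15) | (funct3 << 12) |
           (((u >> 1)  & 0xf)  << 8)  |  /* imm[4:1]  */
           (((u >> 11) & 1)    << 7)  |  /* imm[11]   */
           opcode;
}

/* bne x5, x0, -60: encode_btype(-60, 0, 5, 1, 0x63) */
```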
Finally, let’s use Ripes to see how the bne instruction is executed in the pipeline. Using the code above as an example, here is an excerpt of the relevant portion of the code:
The value of rs1 and the other two forwarding paths form the first-level 3x1 multiplexer. Since the previous instruction (addi x5, x5, -1) modified the x5 register, the forwarding function will be used here. Therefore, the first-level 3x1 multiplexer passes through the value from the previous instruction’s result (Res), generated in the EX stage, which is 0x00000004. In the second-level 2x1 multiplexer, composed of the output from the first-level 3x1 multiplexer and the program counter value, since the Branch Unit determines that a jump will occur, the second-level 2x1 multiplexer passes through the program counter value (0x00000074) into Op1.
The value of rs2 and the other two forwarding paths form the first-level 3x1 multiplexer, and since the forwarding function is not used here, the first-level 3x1 multiplexer passes through the value of rs2 (0x00000000). In the second-level 2x1 multiplexer, composed of the output from the first-level 3x1 multiplexer and the program counter value, since the Branch Unit determines that a jump will occur, the second-level 2x1 multiplexer passes through the immediate value (0xffffffc4) into Op2.
The Branch Unit determines that the branch instruction will jump.
The ALU performs addition based on the control signals, adding 0x00000074 and 0xffffffc4, resulting in Res being 0x00000038.
After calculating the address, in the next cycle, the value of Res will be written to the Program Counter.
Finally, the relevant data is stored in the EX/MEM register.
We can see that when the branch instruction decides to jump, the two instructions immediately following the branch are cleared and replaced with nop (no operation).
The value of rs2 (0x00000000) is input into Data in, and Res (0x00000038) is input into Addr. However, under a B-type instruction, the Memory write control signal is 0, indicating that data cannot be written to Memory.
Based on the address pointed to by Res, we can check the memory status. The address 0x00000038 stores 0x0009a383, so the output from that address (Read Out) is 0x0009a383.
Finally, the relevant data is stored in the MEM/WB register.
After completing all stages, the register updates are as follows:
Since the bne instruction is preceded by the addi x5, x5, -1 instruction, the x5 register will be modified from 0x00000005 to 0x00000004. The bne instruction does not change the value of the register, so the x5 register remains 0x00000004.
In RISC-V assembly language, there are only two U-type instructions: lui and auipc. The machine code format is as follows:
We can see that the machine code format has no position for immediate[11:0]. This is because lui and auipc operate on the upper 20 bits: the stored immediate is effectively left-shifted by 12 bits, so immediate[11:0] of the result is always 12 zeros.
Taking auipc x18, 0x10000 as an example for conversion into machine code, we can refer to the 2024 RISC-V spec. (page 554) and find that the auipc opcode is 0010111. The destination register rd is x18, which converts to binary as 10010, and the immediate value 0x10000 converts to binary as 00010000000000000000. Finally, combining this information results in the 32-bit machine code:
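The U-type is the simplest packing, as this C sketch (my own helper, not from the project code) shows:

```c
#include <stdint.h>

/* Pack the U-type fields: imm[31:12] | rd | opcode. */
static uint32_t encode_utype(uint32_t imm20, uint32_t rd, uint32_t opcode)
{
    return (imm20 << 12) | (rd << 7) | opcode;
}

/* auipc x18, 0x10000: encode_utype(0x10000, 18, 0x17) == 0x10000917 */
```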
Finally, let’s use Ripes to see how the auipc instruction is executed in the pipeline. Using the code above as an example, here is an excerpt of the relevant portion of the code:
Finally, the relevant data is stored in the MEM/WB register.
After completing all stages, the register updates are as follows:
We can see that 0x10000004 was successfully written to the s2 register.
In RISC-V assembly language, there is only one J-type instruction: jal. The machine code format is as follows:
Taking jal x1, 28 as an example for conversion into machine code, we can refer to the 2024 RISC-V spec. (page 554) and find that the jal opcode is 1101111. The destination register rd is x1, which converts to binary as 00001, and the immediate value 28 converts to binary as 00000000000000011100. Finally, combining this information results in the 32-bit machine code:
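Like the B-type, the J-type scatters its immediate; this C sketch (my own helper, not from the project code) shows the bit placement:

```c
#include <stdint.h>

/* Pack the J-type fields. The 21-bit immediate (imm[0] always 0) is
 * stored as imm[20|10:1|11|19:12] rd opcode. */
static uint32_t encode_jtype(int32_t imm, uint32_t rd, uint32_t opcode)
{
    uint32_t u = (uint32_t)imm;
    return (((u >> 20) & 1)     << 31) |  /* imm[20]    */
           (((u >> 1)  & 0x3ff) << 21) |  /* imm[10:1]  */
           (((u >> 11) & 1)     << 20) |  /* imm[11]    */
           (((u >> 12) & 0xff)  << 12) |  /* imm[19:12] */
           (rd << 7) | opcode;
}

/* jal x1, 28: encode_jtype(28, 1, 0x6f) */
```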
Finally, let’s use Ripes to see how the jal instruction is executed in the pipeline. Using the code above as an example, here is an excerpt of the relevant portion of the code:
The value of rs1 and the other two forwarding paths form the first-level 3x1 multiplexer. Since the forwarding function is not used here, the first-level 3x1 multiplexer passes through the value of rs1 (0x00000000). In the second-level 2x1 multiplexer, which consists of the output from the first-level 3x1 multiplexer and the Program Counter value, since the jal instruction adds the Program Counter value to the immediate constant, the second-level 2x1 multiplexer passes through the Program Counter value (0x00000014) into Op1.
The value of rs2 and the other two forwarding paths form the first-level 3x1 multiplexer. Since the forwarding function is not used here, the first-level 3x1 multiplexer passes through the value of rs2 (0x00000000). In the second-level 2x1 multiplexer, which consists of the output from the first-level 3x1 multiplexer and the Program Counter value, since the jal instruction adds the Program Counter value to the immediate constant, the second-level 2x1 multiplexer passes through the immediate value (0x0000001c) into Op2.
The ALU performs addition based on the control signals, adding 0x00000014 and 0x0000001c, resulting in Res being 0x00000030.
After calculating the address, in the next cycle, the value of Res will be written to the Program Counter.
Finally, the relevant data is stored in the EX/MEM register.
We can see that when the jal instruction jumps, the two instructions immediately following the jal are cleared and replaced with nop (no operation).
Finally, the relevant data is stored in the MEM/WB register.
5. WB (Write Back)
We can see that 0x00000018 was successfully written to the ra register.
In this architecture, executing instructions can lead to some issues, such as data hazards. A data hazard occurs when a later instruction requires data from an earlier instruction, but the data from the earlier instruction has not yet been written back to the register. As a result, the later instruction may access stale data, leading to incorrect results. If the data is incorrect, the final results of the executed instructions will not match the intended outcomes of the original instructions.
The forwarding unit is one solution to address data hazards. The core idea is to send the freshly produced data directly to the instruction that needs it. Using the code from above as an example:
After executing from the beginning to the 3rd cycle, the progress of each instruction in the pipeline and the contents stored in the registers are shown in the diagram below:
We can see that the instruction in the EX stage (auipc x18, 0x10000) needs to write back to the x18 register, while the instruction in the ID stage (addi x18, x18, 52) also needs to read from the x18 register; in the ID stage it has already read stale data (x18 = 0x00000000). Next, we observe the progress of each instruction in the pipeline and the contents stored in the registers after the 4th cycle, as shown in the diagram below:
Thanks to the forwarding unit, the result produced by auipc x18, 0x10000, now one stage ahead, is routed back to the EX stage (marked in yellow). This forwarded value, together with Reg1 (marked in red) and another forwarding path (marked in blue), feeds a 3x1 multiplexer. Having detected the data hazard between auipc x18, 0x10000 and addi x18, x18, 52, the control logic selects the yellow forwarding path, so addi x18, x18, 52 computes with the correct value of x18 (x18 = 0x10000004).
Finally, we manually calculate addi x18, x18, 52 (0x10000004 + 52 = 0x10000004 + 0x34), and the execution result sets x18 to 0x10000038. We check the result after the addi x18, x18, 52 instruction passes through the WB stage and confirm that this is indeed the case:
Although forwarding provides a way to resolve data hazards, there is a specific type of data hazard called Load-use data hazard that cannot be resolved simply through forwarding. A Load-use data hazard occurs when the destination register of a Load instruction is the same as the source register of the subsequent instruction. This is because the Load instruction needs to read data through the MEM stage before it can forward the required data. However, by the time the Load instruction reads the data from the MEM stage, the subsequent instruction has already computed its result in the EX stage, making it impossible to use the correct data. If the subsequent instruction could be stalled for one cycle, it would have time to use the forwarded data correctly.
The solution is to add a hazard detection unit that, when the destination register of the Load instruction matches the source register of the following instruction, stalls the subsequent instruction for 1 cycle, thus resolving the Load-use data hazard.
Here is a code example:
After executing from the beginning to the 3rd cycle, the progress of each instruction in the pipeline and the contents stored in the registers are shown in the diagram below:
We can see that the instruction in the EX stage (lw x6, 0(x5)) needs to write back to the x6 register, while the instruction in the ID stage (add x6, x7, x6) also needs to read from the x6 register. This will be detected as a Load-use data hazard by the hazard detection unit. Next, we observe the progress of each instruction in the pipeline and the contents stored in the registers after the 4th cycle, as shown in the diagram below:
The instruction originally in the ID stage (add x6, x7, x6) should have moved to the EX stage in the next cycle. However, because the hazard detection unit detected a Load-use data hazard, it stalled both the ID stage and the IF stage instructions for one cycle and cleared the EX stage to a NOP instruction. Next, we observe the progress of each instruction in the pipeline and the contents stored in the registers after the 5th cycle, as shown in the diagram below:
We can see that after stalling for 1 cycle, the instruction (add x6, x7, x6) can use the memory forwarding path (marked in yellow). The forwarded data is the word loaded from memory at the address held in x5 (0x00000064), destined for the x6 register. From the diagram below, the value stored at address 0x00000064 is 0x00000000, so the forwarding path delivers 0x00000000. Finally, manually computing add x6, x7, x6 (0x00000000 + 0x00000000 = 0x00000000) gives x6 = 0x00000000.
We check the result after the instruction add x6, x7, x6 passes through the WB stage, and we confirm that this is indeed the case:
The RISC-V Instruction Set Manual Volume I (Version 2024/04/11)