LemonTea34
# 16-bit Fixed-point Format

> [!note] AI tools were used for data lookups and proofreading. Feedback on any mistakes is appreciated.

## Overview

The 16-bit fixed-point format represents numbers with a fixed range for both the integer and fractional parts. It is suitable for hardware that does not support floating-point arithmetic.

**Structure:** 1-bit Sign + 5-bit Integer + 10-bit Fraction

>**Note:** The structure above is one example of a 16-bit fixed-point format; it differs from the IEEE 754 floating-point standard. Users can adjust the bit allocation between the integer and fractional parts to match their application requirements.

### Qm.n Format

In this notation, **m** is the number of integer bits and **n** is the number of fractional bits. For example, **Q8.7** means:

| Sign | Integer | Precision |
|:----:|:-------:|:---------:|
| 1 bit | 8 bits | 7 bits |

>**Note:** In an alternative convention of the Q-format, the value **m** implicitly includes the sign bit. Under that convention, **Q9.7** denotes the same bit allocation:

| Sign | Integer | Precision |
|:-----:|:-------:|:----------:|
| 1 bit | 8 bits | 7 bits |

>**Note:** All formats mentioned in this article are assumed to be signed representations.

## Represented Value

**General Formula** (for the **Qm.n** format)

**In Signed Magnitude:**

$$
\text{Represented Value} = (-1)^{\text{sign bit}} \times \left(\text{Integer Part} + \frac{\text{Fraction Part}}{2^n}\right)
$$

**In Two's Complement:**

$$
\text{Represented Value} = \frac{\text{Integer Value}}{2^{n}}
$$

where **Integer Value** denotes the Qm.n bit pattern interpreted as a two's-complement signed integer.

>**Note:** In practice, hardware typically uses two's complement rather than signed magnitude to encode negative values.
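The two formulas above can be exercised directly. Below is a minimal sketch (the helper names are ours, not from any library) that encodes a decimal value into a two's-complement Qm.n bit pattern held in a 16-bit container, and decodes it back:

```c
#include <stdint.h>

/* Two's-complement Qm.n conversion for a 16-bit container with n
 * fraction bits. Helper names are ours; no saturation is performed. */
static double q_to_double(int16_t raw, int n)
{
    /* value = (two's-complement integer value) / 2^n */
    return (double)raw / (double)(1 << n);
}

static int16_t double_to_q(double x, int n)
{
    /* scale by 2^n and round to the nearest representable value */
    double scaled = x * (double)(1 << n);
    return (int16_t)(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);
}
```

For instance, `double_to_q(-23.1875, 10)` reproduces the Q5.10 two's-complement pattern `1010001101000000` (`0xA340`) derived in the Examples section below.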
### Examples

#### ==Q3.12==

Decimal value: \(5.25\)

- Sign bit: `1'b0`
- Integer bits: `3'b101`
- Fraction bits: `12'b010000000000`

**Reasoning:**

> Because \(5.25\) is positive, the sign bit is set to 0. \(5.25 = (-1)^0 \times (5 + 0.25)\); the integer part is \(5\) = `3'b101`, and the fraction \(0.25\) = `.0100 0000 0000`. Combining all parts, the result is `0 101 010000000000`.

#### ==Q5.10==

Decimal value: \(-23.1875\)

In *signed-magnitude representation*:

- Sign bit: `1'b1`
- Integer bits: `5'b10111`
- Fraction bits: `10'b0011000000`

**Reasoning:**

> Because \(-23.1875\) is negative, the sign bit is set to 1. \(-23.1875 = -1 \times (23 + 0.1875)\); the integer part is \(23\) = `5'b10111`, and since \(0.1875 = 0.125 + 0.0625\), the fraction is `.0011 0000 00`. Combining all parts, the result is `1 10111 0011000000`.

In *two's-complement representation*:

- \(23.1875 \times 2^{n} = 23744\) (in this case, \(n = 10\))
- Convert 23744 (decimal) into 16-bit binary: `0101110011000000`
- Bit-wise NOT: `1010001100111111`
- Plus 1: `1010001101000000`

So the result is `1010001101000000`.

**Why does the procedure for negative numbers differ from that of positive numbers?**

> In two's complement, the MSB carries the largest (negative) weight and acts as the sign. For positive numbers the MSB is 0 and the bit pattern can be read off directly, while negative numbers require the invert-and-add-one step.

## Range Coverage

### 16-bit Fixed-point Format (Signed)

| Integer | Range | Precision |
|---------|-------|-----------|
| 0 | $[-1.0, +0.9999694824]$ | $3.05175781 \times 10^{-5}$ |
| 1 | $[-2.0, +1.999938965]$ | $6.10351563 \times 10^{-5}$ |
| 3 | $[-8.0, +7.99975585938]$ | $2.4414062 \times 10^{-4}$ |
| 5 | $[-32.0, +31.99902344]$ | $9.765625 \times 10^{-4}$ |
| 7 | $[-128.0, +127.9960938]$ | $0.00390625$ |
| 8 | $[-256.0, +255.9921875]$ | $0.0078125$ |
| ... | $[-2^{\text{Int}},\ 2^{\text{Int}} - 2^{-\text{Frac}}]$ | $2^{-\text{Frac}}$ |
| 15 | $[-32{,}768, +32{,}767]$ | $1$ |

### Extension: 32-bit Fixed-point Format (Signed)

| Integer | Range | Precision |
|---------|-------|-----------|
| 0 | $[-1.0, +0.9999999995]$ | $4.65661287 \times 10^{-10}$ |
| 1 | $[-2.0, +1.9999999991]$ | $9.31322575 \times 10^{-10}$ |
| 2 | $[-4.0, +3.99999999814]$ | $1.86264515 \times 10^{-9}$ |
| 8 | $[-256.0, +255.99999988]$ | $1.192092895 \times 10^{-7}$ |
| 15 | $[-32{,}768.0, +32{,}767.9999847]$ | $1.525878 \times 10^{-5}$ |
| 23 | $[-8{,}388{,}608.0, +8{,}388{,}607.99609]$ | $0.00390625$ |
| ... | $[-2^{\text{Int}},\ 2^{\text{Int}} - 2^{-\text{Frac}}]$ | $2^{-\text{Frac}}$ |
| 31 | $[-2{,}147{,}483{,}648.0, +2{,}147{,}483{,}647.0]$ | $1$ |

**Remark:** Precision $= 2^{-\text{Frac}}$

>**Note:** The integer bits exclude the sign bit.

## Applications

### ==Q0.15==

**Ideal for:**
- Lookup tables for sine and cosine functions
- Digital filter coefficients, such as those used in **IIR (Infinite Impulse Response)** or **FIR (Finite Impulse Response)** filters
- Normalized parameters in machine learning models

**Analysis:**
- Since the codomain of the sine and cosine functions is \([-1, 1]\), this format is particularly suitable for machines that do not support floating-point arithmetic. Normalized parameters are likewise bounded within \([-1, 1]\), and the coefficients of digital filters such as IIR and FIR filters often lie within this range.

---

### ==Q1.14==

**Ideal for:**
- Motor control
- PID controllers
- Audio processing and Digital Signal Processing (DSP)

**Analysis:**
- This type of data requires higher precision and only a small range.

---

### ==Q3.12==

**Ideal for:**
- The PlayStation transformation coprocessor

**Analysis:**
- This coprocessor has no FPU (Floating Point Unit) and needs precision more than range, so this format suits the hardware.
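The closed-form rows in the range tables above ($[-2^{\text{Int}},\ 2^{\text{Int}} - 2^{-\text{Frac}}]$ with precision $2^{-\text{Frac}}$) can be checked numerically. A small sketch (function names are ours):

```c
#include <math.h>

/* Range and precision of a signed Qm.n format, with m integer bits
 * (sign bit excluded, as in the tables above) and n fraction bits. */
static double q_precision(int n)  { return ldexp(1.0, -n); }  /* 2^-n */
static double q_min(int m)        { return -ldexp(1.0, m); }  /* -2^m */
static double q_max(int m, int n) { return ldexp(1.0, m) - ldexp(1.0, -n); }
```

For Q5.10 this yields the table's row exactly: range $[-32.0, +31.9990234375]$ with precision $2^{-10} = 0.0009765625$.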
---

### ==Q5.10==

**Ideal for:**
- Temperature
- Angular velocity
- Voltage

**Analysis:**
- This type of physical data typically requires high precision but does not span a large range.

---

### ==Q7.8==

**Ideal for:**
- Motor control
- PID controllers
- Most industrial, household, outdoor, and medical temperature measurements, such as LM35, TMP36 and DS18B20

**Analysis:**
- The Q7.8 format sacrifices some precision in exchange for a wider representable range, making it suitable for most temperature-related applications in everyday scenarios. With a resolution of 0.00390625, Q7.8 provides sufficient accuracy for common temperature sensors.

---

### ==Q8.7==

**Ideal for:**
- Motor control
- PID controllers
- Voltage
- Angular velocity
- Acceleration

**Analysis:**
- Similar to the Q7.8 format, Q8.7 is also well-suited for PID controllers and motor control applications. It is useful for control systems where signals can occasionally exceed the range of Q7.8.

---

### ==Q15.0==

**Ideal for:**
- Pure signed integer data

**Analysis:**
- Since the Q15.0 format has no fractional bits, it degenerates into a plain 16-bit signed integer (a sign bit plus 15 integer bits).

---

### ==Q0.31==

**Ideal for:**
- Lookup tables for sine and cosine functions
- Normalized data

**Analysis:**
- The Q0.31 format offers extremely high precision ($4.65661287 \times 10^{-10}$), making it well-suited for representing highly precise values. It is commonly used for normalized coefficients in neural networks, where data—after normalization—is bounded within the range $[-1, 1]$.

---

### ==Q1.30==

**Ideal for:**
- Butterworth and other IIR (Infinite Impulse Response) filters implemented as second-order sections

**Analysis:**
- This format provides very high precision, and its dynamic range is sufficient for such filters.
---

### ==Q8.23==

**Ideal for:**
- High-resolution HDR (High Dynamic Range) image and audio processing
- High-precision control systems

**Analysis:**
- The Q8.23 format is suitable for applications that prioritize precision over range.

---

### ==Q15.16==

**Ideal for:**
- [Sega Saturn VDP coprocessors](https://www.copetti.org/writings/consoles/sega-saturn/), [VDP1 (Saturn)](https://segaretro.org/VDP1_(Saturn)) and [VDP2 (Saturn)](https://segaretro.org/VDP2_(Saturn))
- [The TeX typesetting software](https://en.wikipedia.org/wiki/TeX)
- Audio processing
- [Intel 80386](https://zh.wikipedia.org/wiki/Intel_80386), [Intel 80486SX](https://zh.wikipedia.org/zh-tw/Intel_80486)
- Motor control
- PID controllers

**Analysis:**
- Q15.16 is also known as Q16.16. The integer range ($\pm 32{,}767$) is sufficient for most physical quantities; for example, in motor control, motor speed (RPM) or error values rarely exceed $32{,}000$. The fractional resolution ($1.525878 \times 10^{-5}$) is fine enough for smooth control. This balance between integer and fractional bits makes Q16.16 suitable for most application scenarios.

---

#### **Case Study: Why did [DOOM](https://en.wikipedia.org/wiki/Doom_(franchise)) (1993) use Q16.16?**

> Many consumer CPUs in the early 1990s (like the [Intel 486SX](https://en.wikipedia.org/wiki/I486SX)) lacked a Floating Point Unit. Consequently, floating-point operations had to be emulated in software, which was too slow for real-time graphics. Q16.16 allowed the engine to use standard integer (ALU) instructions, achieving playable frame rates (around 35 FPS).

| Fixed-point Format | Range | Precision | Description |
| :---: | :---: | :---: | :---: |
| Q24.8 | $[-8{,}388{,}608.0, +8{,}388{,}607.99609]$ | $0.00390625$ | Too imprecise |
| Q8.24 | $[-256.0, +255.99999988]$ | $1.192092895 \times 10^{-7}$ | Less range, more precision |
| Q16.16 | $[-32{,}768.0, +32{,}767.9999847]$ | $1.525878 \times 10^{-5}$ | Moderate range, moderate precision |

---

> GitHub: [doomdata.h](https://github.com/id-Software/DOOM/blob/master/linuxdoom-1.10/doomdata.h)

```c
// doomdata.h
// A single Vertex.
typedef struct
{
  short		x;
  short		y;
} mapvertex_t;
```

In DOOM, the engine stores map vertices as `short` (16-bit signed integer) values. In C, a signed short has a [range](https://learn.microsoft.com/zh-tw/cpp/cpp/data-type-ranges?view=msvc-170) of $-32{,}768$ to $+32{,}767$—exactly the integer range of Q16.16—which makes the Q16.16 fixed-point format a natural fit for storing map data.

---

> GitHub: [r_draw.c](https://github.com/id-Software/DOOM/blob/master/linuxdoom-1.10/r_draw.c)

```c
// r_draw.c
void R_DrawColumn (void)
{
    int       count;
    byte*     dest;
    fixed_t   frac;
    fixed_t   fracstep;

    count = dc_yh - dc_yl;

    // Zero length, column does not exceed a pixel.
    if (count < 0)
        return;

#ifdef RANGECHECK
    if ((unsigned)dc_x >= SCREENWIDTH
        || dc_yl < 0
        || dc_yh >= SCREENHEIGHT)
        I_Error ("R_DrawColumn: %i to %i at %i", dc_yl, dc_yh, dc_x);
#endif

    // Framebuffer destination address.
    // Use ylookup LUT to avoid multiply with ScreenWidth.
    // Use columnofs LUT for subwindows?
    dest = ylookup[dc_yl] + columnofs[dc_x];

    // Determine scaling,
    //  which is the only mapping to be done.
    fracstep = dc_iscale;
    frac = dc_texturemid + (dc_yl-centery)*fracstep;

    // Inner loop that does the actual texture mapping,
    //  e.g. a DDA-lile scaling.
    // This is as fast as it gets.
    do
    {
        // Re-map color indices from wall texture column
        //  using a lighting/special effects LUT.
        *dest = dc_colormap[dc_source[(frac>>FRACBITS)&127]];

        dest += SCREENWIDTH;
        frac += fracstep;
    } while (count--);
}
```

This code is responsible for rendering vertical wall texture columns. Both `frac` and `fracstep` are of the `fixed_t` type; `fracstep` determines the texture coordinate increment for each screen pixel.
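The core Q16.16 arithmetic behind this reduces to a widening multiply followed by a 16-bit shift. A minimal sketch of that idea (the `fixed_mul` name is ours; DOOM's own version lives in `m_fixed.c`):

```c
#include <stdint.h>

typedef int32_t fixed_t;        /* Q16.16, as in DOOM */
#define FRACBITS 16
#define FRACUNIT (1 << FRACBITS)

/* Q16.16 multiply: widen to 64 bits so the full Q32.32 intermediate
 * fits, then drop 16 fraction bits to return to Q16.16 (truncation). */
static fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) >> FRACBITS);
}
```

Every operation here maps to ordinary integer (ALU) instructions, which is exactly why the format was viable on FPU-less CPUs.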
If the precision were insufficient, calculation errors would accumulate over the length of the column, causing **texture swimming** or **visual jitter**. The resolution provided by Q16.16, however, is sufficient to ensure a **stable** visual appearance.

In conclusion, Q16.16 provides the range and precision DOOM's engine requires, making it the ideal balance between map size and visual stability.

---

### ==Q23.8==

**Ideal for:**
- GPS
- Navigation systems

**Analysis:**
- The Q23.8 format is suitable for scenarios that require a wide representable range but can tolerate lower precision. It is commonly used in standard GPS applications, such as smartwatches, but may not meet the accuracy requirements of industrial or military-grade systems.

---

### ==Q31.0==

**Ideal for:**
- Very large physical data, such as distance

**Analysis:**
- Due to its wide dynamic range and low precision requirements, this format is well suited for representing large-scale physical quantities.

---

## Example 1: ==CMSIS-DSP==

>GitHub: [CMSIS-DSP](https://github.com/ARM-software/CMSIS-DSP)

[CMSIS-DSP](https://arm-software.github.io/CMSIS_6/latest/DSP/index.html) is an open-source software library that implements common compute processing functions optimized for Arm Cortex-M and Cortex-A processors.

- [arm_mat_vec_mult_q31.c](https://github.com/ARM-software/CMSIS-DSP/blob/main/Source/MatrixFunctions/arm_mat_vec_mult_q31.c)
- [arm_mat_vec_mult_q15.c](https://github.com/ARM-software/CMSIS-DSP/blob/main/Source/MatrixFunctions/arm_mat_vec_mult_q15.c)
- [arm_mat_vec_mult_q7.c](https://github.com/ARM-software/CMSIS-DSP/blob/main/Source/MatrixFunctions/arm_mat_vec_mult_q7.c)

All three files compute the product of a matrix and a vector. However, to achieve maximum performance on ARM processors, each file employs a different instruction strategy.
All of them compute the product of an $M \times N$ matrix **A** and an $N \times 1$ vector **x**, namely:

$$
y = Ax
$$

All elements (in the vector, matrix, and scalar results) are represented in the **Q0.31**, **Q0.15**, and **Q0.7** fixed-point formats, respectively.

---

**Note:** Although these source files include [SIMD](https://zh.wikipedia.org/zh-tw/%E5%8D%95%E6%8C%87%E4%BB%A4%E6%B5%81%E5%A4%9A%E6%95%B0%E6%8D%AE%E6%B5%81)-optimized implementations (utilizing [MVE (M-profile Vector Extension) / Helium](https://www.arm.com/zh-tw/architecture/cpu/m-profile) or [Neon](https://www.arm.com/technologies/neon)), for the sake of clarity the following discussion focuses exclusively on the **C fallback (non-SIMD)** implementation.

**C fallback:** the scalar implementation used when the target processor lacks MVE (Helium) or Neon support, or when compiler settings do not enable these acceleration features. In such scenarios, the library defaults to a pure C, non-vectorized execution path.

> **Note:** Q31, Q15, and Q7 denote **Q0.31**, **Q0.15**, and **Q0.7**, respectively.

> **Note:** In this article, Q0.31 denotes 0 integer bits excluding the sign bit. In some literature (including ARM documentation), this is referred to as Q1.31 (1 sign bit + 31 fraction bits).

### C Fallback

1. [**arm_mat_vec_mult_q31.c**](https://github.com/ARM-software/CMSIS-DSP/blob/main/Source/MatrixFunctions/arm_mat_vec_mult_q31.c)

- Format: [Q0.31](#==Q0.31==), range: $[-1.0, +0.9999999995]$
- Initialization:

```c=
uint32_t numRows = pSrcMat->numRows;
uint32_t numCols = pSrcMat->numCols;
const q31_t *pSrcA = pSrcMat->pData;
const q31_t *pInA1;      /* input data matrix pointer A of Q31 type */
const q31_t *pInA2;      /* input data matrix pointer A of Q31 type */
const q31_t *pInA3;      /* input data matrix pointer A of Q31 type */
const q31_t *pInA4;      /* input data matrix pointer A of Q31 type */
const q31_t *pInVec;     /* input data matrix pointer B of Q31 type */
q31_t *px;               /* Temporary output data matrix pointer */
uint16_t i, row, colCnt; /* loop counters */
q31_t matData, matData2, vecData, vecData2;
```

#### Optimization: **[Loop Unrolling](https://en.wikipedia.org/wiki/Loop_unrolling)**

<details>
<summary>loop unrolling detailed code</summary>

```c=
while (row > 0)
{
    /* Initialize accumulators */
    q63_t sum1 = 0;
    q63_t sum2 = 0;
    q63_t sum3 = 0;
    q63_t sum4 = 0;

    /* For every row wise process, the pInVec pointer is set
     ** to the starting address of the vector */
    pInVec = pVec;

    /* Loop unrolling: process 2 columns per iteration */
    colCnt = numCols;

    /* Initialize pointers to the starting address of the column being processed */
    pInA1 = pSrcA + i;
    pInA2 = pInA1 + numCols;
    pInA3 = pInA2 + numCols;
    pInA4 = pInA3 + numCols;

    // Main loop: matrix-vector multiplication
    while (colCnt > 0u)
    {
        // Read 2 values from vector
        vecData = *(pInVec)++;

        // Read 8 values from the matrix - 2 values from each of 4 rows, and do multiply accumulate
        matData = *(pInA1)++;
        sum1 += (q63_t)matData * vecData;
        matData = *(pInA2)++;
        sum2 += (q63_t)matData * vecData;
        matData = *(pInA3)++;
        sum3 += (q63_t)matData * vecData;
        matData = *(pInA4)++;
        sum4 += (q63_t)matData * vecData;

        // Decrement the loop counter
        colCnt--;
    }

    /* Saturate and store the result in the destination buffer */
    *px++ = (q31_t)(sum1 >> 31);
    *px++ = (q31_t)(sum2 >> 31);
    *px++ = (q31_t)(sum3 >> 31);
    *px++ = (q31_t)(sum4 >> 31);

    i = i + numCols * 4;

    /* Decrement the row loop counter */
    row--;
}
```

</details>

In `arm_mat_vec_mult_q31.c`, [**Loop Unrolling**](https://en.wikipedia.org/wiki/Loop_unrolling) is employed to accelerate execution. The program processes **4 rows** at a time. Within the inner loop, it retrieves a single vector element and concurrently fetches elements from 4 distinct matrix rows, performing multiply-accumulate operations into four separate accumulators (`sum1` to `sum4`).

The bit-wise operation `>> 2` computes `numRows` divided by 4. Here, `row` represents the number of iterations required to process the data in blocks of 4 rows. For instance, if `numRows = 10`, then `row = numRows >> 2` yields `2`, meaning the "4-row block processing" loop will execute **twice**.

```c=
/* Process 4 rows at a time */
row = numRows >> 2;
i = 0u;
px = pDst;
```

During this process:

1. The first iteration handles Row0, Row1, Row2 and Row3.
2. The second iteration handles Row4, Row5, Row6 and Row7.
3. The remaining 2 rows are handled subsequently.

The number of remaining rows is calculated using the following code:

```c=
/* process any remaining rows */
row = numRows & 3u;
```

`numRows & 3u` is equivalent to `numRows % 4`, because `3` in binary is `0b11`, which acts as a mask that extracts the remainder modulo 4.
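The `>> 2` / `& 3u` pair can be packaged as two tiny helpers (names ours) to make the block/tail split explicit:

```c
#include <stdint.h>

/* Blocked-loop bookkeeping used by the 4-row unrolled path:
 * number of full 4-row blocks, then the leftover rows.
 * Helper names are ours, not from CMSIS-DSP. */
static uint32_t full_blocks(uint32_t numRows) { return numRows >> 2; } /* numRows / 4 */
static uint32_t tail_rows(uint32_t numRows)   { return numRows & 3u; } /* numRows % 4 */
```

For `numRows = 10` this yields 2 full blocks (rows 0–7) and 2 leftover rows, matching the walkthrough above; every row is covered because `numRows == 4 * full_blocks(numRows) + tail_rows(numRows)`.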
<details>
<summary>processing remaining rows detailed code</summary>

```c=
/* process any remaining rows */
row = numRows & 3u;
while (row > 0)
{
    q63_t sum = 0;
    pInVec = pVec;
    pInA1 = pSrcA + i;

    colCnt = numCols >> 1;

    while (colCnt > 0)
    {
        vecData = *(pInVec)++;
        vecData2 = *(pInVec)++;
        matData = *(pInA1)++;
        matData2 = *(pInA1)++;
        sum += (q63_t)matData * vecData;
        sum += (q63_t)matData2 * vecData2;
        colCnt--;
    }

    // process remainder of row
    colCnt = numCols & 1u;
    while (colCnt > 0)
    {
        sum += (q63_t)*pInA1++ * *pInVec++;
        colCnt--;
    }

    *px++ = (q31_t)(sum >> 31);
    i = i + numCols;
    row--;
}
```

</details>

#### Saturate Operation (Fixed-point Scaling):

```c=
/* Saturate and store the result in the destination buffer */
*px++ = (q31_t)(sum1 >> 31);
*px++ = (q31_t)(sum2 >> 31);
*px++ = (q31_t)(sum3 >> 31);
*px++ = (q31_t)(sum4 >> 31);
```

During the matrix-vector multiplication, multiplying two **Q0.31** elements produces a **64-bit** intermediate value (conceptually **Q1.62**), which must be rescaled back to the Q0.31 format. Therefore, the accumulated sums (`sum1` through `sum4`) are right-shifted by 31 bits, which is equivalent to dividing by $2^{31}$.

For example, given A: Q0.31 and B: Q0.31:

$$
\frac{\text{Integer}_A}{2^{31}} \times \frac{\text{Integer}_B}{2^{31}} = \frac{\text{Integer}_A \times \text{Integer}_B}{2^{62}}
$$

Scaling (right shift by 31):

$$
\text{Result} = \text{Accumulator} \gg 31 \approx \frac{\text{Accumulator}}{2^{31}}
$$

This process effectively truncates the lower 31 fractional bits.

> **Note:** Although the comment in the source code claims that this operation performs saturation, it actually performs truncation.

---

2. [**arm_mat_vec_mult_q15.c**](https://github.com/ARM-software/CMSIS-DSP/blob/main/Source/MatrixFunctions/arm_mat_vec_mult_q15.c)

- Format: [Q0.15](#==Q0.15==), range: $[-1.0, +0.9999694824]$
- Initialization:

```c=
uint32_t numRows = pSrcMat->numRows;
uint32_t numCols = pSrcMat->numCols;
const q15_t *pSrcA = pSrcMat->pData;
const q15_t *pInA1;      /* input data matrix pointer A of Q15 type */
const q15_t *pInA2;      /* input data matrix pointer A of Q15 type */
const q15_t *pInA3;      /* input data matrix pointer A of Q15 type */
const q15_t *pInA4;      /* input data matrix pointer A of Q15 type */
const q15_t *pInVec;     /* input data matrix pointer B of Q15 type */
q15_t *px;               /* Temporary output data matrix pointer */
uint16_t i, row, colCnt; /* loop counters */
q31_t matData, matData2, vecData, vecData2;
```

```c=
/* Initialize accumulators */
q63_t sum1 = 0;
q63_t sum2 = 0;
q63_t sum3 = 0;
q63_t sum4 = 0;
```

#### Optimization: [__SMLALD](https://developer.arm.com/documentation/ddi0597/2025-09/Base-Instructions/SMLALD--SMLALDX--Signed-Multiply-Accumulate-Long-Dual-)

<details>
<summary>SMLALD detailed code</summary>

```c=
/* ----------------------------------------------------------------------
 * Project:      CMSIS DSP Library
 * Title:        arm_mat_vec_mult_q15.c
 * Description:  Q15 matrix and vector multiplication
 *
 * $Date:        23 April 2021
 *
 * $Revision:    V1.9.0
 *
 * Target Processor: Cortex-M and Cortex-A cores
 * -------------------------------------------------------------------- */
/*
 * Copyright (C) 2010-2021 ARM Limited or its affiliates. All rights reserved.
 *
 * SPDX-License-Identifier: Apache-2.0
 *
 * Licensed under the Apache License, Version 2.0 (the License); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an AS IS BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include "arm_compiler_specific.h"
#include "dsp/matrix_functions.h"

/**
 * @ingroup groupMatrix
 */

/**
 * @addtogroup MatrixVectMult
 * @{
 */

/**
 * @brief Q15 matrix and vector multiplication.
 * @param[in]  *pSrcMat points to the input matrix structure
 * @param[in]  *pVec    points to input vector
 * @param[out] *pDst    points to output vector
 */
#if defined(ARM_MATH_MVEI) && !defined(ARM_MATH_AUTOVECTORIZE)

#include "arm_helium_utils.h"

ARM_DSP_ATTRIBUTE void arm_mat_vec_mult_q15(
    const arm_matrix_instance_q15 * pSrcMat,
    const q15_t *pSrcVec,
    q15_t       *pDstVec)
{
    const q15_t *pMatSrc = pSrcMat->pData;
    const q15_t *pMat0, *pMat1;
    uint32_t     numRows = pSrcMat->numRows;
    uint32_t     numCols = pSrcMat->numCols;
    q15_t       *px;
    int32_t      row;
    uint16_t     blkCnt;    /* loop counters */

    row = numRows;
    px = pDstVec;

    /*
     * compute 3x64-bit accumulators per loop
     */
    while (row >= 3)
    {
        q15_t const *pMat0Vec, *pMat1Vec, *pMat2Vec, *pVec;
        const q15_t *pMat2;
        q15_t const *pSrcVecPtr = pSrcVec;
        q63_t        acc0, acc1, acc2;
        q15x8_t      vecMatA0, vecMatA1, vecMatA2, vecIn;

        pVec = pSrcVec;
        /*
         * Initialize the pointer pIn1 to point to the starting address of the column being processed
         */
        pMat0 = pMatSrc;
        pMat1 = pMat0 + numCols;
        pMat2 = pMat1 + numCols;

        acc0 = 0LL;
        acc1 = 0LL;
        acc2 = 0LL;

        pMat0Vec = pMat0;
        pMat1Vec = pMat1;
        pMat2Vec = pMat2;
        pVec = pSrcVecPtr;

        blkCnt = numCols >> 3;
        while (blkCnt > 0U)
        {
            vecMatA0 = vld1q(pMat0Vec);
            pMat0Vec += 8;
            vecMatA1 = vld1q(pMat1Vec);
            pMat1Vec += 8;
            vecMatA2 = vld1q(pMat2Vec);
            pMat2Vec += 8;
            vecIn = vld1q(pVec);
            pVec += 8;

            acc0 = vmlaldavaq(acc0, vecIn, vecMatA0);
            acc1 = vmlaldavaq(acc1, vecIn, vecMatA1);
            acc2 = vmlaldavaq(acc2, vecIn, vecMatA2);

            blkCnt--;
        }
        /*
         * tail
         * (will be merged thru tail predication)
         */
        blkCnt = numCols & 7;
        if (blkCnt > 0U)
        {
            mve_pred16_t p0 = vctp16q(blkCnt);

            vecMatA0 = vld1q(pMat0Vec);
            vecMatA1 = vld1q(pMat1Vec);
            vecMatA2 = vld1q(pMat2Vec);
            vecIn = vldrhq_z_s16(pVec, p0);

            acc0 = vmlaldavaq(acc0, vecIn, vecMatA0);
            acc1 = vmlaldavaq(acc1, vecIn, vecMatA1);
            acc2 = vmlaldavaq(acc2, vecIn, vecMatA2);
        }

        *px++ = MVE_ASRL_SAT16(acc0, 15);
        *px++ = MVE_ASRL_SAT16(acc1, 15);
        *px++ = MVE_ASRL_SAT16(acc2, 15);

        pMatSrc += numCols * 3;
        /*
         * Decrement the row loop counter
         */
        row -= 3;
    }

    /*
     * process any remaining rows pair
     */
    if (row >= 2)
    {
        q15_t const *pMat0Vec, *pMat1Vec, *pVec;
        q15_t const *pSrcVecPtr = pSrcVec;
        q63_t        acc0, acc1;
        q15x8_t      vecMatA0, vecMatA1, vecIn;

        /*
         * For every row wise process, the pInVec pointer is set
         * to the starting address of the vector
         */
        pVec = pSrcVec;
        /*
         * Initialize the pointer pIn1 to point to the starting address of the column being processed
         */
        pMat0 = pMatSrc;
        pMat1 = pMat0 + numCols;

        acc0 = 0LL;
        acc1 = 0LL;

        pMat0Vec = pMat0;
        pMat1Vec = pMat1;
        pVec = pSrcVecPtr;

        blkCnt = numCols >> 3;
        while (blkCnt > 0U)
        {
            vecMatA0 = vld1q(pMat0Vec);
            pMat0Vec += 8;
            vecMatA1 = vld1q(pMat1Vec);
            pMat1Vec += 8;
            vecIn = vld1q(pVec);
            pVec += 8;

            acc0 = vmlaldavaq(acc0, vecIn, vecMatA0);
            acc1 = vmlaldavaq(acc1, vecIn, vecMatA1);

            blkCnt--;
        }
        /*
         * tail
         * (will be merged thru tail predication)
         */
        blkCnt = numCols & 7;
        if (blkCnt > 0U)
        {
            mve_pred16_t p0 = vctp16q(blkCnt);

            vecMatA0 = vld1q(pMat0Vec);
            vecMatA1 = vld1q(pMat1Vec);
            vecIn = vldrhq_z_s16(pVec, p0);

            acc0 = vmlaldavaq(acc0, vecIn, vecMatA0);
            acc1 = vmlaldavaq(acc1, vecIn, vecMatA1);
        }

        *px++ = MVE_ASRL_SAT16(acc0, 15);
        *px++ = MVE_ASRL_SAT16(acc1, 15);

        pMatSrc += numCols * 2;
        /*
         * Decrement the row loop counter
         */
        row -= 2;
    }

    if (row >= 1)
    {
        q15_t const *pMat0Vec, *pVec;
        q15_t const *pSrcVecPtr = pSrcVec;
        q63_t        acc0;
        q15x8_t      vecMatA0, vecIn;

        /*
         * For every row wise process, the pInVec pointer is set
         * to the starting address of the vector
         */
        pVec = pSrcVec;
        /*
         * Initialize the pointer pIn1 to point to the starting address of the column being processed
         */
        pMat0 = pMatSrc;

        acc0 = 0LL;

        pMat0Vec = pMat0;
        pVec = pSrcVecPtr;

        blkCnt = numCols >> 3;
        while (blkCnt > 0U)
        {
            vecMatA0 = vld1q(pMat0Vec);
            pMat0Vec += 8;
            vecIn = vld1q(pVec);
            pVec += 8;
            acc0 = vmlaldavaq(acc0, vecIn, vecMatA0);
            blkCnt--;
        }
        /*
         * tail
         * (will be merged thru tail predication)
         */
        blkCnt = numCols & 7;
        if (blkCnt > 0U)
        {
            mve_pred16_t p0 = vctp16q(blkCnt);

            vecMatA0 = vld1q(pMat0Vec);
            vecIn = vldrhq_z_s16(pVec, p0);
            acc0 = vmlaldavaq(acc0, vecIn, vecMatA0);
        }

        *px++ = MVE_ASRL_SAT16(acc0, 15);
    }
}

#else
#if defined(ARM_MATH_NEON)

#define TMP_DEFINE_AND_INIT(TMP) \
  int32x4_t TMP##1 = vdupq_n_s32(0);

#define REDUCE(sum,accum)                         \
   tmp1 = vqaddq_s64(accum.val[0],accum.val[1]);  \
   tmp1 = vqaddq_s64(tmp1,accum.val[2]);          \
   tmp1 = vqaddq_s64(tmp1,accum.val[3]);          \
   sum = vgetq_lane_s64(tmp1,0) + vgetq_lane_s64(tmp1,1);

#define MAT_SCALAR_DT q15_t
#define VEC_SCALAR_DT q15_t
#define VECTOR_ACC struct { \
   int64x2_t val[4];        \
}
#define VECTOR_DT int16x8_t
#define SCALAR_ACC int64_t
#define HALF_VECTOR_ACC int16x4_t
#define NBLANE 8
#define NBLANE_SHIFT 3

#define VECTOR_ACC_INIT(acc)      \
    acc.val[0] = vdupq_n_s64(0) ; \
    acc.val[1] = vdupq_n_s64(0) ; \
    acc.val[2] = vdupq_n_s64(0) ; \
    acc.val[3] = vdupq_n_s64(0) ;

#define SCALAR_ACC_INIT(acc) \
   acc = 0

#define VEC_LOAD(v,p)  \
    v = vld1q_s16((p))

#define VMAC(ACC,VA,VB)                                                \
    tmp1 = vmull_s16(vget_low_s32(VA),vget_low_s32(VB));               \
    ACC.val[0] = vaddq_s64(ACC.val[0],vmovl_s32(vget_low_s32(tmp1)));  \
    ACC.val[1] = vaddq_s64(ACC.val[1],vmovl_s32(vget_high_s32(tmp1))); \
    tmp1 = vmull_s16(vget_high_s32(VA),vget_high_s32(VB));             \
    ACC.val[2] = vaddq_s64(ACC.val[2],vmovl_s32(vget_low_s32(tmp1)));  \
    ACC.val[3] = vaddq_s64(ACC.val[3],vmovl_s32(vget_high_s32(tmp1)));

#define SCALAR_MAC(ACC,MAT,VEC) \
    ACC = ACC + (int64_t)(MAT) * (int64_t)(VEC)

#define STORE_SCALAR_ACC(DST,ACC) \
    DST = __SSAT(ACC>>15,16)

#define FUNCNAME arm_mat_vec_mult_q15
#define MATRIX_TYPE arm_matrix_instance_q15

#include "_arm_mat_vec_mult_neon.c"

#else
ARM_DSP_ATTRIBUTE void arm_mat_vec_mult_q15(const arm_matrix_instance_q15 *pSrcMat, const q15_t *pVec, q15_t *pDst)
{
    uint32_t numRows = pSrcMat->numRows;
    uint32_t numCols = pSrcMat->numCols;
    const q15_t *pSrcA = pSrcMat->pData;
    const q15_t *pInA1;      /* input data matrix pointer A of Q15 type */
    const q15_t *pInA2;      /* input data matrix pointer A of Q15 type */
    const q15_t *pInA3;      /* input data matrix pointer A of Q15 type */
    const q15_t *pInA4;      /* input data matrix pointer A of Q15 type */
    const q15_t *pInVec;     /* input data matrix pointer B of Q15 type */
    q15_t *px;               /* Temporary output data matrix pointer */
    uint16_t i, row, colCnt; /* loop counters */
    q31_t matData, matData2, vecData, vecData2;

    /* Process 4 rows at a time */
    row = numRows >> 2;
    i = 0u;
    px = pDst;

    /* The following loop performs the dot-product of each row in pSrcA with the vector */
    /* row loop */
    while (row > 0)
    {
        /* Initialize accumulators */
        q63_t sum1 = 0;
        q63_t sum2 = 0;
        q63_t sum3 = 0;
        q63_t sum4 = 0;

        /* For every row wise process, the pInVec pointer is set
         ** to the starting address of the vector */
        pInVec = pVec;

        /* Loop unrolling: process 2 columns per iteration */
        colCnt = numCols >> 1;

        /* Initialize pointers to the starting address of the column being processed */
        pInA1 = pSrcA + i;
        pInA2 = pInA1 + numCols;
        pInA3 = pInA2 + numCols;
        pInA4 = pInA3 + numCols;

        // Main loop: matrix-vector multiplication
        while (colCnt > 0u)
        {
            // Read 2 values from vector
            vecData = read_q15x2_ia (&pInVec);

            // Read 8 values from the matrix - 2 values from each of 4 rows, and do multiply accumulate
            matData = read_q15x2_ia (&pInA1);
            sum1 = __SMLALD(matData, vecData, sum1);
            matData = read_q15x2_ia (&pInA2);
            sum2 = __SMLALD(matData, vecData, sum2);
            matData = read_q15x2_ia (&pInA3);
            sum3 = __SMLALD(matData, vecData, sum3);
            matData = read_q15x2_ia (&pInA4);
            sum4 = __SMLALD(matData, vecData, sum4);

            // Decrement the loop counter
            colCnt--;
        }

        /* process any remaining columns */
        colCnt = numCols & 1u;
        if (numCols & 1u)
        {
            vecData = *pInVec++;
            sum1 += (q63_t)*pInA1++ * vecData;
            sum2 += (q63_t)*pInA2++ * vecData;
            sum3 += (q63_t)*pInA3++ * vecData;
            sum4 += (q63_t)*pInA4++ * vecData;
        }

        /* Saturate and store the result in the destination buffer */
        *px++ = (q15_t)(__SSAT((sum1 >> 15), 16));
        *px++ = (q15_t)(__SSAT((sum2 >> 15), 16));
        *px++ = (q15_t)(__SSAT((sum3 >> 15), 16));
        *px++ = (q15_t)(__SSAT((sum4 >> 15), 16));

        i = i + numCols * 4;

        /* Decrement the row loop counter */
        row--;
    }

    /* process any remaining rows */
    row = numRows & 3u;
    while (row > 0)
    {
        q63_t sum = 0;
        pInVec = pVec;
        pInA1 = pSrcA + i;

        // loop unrolling - process 4 elements at a time
        colCnt = numCols >> 2;

        while (colCnt > 0)
        {
            vecData = read_q15x2_ia (&pInVec);
            vecData2 = read_q15x2_ia (&pInVec);
            matData = read_q15x2_ia (&pInA1);
            matData2 = read_q15x2_ia (&pInA1);
            sum = __SMLALD(matData, vecData, sum);
            sum = __SMLALD(matData2, vecData2, sum);
            colCnt--;
        }

        // process remainder of row
        colCnt = numCols & 3u;
        while (colCnt > 0)
        {
            sum += (q63_t)*pInA1++ * *pInVec++;
            colCnt--;
        }

        *px++ = (q15_t)(__SSAT((sum >> 15), 16));
        i = i + numCols;
        row--;
    }
}
#endif /* ARM_MATH_NEON */
#endif /* defined(ARM_MATH_MVEI) */

/**
 * @} end of MatrixMult group
 */
```

</details>

Although the overall loop structure (including loop unrolling) is similar to the Q31 implementation, the Q15 version fully exploits the **32-bit register width**. The most significant difference is that the Q15 version uses **data packing**, combining two 16-bit values into a single 32-bit register, and relies on **__SMLALD** to perform a dual multiply-accumulate per instruction. Furthermore, unlike the Q31 version, which relies on implicit truncation, the Q15 implementation employs the **explicit saturation intrinsic (__SSAT)** at the output stage to ensure numerical safety.
**Data Packing:** Packing two Q15 (16-bit) values into one 32-bit register.

Register Layout:
```
| High 16-bit | Low 16-bit  |
| Element n+1 | Element n   |
```

```c=
// Read 2 values from vector
vecData = read_q15x2_ia (&pInVec);
matData = read_q15x2_ia (&pInA1);
```

The [`read_q15x2_ia`](https://arm-software.github.io/CMSIS-DSP/v1.15.0/arm__math__memory_8h.html#a40a3800c324674c186680fed6f3b70aa) function reads two 16-bit values from memory at once, loading them into a 32-bit container (such as `vecData` or `matData`).

`read_q15x2_ia` executes three key actions:
1. Fetch: Reads 32 bits of data (two consecutive 16-bit Q15 values) from the current memory address in a single operation.
2. Pack: Packs these values into a single 32-bit variable (`q31_t`):
    - Lower 16 bits (bit 0 to bit 15): Holds the n-th element.
    - Upper 16 bits (bit 16 to bit 31): Holds the (n+1)-th element.
3. Increment: Increments the pointer by 4 bytes to prepare for the next read.

By merging two **short integers** into a single **word** for parallel processing, this approach effectively **halves the number of memory accesses** and **doubles the overall data throughput**.

```c=
// Read 8 values from the matrix - 2 values from each of 4 rows, and do multiply accumulate
matData = read_q15x2_ia (&pInA1);
sum1 = __SMLALD(matData, vecData, sum1);
matData = read_q15x2_ia (&pInA2);
sum2 = __SMLALD(matData, vecData, sum2);
matData = read_q15x2_ia (&pInA3);
sum3 = __SMLALD(matData, vecData, sum3);
matData = read_q15x2_ia (&pInA4);
sum4 = __SMLALD(matData, vecData, sum4);
```

**[__SMLALD](https://developer.arm.com/documentation/ddi0597/2025-09/Base-Instructions/SMLALD--SMLALDX--Signed-Multiply-Accumulate-Long-Dual-)** stands for **Signed Multiply Accumulate Long Dual**. It performs two signed 16 x 16-bit multiplications and adds both products to a 64-bit accumulate operand. Overflow is only possible as a result of the 64-bit addition. This overflow is not detected if it occurs.
Instead, the result wraps around modulo $2^{64}$.

[Description of __SMLALD](https://arm-software.github.io/CMSIS_5/Core/html/group__intrinsic__SIMD__gr.html#gad80e9b20c1736fd798f897362273a146)

Parameters:
- val1: first 16-bit operands for each multiplication.
- val2: second 16-bit operands for each multiplication.
- val3: accumulate value.

```clike=
p1 = val1[15:0] * val2[15:0]
p2 = val1[31:16] * val2[31:16]
sum = p1 + p2 + val3[63:32]:val3[31:0]
res[63:32] = sum[63:32]
res[31:0] = sum[31:0]
```

In other words, `__SMLALD` performs the following operations:
1. Split the 32-bit `matData` into upper and lower 16-bit halves.
2. Split the 32-bit `vecData` into upper and lower 16-bit halves.
3. Perform two multiplications simultaneously:
    - $\text{lower 16 bits} \times \text{lower 16 bits}$
    - $\text{upper 16 bits} \times \text{upper 16 bits}$
4. Sum the products.
5. Accumulate the result into the 64-bit accumulator `sum1`.

$$ \text{sum1} = \text{sum1} + (\text{matData}_{\text{high}} \times \text{vecData}_{\text{high}}) + (\text{matData}_{\text{low}} \times \text{vecData}_{\text{low}}) $$

Unlike the Q31 implementation—where each instruction can process only one pair of values—the Q15 version uses `__SMLALD` to compute **two pairs at once**, effectively doubling the throughput.

#### Saturate Operation:

**[__SSAT](https://arm-software.github.io/CMSIS_5/Core/html/group__intrinsic__SIMD__gr.html#:~:text=uint32_t%20__SSAT16,)** saturates a signed value to a selected signed range.

```c=
*px++ = (q15_t)(__SSAT((sum >> 15), 16));
```

The `__SSAT` instruction constrains the shifted accumulator value (`sum >> 15`) to the range defined by the **target bit-width of 16** (specified as the second argument).

* Upper Bound: If the value exceeds the maximum 16-bit signed integer ($2^{15} - 1 = 32767$), it is saturated to $32767$.
* Lower Bound: If the value falls below the minimum 16-bit signed integer ($-2^{15} = -32768$), it is saturated to $-32768$.

---

3. 
[**arm_mat_vec_mult_q7.c**](https://github.com/ARM-software/CMSIS-DSP/blob/main/Source/MatrixFunctions/arm_mat_vec_mult_q7.c)
    - Format: Q0.7, range: $[-1.0, +0.9921875]$
    - Initialization:

```c=
uint32_t numRows = pSrcMat->numRows;
uint32_t numCols = pSrcMat->numCols;
const q7_t *pSrcA = pSrcMat->pData;
const q7_t *pInA1;       /* input data matrix pointer of Q7 type */
const q7_t *pInA2;       /* input data matrix pointer of Q7 type */
const q7_t *pInA3;       /* input data matrix pointer of Q7 type */
const q7_t *pInA4;       /* input data matrix pointer of Q7 type */
const q7_t *pInVec;      /* input data vector pointer of Q7 type */
q7_t *px;                /* output data pointer */
uint32_t i, row, colCnt; /* loop counters */
```

#### Optimization: [__SXTB16](https://mikhailarkhipov.github.io/ARM-doc/A32/sxtb16.html) + [__SMLAD](https://developer.arm.com/documentation/dui0472/m/ARMv6-SIMD-Instruction-Intrinsics/--smlad-intrinsic)

The Q7 implementation follows the strategy **"Packed Load -> Unpack -> Parallel [MAC](https://zh.wikipedia.org/zh-tw/%E4%B9%98%E7%A9%8D%E7%B4%AF%E5%8A%A0%E9%81%8B%E7%AE%97)"**.

> On 32-bit ARM Cortex-M4/M7 processors, there is **no instruction** that can directly perform "four 8-bit multiplies + accumulate" in one operation.
> Therefore, we must take an **indirect** approach.

```c=
// Read 4 values from vector
vecData = read_q7x4_ia (&pInVec);
vecData2 = __SXTB16(__ROR(vecData, 8));
vecData = __SXTB16(vecData);

// Read 16 values from the matrix - 4 values from each of 4 rows, and do multiply accumulate
matData = read_q7x4_ia (&pInA1);
matData2 = __SXTB16(__ROR(matData, 8));
matData = __SXTB16(matData);
sum1 = __SMLAD(matData, vecData, sum1);
sum1 = __SMLAD(matData2, vecData2, sum1);
```

After calling `read_q7x4_ia`, a single 32-bit register is **packed** with **four Q7 values**.

Register Layout:
```
|    8-bit    |    8-bit    |    8-bit    |    8-bit    |
| Element n+3 | Element n+2 | Element n+1 | Element n   |
```

This is where the Q7 approach diverges significantly from Q15.
Since the hardware lacks a direct instruction for multiplying packed 8-bit values, we must perform an intermediate step: **unpacking** the four 8-bit numbers into two sets of 16-bit numbers. Once unpacked, we can leverage the `__SMLAD` instruction to replicate the efficiency of the Q15 implementation.

To accomplish this, ARM provides the `__SXTB16` (**Signed Extend Byte 16**) instruction. This instruction extracts **Byte 0** (bits 0-7) and **Byte 2** (bits 16-23) from a 32-bit register and **sign-extends** them into two 16-bit signed values.

```
Source (32 bits): | Byte 3 | Byte 2 | Byte 1 | Byte 0 |
                            ^^^^^^            ^^^^^^
                      Extract & sign-extend these two bytes
Result (32 bits): | Sign-Extended Byte 2 | Sign-Extended Byte 0 |
```

```c=
// Read 4 values from vector
vecData = read_q7x4_ia (&pInVec);
vecData2 = __SXTB16(__ROR(vecData, 8));
vecData = __SXTB16(vecData);
```

```c=
vecData = read_q7x4_ia(&pInVec);
```

The function `read_q7x4_ia` fetches four contiguous bytes from memory in a single operation. Assuming a memory sequence of elements $n, n+1, n+2, n+3$, the layout within the 32-bit register `vecData` is structured as follows:

```
| Bit 31-24   | Bit 23-16   | Bit 15-8    | Bit 7-0     |
| Element n+3 | Element n+2 | Element n+1 | Element n   |
| (Byte 3)    | (Byte 2)    | (Byte 1)    | (Byte 0)    |
```

The [`__ROR`](https://developer.arm.com/documentation/dui0375/g/Compiler-specific-Features/--ror-intrinsic) intrinsic rotates the value right by a specified number of bits, such that the bits shifted out from the right (LSB) **wrap around** to the left (MSB). For example:

```c=
uint32_t uint32value = 0x12345678;
uint32value = __ROR(uint32value, 8);
```

```
Orig. Byte: |  3 |  2 |  1 |  0 |
Before    : | 12 | 34 | 56 | 78 |
                               ^
               ________________|
              | wrap around to MSB
              v
After     : | 78 | 12 | 34 | 56 |
Orig.
Byte: |  0 |  3 |  2 |  1 |
```

```c=
vecData2 = __SXTB16(__ROR(vecData, 8));
```

Since `__SXTB16` extracts **Byte 0** and **Byte 2** from the register, we use `__ROR(vecData, 8)` to move **Byte 1** and **Byte 3** into those target positions.

```c=
vecData = __SXTB16(vecData);
```

Then, we use `__SXTB16(vecData)` directly to extract the remaining **Byte 0** and **Byte 2**.

Repeat the same operation to extract the **corresponding bytes** from `matData`.

```c=
matData = read_q7x4_ia (&pInA1);
matData2 = __SXTB16(__ROR(matData, 8)); // Extracts elements n+1, n+3
matData = __SXTB16(matData);            // Extracts elements n, n+2
```

```
          | Bit 31-16 | Bit 15-0 |
vecData : |  Byte 2   |  Byte 0  |
vecData2: |  Byte 3   |  Byte 1  |
matData : |  Byte 2   |  Byte 0  |
matData2: |  Byte 3   |  Byte 1  |
```

With the data **sign-extended to 16-bit integers**, we can once again **leverage the powerful `__SMLAD` instruction**.

```c=
sum1 = __SMLAD(matData, vecData, sum1);
sum1 = __SMLAD(matData2, vecData2, sum1);
```

#### [`__SMLAD`](https://developer.arm.com/documentation/dui0491/g/ARMv6-SIMD-Instruction-Intrinsics/--smlad-intrinsic) vs. [`__SMLALD`](https://developer.arm.com/documentation/dui0491/i/ARMv6-SIMD-Instruction-Intrinsics/--smlald-intrinsic)

For Q7 operations, we typically employ `__SMLAD` (which uses a 32-bit accumulator). In contrast, high-precision Q15 operations often require `__SMLALD` (which uses a 64-bit accumulator).

- `__SMLAD` : $16\text{-bit} \times 16\text{-bit} + 16\text{-bit} \times 16\text{-bit} + 32\text{-bit} \to 32\text{-bit}$
- `__SMLALD`: $16\text{-bit} \times 16\text{-bit} + 16\text{-bit} \times 16\text{-bit} + 64\text{-bit} \to 64\text{-bit}$

> Note: Since the product of two Q7 values is relatively small (resulting in a Q14 value), it is unlikely to cause a 32-bit overflow during standard accumulation. Therefore, using `__SMLAD` provides sufficient precision and is well-suited for this use case.
#### Saturate Operation:

```c=
/* Saturate and store the result in the destination buffer */
*px++ = (q7_t)(__SSAT((sum1 >> 7), 8));
*px++ = (q7_t)(__SSAT((sum2 >> 7), 8));
*px++ = (q7_t)(__SSAT((sum3 >> 7), 8));
*px++ = (q7_t)(__SSAT((sum4 >> 7), 8));
```

Since Q7 $\times$ Q7 = Q14 ($2^{-7} \times 2^{-7} = 2^{-14}$), we need to divide by $2^{7}$ (effectively implemented as `>> 7`) to scale the result.

* Upper Bound: If the value exceeds the maximum 8-bit signed integer ($2^{7} - 1 = 127$), it is saturated to $127$.
* Lower Bound: If the value falls below the minimum 8-bit signed integer ($-2^{7} = -128$), it is saturated to $-128$.

---

1. **Truncation Error:**
    * Error Range: $[0, 2^{-n}]$ for $\text{Qm.n}$
        - Q31: shift right 31 bits - $|\text{error}| < 2^{-31} (\approx 4.65661287 \times 10^{-10})$
        - Q15: shift right 15 bits - $|\text{error}| < 2^{-15} (\approx 3.05175781 \times 10^{-5})$
        - Q7 : shift right 7 bits - $|\text{error}| < 2^{-7} (= 0.0078125)$

2. **Saturation Error:**
    * If **Result > Max**, then **Result = Max**, where **Max** is the max value of Q31/Q15/Q7.
    * If **Result < Min**, then **Result = Min**, where **Min** is the min value of Q31/Q15/Q7.
    * Definition: $\text{Error} = \text{Value}_{\text{True}} - \text{Value}_{\text{Saturated}}$
    * Purpose: **The Lesser of Two Evils**. Although saturation introduces a loss of precision by altering the true result, it is a critical safety mechanism designed to prevent the catastrophic consequences of [**Wrap-around**](https://en.wikipedia.org/wiki/Integer_overflow) (Overflow).

**What is [Wrap-around](https://en.wikipedia.org/wiki/Integer_overflow)?**
> In standard binary arithmetic, when a value exceeds the maximum limit, it "wraps around" to the minimum limit.

3. 
**Applications:**
    - References:
        - [Digital signal processing for STM32 microcontrollers using CMSIS](https://www.st.com/resource/en/application_note/an4841-digital-signal-processing-for-stm32-microcontrollers-using-cmsis-stmicroelectronics.pdf)
        - [ARM-software CMSIS-NN](https://github.com/ARM-software/CMSIS-NN)
        - [ARM graphic equalizer](https://github.com/istarc/stm32/blob/master/STM32F4-Discovery_FW_V1.1.0/Libraries/CMSIS/DSP_Lib/Examples/arm_graphic_equalizer_example/arm_graphic_equalizer_example_q31.c)
        - [TensorFlow Lite Micro](https://github.com/search?q=repo%3Atensorflow%2Ftflite-micro+%22fixed-point%22&type=code)
        - [Q15 Fast FIR](https://github.com/ARM-software/CMSIS_4/blob/master/CMSIS/DSP_Lib/Source/FilteringFunctions/arm_fir_fast_q15.c)
    - Q31: FFT, Low-pass FIR Filter, [ARM graphic equalizer](https://github.com/istarc/stm32/blob/master/STM32F4-Discovery_FW_V1.1.0/Libraries/CMSIS/DSP_Lib/Examples/arm_graphic_equalizer_example/arm_graphic_equalizer_example_q31.c)
    - Q15: FFT, Low-pass FIR Filter, High-pass FIR Filter, [Q15 Fast FIR](https://github.com/ARM-software/CMSIS_4/blob/master/CMSIS/DSP_Lib/Source/FilteringFunctions/arm_fir_fast_q15.c)
    - Q7 : TinyML, Edge AI (such as [CMSIS-NN](https://github.com/ARM-software/CMSIS-NN))

---

### [MVE](https://www.arm.com/zh-tw/technologies/helium)/[Neon](https://www.arm.com/zh-tw/technologies/neon)

#### MVE (M-profile Vector Extension) / Helium vs. Neon Architecture

MVE (Helium) is an Arm vector extension technology designed specifically for Cortex-M microcontrollers. It aims to bring high-performance DSP and ML capabilities, previously exclusive to Cortex-A (Neon) processors, to embedded endpoints.
|**Feature**|**MVE (Helium)**|**Neon**|
|---|---|---|
|Target Architecture|Arm Cortex-M (e.g., Cortex-M55, M85)|Arm Cortex-A / Cortex-R|
|Vector Width|**128-bit** (Shared with FPU)|**128-bit** (Separate Register File)|
|Key Advantage|ML/DSP Acceleration in Embedded Systems|High Throughput and General-Purpose Acceleration|
|Primary Use Case|**TinyML/Edge AI Inference**|**Multimedia Processing (Video Codecs, Image Processing), High-Performance DSP**|

#### In [CMSIS-DSP](https://github.com/ARM-software/CMSIS-DSP)

All MVE vector registers have a fixed width of **128 bits**. The vector is divided into multiple **Lanes** based on the data type size.

- **Q31 (32-bit):**
    - 4 Lanes: $128 \text{ bits} / 32 = \mathbf{4}$
    - Data Type: `q31x4_t`
    - Operation: Performs 4 multiplications simultaneously in a single instruction
- **Q15 (16-bit):**
    - 8 Lanes: $128 \text{ bits} / 16 = \mathbf{8}$
    - Data Type: `q15x8_t`
    - Operation: Performs 8 multiplications simultaneously in a single instruction
    - Throughput: **2x** that of Q31
- **Q7 (8-bit):**
    - 16 Lanes: $128 \text{ bits} / 8 = \mathbf{16}$
    - Data Type: `q7x16_t`
    - Operation: Performs 16 multiplications simultaneously in a single instruction
    - Throughput: **4x** that of Q31

**What is a Lane?**
> Note: A **Lane** represents the smallest unit of parallel processing within a SIMD vector register.
---

In [arm_mat_vec_mult_q31.c](https://github.com/ARM-software/CMSIS-DSP/blob/main/Source/MatrixFunctions/arm_mat_vec_mult_q31.c) and [arm_mat_vec_mult_q15.c](https://github.com/ARM-software/CMSIS-DSP/blob/main/Source/MatrixFunctions/arm_mat_vec_mult_q15.c):

* Using `vmlaldavaq` (Long)

```c=
// Q31 implementation
vecMatA0 = vld1q(pMat0Vec);
pMat0Vec += 4;

// Q15 implementation
vecMatA0 = vld1q(pMat0Vec);
pMat0Vec += 8;
```

```c=
acc0 = vmlaldavaq(acc0, vecIn, vecMatA0);
acc1 = vmlaldavaq(acc1, vecIn, vecMatA1);
acc2 = vmlaldavaq(acc2, vecIn, vecMatA2);
```

[`vld1q`](https://developer.arm.com/architectures/instruction-sets/intrinsics/vld1q_u8) stands for Vector Load 1 Quad-word. This instruction fetches a contiguous **128-bit** block of data from memory into a vector register in a single operation.

- Q31 (32-bit): Loads **4** $\times$ 32-bit elements.
- Q15 (16-bit): Loads **8** $\times$ 16-bit elements.

[`vmlaldavaq`](https://developer.arm.com/architectures/instruction-sets/intrinsics/%5B__arm_%5Dvmlaldavaq%5B_s16%5D) stands for Vector Multiply Accumulate **Long** Dual Across Vector Accumulate. It performs the dot product with a **64-bit accumulator** to prevent overflow.

- Q31: Multiplies corresponding 32-bit elements to generate **64-bit products**, sums these products across the vector, and accumulates the result into the **64-bit** accumulator (`acc0`).
- Q15: Multiplies corresponding 16-bit elements to generate **32-bit products**, sums these products across the vector, and accumulates the result into the **64-bit** accumulator (`acc0`).
In [arm_mat_vec_mult_q7.c](https://github.com/ARM-software/CMSIS-DSP/blob/main/Source/MatrixFunctions/arm_mat_vec_mult_q7.c):

* Using `vmladavaq` (Standard)

```c=
vecMatA0 = vld1q(pMat0Vec);
pMat0Vec += 16;
```

```c=
acc0 = vmladavaq(acc0, vecIn, vecMatA0);
acc1 = vmladavaq(acc1, vecIn, vecMatA1);
acc2 = vmladavaq(acc2, vecIn, vecMatA2);
acc3 = vmladavaq(acc3, vecIn, vecMatA3);
```

[`vmladavaq`](https://developer.arm.com/architectures/instruction-sets/intrinsics/%5B__arm_%5Dvmladavaq%5B_s8%5D) stands for Vector Multiply Accumulate Dual Across Vector Accumulate.

- Q7: Multiplies corresponding **16 pairs** of 8-bit elements to generate **16-bit products**. These products are then summed together (including the existing `acc0` value), and the final result is accumulated into a **32-bit** accumulator (`acc0`).

___

In the C fallback implementation, Q7 is the least efficient format due to the overhead of data unpacking. Since the CPU lacks native SIMD instructions for "8-bit $\times$ 8-bit accumulation into 32-bit," the code must use `__SXTB16` to promote 8-bit values to 16-bit before performing arithmetic operations.

In contrast, the MVE implementation handles Q7 with remarkable efficiency. MVE features a native 8-bit dot-product instruction (`vmladavaq`), which eliminates the need for unpacking entirely. It directly multiplies 16 pairs of Q7 elements from the vector registers and accumulates the results in a single operation.

In conclusion, the introduction of MVE turns Q7 from the most cumbersome format into one of the most efficient. With native 8-bit SIMD support, MVE removes the unpacking bottleneck inherent in scalar implementations, enabling Q7 data to fully exploit the 128-bit vector width and achieve significantly higher throughput in TinyML and DSP workloads.
## Examples 2 ==DOOM==

> GitHub: [DOOM](https://github.com/id-Software/DOOM)
> GitHub: [m_fixed.c](https://github.com/id-Software/DOOM/blob/master/linuxdoom-1.10/m_fixed.c)

[DOOM](https://en.wikipedia.org/wiki/Doom_(franchise)) is a legendary first-person shooter developed by id Software and first released in 1993. It is widely regarded as one of the most influential titles in gaming history.

> The following content is sourced from [Wikipedia](https://en.wikipedia.org/wiki/Doom_(1993_video_game)):
> Doom was the last first-person shooter game by id Software to use a 16.16 fixed point representation for all of its non-integer computations, including map system, geometry, rendering, and player movement. This representation is still used in modern Doom source ports.

The implementation in [m_fixed.c](https://github.com/id-Software/DOOM/blob/master/linuxdoom-1.10/m_fixed.c) utilizes the Q16.16 format. This serves as a practical example of the 32-bit Extension discussed in the previous section.

---

### [Recap Q16.16](#==Q15.16==)

**Q16.16 is functionally identical to Q15.16.** The term "Q16.16" is simply an **alternative notation** often used in software development to indicate that the 32-bit word is split evenly: 16 bits for the integer part (including sign) and 16 bits for the fraction.

**Structure:**
|Sign|Integer|Precision|
|:---:|:---:|:---:|
|1 bit|15 bits|16 bits|

**Range:** $[-32{,}768.0, +32{,}767.9999847]$
**Precision:** $2^{-16} \approx 0.000015259$

---

### [m_fixed.h](https://github.com/id-Software/DOOM/blob/master/linuxdoom-1.10/m_fixed.h)

```c=
//
// Fixed point, 32bit as 16.16.
//
#define FRACBITS 16
#define FRACUNIT (1<<FRACBITS)

typedef int fixed_t;
```

[m_fixed.h](https://github.com/id-Software/DOOM/blob/master/linuxdoom-1.10/m_fixed.h) defines a `fixed_t` type for storing Q16.16 fixed-point values.
Since early 1990s CPUs—such as the Intel 80486SX—often lacked a floating-point unit (FPU), the DOOM development team used this simple fixed-point representation to perform calculations that would otherwise require floating-point arithmetic.

```c
#define FRACBITS 16
```

Since DOOM utilizes the Q16.16 fixed-point format, 16 bits are allocated for the integer part, and 16 bits for the fractional part.

```c
#define FRACUNIT (1<<FRACBITS)
```

`FRACUNIT` represents the value 1.0 in fixed-point arithmetic. It is the scaling factor ($2^{16}$ or 65536) used to convert between integers and fixed-point numbers.

---

Back to m_fixed.c, I will explain the code in detail.

<details>
<summary>m_fixed.c full code</summary>

```c=
// Emacs style mode select -*- C++ -*-
//-----------------------------------------------------------------------------
//
// $Id:$
//
// Copyright (C) 1993-1996 by id Software, Inc.
//
// This source is available for distribution and/or modification
// only under the terms of the DOOM Source Code License as
// published by id Software. All rights reserved.
//
// The source is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// FITNESS FOR A PARTICULAR PURPOSE. See the DOOM Source Code License
// for more details.
//
// $Log:$
//
// DESCRIPTION:
//	Fixed point implementation.
//
//-----------------------------------------------------------------------------

static const char rcsid[] = "$Id: m_bbox.c,v 1.1 1997/02/03 22:45:10 b1 Exp $";

#include "stdlib.h"

#include "doomtype.h"
#include "i_system.h"

#ifdef __GNUG__
#pragma implementation "m_fixed.h"
#endif
#include "m_fixed.h"

// Fixme. __USE_C_FIXED__ or something.

fixed_t FixedMul ( fixed_t a, fixed_t b )
{
    return ((long long) a * (long long) b) >> FRACBITS;
}

//
// FixedDiv, C version.
//

fixed_t FixedDiv ( fixed_t a, fixed_t b )
{
    if ( (abs(a)>>14) >= abs(b))
        return (a^b)<0 ?
MININT : MAXINT;
    return FixedDiv2 (a,b);
}

fixed_t FixedDiv2 ( fixed_t a, fixed_t b )
{
#if 0
    long long c;
    c = ((long long)a<<16) / ((long long)b);
    return (fixed_t) c;
#endif

    double c;

    c = ((double)a) / ((double)b) * FRACUNIT;

    if (c >= 2147483648.0 || c < -2147483648.0)
        I_Error("FixedDiv: divide by zero");
    return (fixed_t) c;
}
```

</details>

**1. FixedMul**

```c=
fixed_t FixedMul ( fixed_t a, fixed_t b )
{
    return ((long long) a * (long long) b) >> FRACBITS;
}
```

From [Data Type Ranges](https://learn.microsoft.com/zh-tw/cpp/cpp/data-type-ranges?view=msvc-170), I built the following table.

| type | size | range |
|:-------------:|:-------------:|:-------------:|
| int | 4 bytes | $-2{,}147{,}483{,}648$ to $2{,}147{,}483{,}647$ |
| long long | 8 bytes | $-9{,}223{,}372{,}036{,}854{,}775{,}808$ to $9{,}223{,}372{,}036{,}854{,}775{,}807$ |

The code above performs multiplication between two **Q16.16** fixed-point numbers. Since each value is scaled by $2^{16}$, their product is scaled by $2^{32}$. To convert the result back to the Q16.16 format, we rescale it by shifting the product right by `FRACBITS` (16 bits).

**Type casting is necessary** because multiplying two Q16.16 numbers results in a 64-bit value (conceptually **Q32.32**). The **lower 32 bits** represent the **fractional part**, while the **upper 32 bits** contain the **integer part**. Without casting to **64-bit (`long long`)**, the standard 32-bit multiplication would lose the upper 32 bits, discarding the integer portion of the result.

For instance, without type casting:

```c
fixed_t a = 50000;
fixed_t b = 50000;
```

a * b = $2{,}500{,}000{,}000$. However, since the maximum value of a 32-bit signed integer is only $2{,}147{,}483{,}647$, this calculation results in **signed overflow**. To solve this, we cast the operands to `long long`. This promotes the calculation to a **64-bit intermediate value**, effectively **preventing integer overflow**.

Take $1.0 \times 1.0$ as another example.
Even though it is not a large number, $1.0 \times 1.0$ will cause an overflow. This is because 1.0 in Q16.16 representation is $2^{16}$, so the operation becomes $2^{16} \times 2^{16} = 2^{32}$. This value ($2^{32}$) exceeds the 32-bit signed integer maximum ($2^{31}-1$) and even exceeds the unsigned integer maximum ($2^{32}-1$).

<p style="text-align:center;"> <img src="https://hackmd.io/_uploads/ByBMYPlfbx.png" width="500"> </p>

---

**2. FixedDiv & FixedDiv2**

```c=
fixed_t FixedDiv ( fixed_t a, fixed_t b )
{
    if ( (abs(a)>>14) >= abs(b))
        return (a^b)<0 ? MININT : MAXINT;
    return FixedDiv2 (a,b);
}
```

```c=
fixed_t FixedDiv2 ( fixed_t a, fixed_t b )
{
#if 0
    long long c;
    c = ((long long)a<<16) / ((long long)b);
    return (fixed_t) c;
#endif

    double c;

    c = ((double)a) / ((double)b) * FRACUNIT;

    if (c >= 2147483648.0 || c < -2147483648.0)
        I_Error("FixedDiv: divide by zero");
    return (fixed_t) c;
}
```

The role of `FixedDiv` is to perform a quick **overflow-protection** check. If `a` has a very large magnitude and `b` has a very small magnitude, the value of `a / b` may exceed the range that a 32-bit integer can represent. Without handling this case, the result would be incorrect. To prevent this, the function predicts overflow and clamps the result. It uses `a^b` (bitwise XOR) to check the signs:

* Same sign: Returns MAXINT.
* Different signs: Returns MININT.

It calls `FixedDiv2` to perform the actual division only if the value is confirmed to be within the safe range.

#### Question: Why use (a^b) < 0 to check signs?

> Instead of the verbose `if ((a>0 && b<0) || (a<0 && b>0))`, the expression `(a^b) < 0` compiles to fewer CPU instructions, making it a highly optimized trick.
>
> Example: Let's take two 8-bit numbers where both are negative (MSB is 1): a = 10001111 (Negative), b = 10001110 (Negative)
>
> Bitwise XOR Operation:
> ```
>   10001111 (a)
> ^ 10001110 (b)
>   --------
>   00000001 (Result)
>   ^
>   Look at the MSB
> ```
> The result is 00000001.
> In signed integers, an MSB of 0 means the number is positive.
> Since the result is positive, the condition `(a^b) < 0` is FALSE.
> This confirms that a and b have the SAME sign.

In conclusion, only the MSB (sign bit) of the XOR result matters.

|XOR Result|Meaning|Reason|
|---|---|---|
|MSB = 1 (Result < 0)|Signs differ|Positive ^ Negative = Negative|
|MSB = 0 (Result >= 0)|Signs match|Positive ^ Positive or Negative ^ Negative = Positive|

#### Question: Why use (abs(a)>>14) to check overflow?

Our goal is to determine whether the dividend `a` is significantly larger than the divisor `b`, which would cause the division to overflow. In Q16.16 fixed-point arithmetic, the division is defined as:

$$Result = \frac{A \times 2^{16}}{B}$$

##### Why is this scaling necessary?

> Since both `a` and `b` are already scaled by $2^{16}$, dividing them directly causes the scaling factors to cancel out ($\frac{2^{16}}{2^{16}} = 1$), reverting the result to a standard integer. To maintain the Q16.16 format, we must explicitly re-apply the $2^{16}$ scaling factor to the numerator.

In other words, the formula for Q16.16 fixed-point division is:

$$ Result = \frac{A \times 2^{16}}{B} $$

To prevent overflow, we require the result to fit within a signed 32-bit integer:

$$Result < 2^{31} \text{ (MAXINT)}$$

Namely:

$$ \frac{|A| \times 2^{16}}{|B|} < 2^{31} $$

We can simplify this by dividing both sides by $2^{16}$:

$$ \frac{|A|}{|B|} < \frac{2^{31}}{2^{16}} $$
$$ \frac{|A|}{|B|} < 2^{15} $$
$$ |A| < |B| \times 2^{15} $$

Equivalently, we can quickly check for overflow by testing whether $(|A| >> 15) < |B|$.

**Conclusion**: Theoretically, as long as $(|A| >> 15) < |B|$ holds, the division is safe.

#### Question: Why does DOOM use 14 instead of 15?

This is a strategic engineering decision to prioritize stability over range, focusing on **Sign Bit Protection**.

> The theoretical limit `>>15` allows results up to $2^{31} - 1$ (`0111....111`).
> If the result exceeds this by even 1 due to calculation errors, the bit pattern becomes `1000....`. In signed integers, this flips the **Sign Bit**, causing the value to wrap around from **MAX_INT** to **MIN_INT** instantaneously.
>
> Since `FixedDiv2` uses `double` internally, converting back to `int` near the boundary is risky. A tiny precision error could push the value over the edge, triggering the sign flip.
>
> By using `>>14`, the code caps the result at $2^{30}$ (approx. 1 billion). This creates a massive buffer zone before reaching $2^{31}$ (approx. 2 billion), guaranteeing that the **Sign Bit** is never accidentally touched.

---

`FixedDiv2` is where the actual calculation happens.

#### 1. Disabled version (integer-only implementation)

```c=
#if 0
    long long c;
    c = ((long long)a<<16) / ((long long)b);
    return (fixed_t) c;
#endif
```

In this disabled block, `a` and `b` are first cast to `long long` to prevent overflow. Then, `a` is left-shifted by 16 bits (`a << 16`, equivalent to multiplying by $2^{16}$) and divided by `b`. Finally, the result is cast back to `fixed_t`.

##### Question: Why Type Casting?

Without type casting, shifting `a` left by 16 bits essentially multiplies it by 65,536. This intermediate value often exceeds the 32-bit integer limit, causing an immediate overflow. Therefore, casting to `long long` is necessary.

**Note:** Even after casting to `long long`, the **order of operations** still matters. If you divide before shifting, as shown below:

```c
long long c;
c = ((long long)a / (long long)b) << 16;
return (fixed_t) c;
```

This code fails because `a/b` performs **integer division** first. This immediately truncates and discards the fractional part, leaving only the integer quotient. Shifting the result afterwards is meaningless because the precision has already been lost.
Let's take $a = 1.0$, $b = 2.0$ in Q16.16 format:

$A = 1.0 \times 65536 = 65536$ (`0x00010000`)
$B = 2.0 \times 65536 = 131072$ (`0x00020000`)

**CASE 1: The Wrong Way**

```c
c = ((long long)a / (long long)b) << 16;
```

- Step 1: `65536 / 131072` (integer division results in `0`)
- Step 2: `0 << 16` (result is still `0`)
- Result: `0.0` (Incorrect)

**CASE 2: The Right Way**

```c
c = ((long long)a<<16) / ((long long)b);
```

- Step 1: `0x00010000 << 16` = `4294967296` (`0x100000000`)
- Step 2: `4294967296 / 131072` = `32768`
- Result: `32768` (`0x8000`) corresponds to **0.5** in Q16.16. (Correct)

<p style="text-align:center;"> <img src="https://hackmd.io/_uploads/rk1pOcbzWl.png" width="500"> </p>

#### 2. Version in use (floating-point implementation)

```c=
    double c;

    c = ((double)a) / ((double)b) * FRACUNIT;

    if (c >= 2147483648.0 || c < -2147483648.0)
        I_Error("FixedDiv: divide by zero");
    return (fixed_t) c;
```

##### Question: Why Type Casting?

First, using direct integer division (`a / b`) would lose fractional precision. To avoid this, both values are cast to `double` for the division, and the result is then scaled back into the Q16.16 format. Since a 32-bit signed integer is limited to the range $-2{,}147{,}483{,}648$ to $2{,}147{,}483{,}647$, exceeding these bounds causes an overflow, making a boundary check necessary. Notably, although the error message reads "divide by zero", DOOM uses this same message for both division-by-zero and overflow conditions.

##### Question: Why use double to process fixed-point numbers?

The reason for using the `double` data type to process fixed-point numbers during the development of DOOM (1993) is rooted in the C language standard of the time. At that point, the prevailing standard was C89 (or [ANSI C](https://zh.wikipedia.org/zh-tw/ANSI_C)), which did not support the `long long` integer type. The only standardized data type guaranteed to handle 64-bit or greater precision was `double`.
Consequently, the DOOM development team adopted `double` as a practical stopgap to manage their 32.32 fixed-point values. Native support for the `long long` integer type was only introduced later with the [C99](https://zh.wikipedia.org/zh-tw/C99#%E5%8F%83%E8%80%83%E8%B3%87%E6%96%99) standard. To ensure the code could be compiled on all C89-compliant compilers of the time, the developers avoided any `long long`-based implementation.

##### Question: If the target hardware lacks an FPU, won't using `double` cause performance issues?

Yes, it would, if DOOM actually relied on `double` at runtime. However, the `double`-based code in `m_fixed.c` exists only as a **portable C fallback**, ensuring the engine can be compiled on any C89-compliant compiler, including those without a `long long` type. In the actual shipped MS-DOS version, these critical arithmetic functions were implemented in *assembly*, allowing the engine to use CPU registers directly for integer math, completely bypassing the FPU and avoiding slow software emulation.

##### Question: Why is the order of operations not an issue here?

Unlike the integer implementation, `a` and `b` are cast to `double` **before** the division. In floating-point arithmetic, division preserves the fractional part (e.g., $1.0/2.0 = 0.5$), whereas integer division would truncate the result to zero immediately.

## References

1. [Fixed-point arithmetic - *Wikipedia*](https://en.wikipedia.org/wiki/Fixed-point_arithmetic)
2. [Q (number format) - *Wikipedia*](https://en.wikipedia.org/wiki/Q_(number_format))
3. [Term project: Optimize 2D line drawing for RV32IM using fixed-point arithmetic](https://hackmd.io/@maromaSamsa/HkjefPbFs)
4. [Fixed Point Arithmetic - *Cornell University ECE4760*](https://people.ece.cornell.edu/land/courses/ece4760/PIC32/index_fixed_point.html)
5. [Floating Point Numbers | Fixed Point Number vs Floating Point Numbers](https://www.youtube.com/watch?v=zVM8NKXsboA)
6. [Dynamic range - *Wikipedia*](https://en.wikipedia.org/wiki/Dynamic_range)
7. [Getting GCC to generate the SH2 Assembly for Q16.16 fixed-point multiplication](https://segaxtreme.net/threads/getting-gcc-to-generate-the-sh2-assembly-for-q16-16-fixed-point-multiplication.25530/)
8. [Off-by-one error - *Wikipedia*](https://en.wikipedia.org/wiki/Off-by-one_error)
9. [Catastrophic cancellation - *Wikipedia*](https://en.wikipedia.org/wiki/Catastrophic_cancellation)
10. [Neon - *ARM*](https://www.arm.com/technologies/neon)
11. [MVE - *ARM*](https://www.arm.com/zh-tw/architecture/cpu/m-profile)

## Acknowledgment

This article was proofread and refined with the assistance of AI tools:

1. [Microsoft Copilot](https://copilot.microsoft.com/shares/EUfnnS7yZbcbouBa7Tmtp)
2. [ChatGPT](https://chatgpt.com/share/68e50e06-5774-8006-a565-f8aa6f3dbeba)
3. [ChatGPT](https://chatgpt.com/share/6931500c-e5ec-8005-9742-873e7c2a8957)
