# A brief note on SIMD
This is a brief technical note on SIMD architecture based on what I learned from the **Computer Architecture** and **Computer Organization and Design** books. It also serves as the longer version of my previous note.

While CPUs and GPUs both have SIMD implementations, their mechanisms actually differ in plenty of ways. For the CPU itself, there are multimedia extensions and vector architectures.
## 1. CPU
### Multimedia Extension, x86 with SIMD
The focus of multimedia extensions is having multiple execution units and running them simultaneously to exploit *Data-Level Parallelism*/DLP.
>The original motivation behind SIMD was to amortize the cost of the control unit over dozens of execution units. Another advantage is the reduced instruction bandwidth and space—SIMD needs only one copy of the code that is being simultaneously executed, ...
>-- **Computer Organization and Design** p.500

-- Scalar (left) vs SIMD (right), taken from https://www.c-sharpcorner.com/
x86 has the *Multimedia Extension* (MMX), *Streaming SIMD Extensions* (SSE) and now *Advanced Vector Extensions* (AVX). Despite the naming, AVX is not a vector architecture; be careful about that.
With these extensions, it is possible to issue just one instruction and have a bunch of data operated on simultaneously, within the capabilities of the hardware of course.
>... single instruction multiple data (SIMD) is potentially more energy-efficient since a single instruction can launch many data operations.
>-- **Computer Architecture** p. 282
A CPU with AVX2 has 256-bit registers to accommodate simultaneous execution on four 64-bit floating-point numbers or eight 32-bit floating-point numbers; a CPU with AVX-512 has registers twice as wide and supports even more data types. The wide registers are useful to hold multiple data elements before they are sent to the execution units.
Apparently this SIMD architecture is also similar to Cerebras's WSE-3 core.
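As a rough illustration of those wide registers, here is a minimal C++ sketch using AVX2 intrinsics (my own example; it assumes a compiler with AVX2 enabled, e.g. `g++ -mavx2`): a single `_mm256_add_ps` adds eight 32-bit floats with one instruction.
```cpp
#include <immintrin.h>  // AVX2 intrinsics
#include <cstdio>

int main() {
    // Two sets of eight 32-bit floats, each fitting in one 256-bit YMM register.
    alignas(32) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(32) float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    alignas(32) float c[8];

    __m256 va = _mm256_load_ps(a);      // load 8 floats into a 256-bit register
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  // one instruction, eight additions
    _mm256_store_ps(c, vc);

    for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);  // 11 22 33 ... 88
    printf("\n");
    return 0;
}
```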

### Vector Architecture
Vector instructions are at least 30 years older than multimedia extensions and GPUs, yet they were not a common thing until recently; part of the recent adoption is due to a sense of urgency.
>... but they (vector architecture) were considered too expensive for microprocessors until recently.
>-- **Computer Architecture** p. 282
The problem with vector architectures is that they are complex, put heavy demands on memory, and are expensive to produce.
>Part of that expense was in transistors, and part was in the cost of sufficient dynamic random access memory (DRAM) bandwidth,...
>-- **Computer Architecture** p. 282
These problems can be managed with **pipelined execution units**, the key feature of vector architectures, together with the memory hierarchy.
Pipelining is a technique to trade off area against speed/timing, assuming the operation can be broken down into steps/stages (in general, fewer transistors = less area).
Click below for example of pipeline technique (optional):
:::spoiler
**Pipelining for dummies**
Machine X is a combined washing + drying machine that takes 1 unit of area. It finishes a load in 40 minutes.
The steps in Machine X can be broken down into Machine Y and Machine Z:
Machine Y is a washing machine that takes 0.7 units of area. It finishes in 20 minutes.
Machine Z is a drying machine that takes 0.7 units of area. It finishes in 20 minutes.
Four people, A, B, C, and D, want to do laundry.
Using one Machine X (1 basic hardware unit), it will take 160 minutes to finish and take 1 unit of area.
Using two Machine X (2 parallel hardware units/SIMD), it will take 80 minutes to finish and take 2 units of area.
Using Machine Y and Machine Z (pipelining/vector), it will take 100 minutes to finish and take 1.4 units of area.
How?
Using a combination of Machine Y and Z, the people take turns.
|Time (hr:min)| Condition |
|-|-|
|00 : 00 |A starts washing with Machine Y|
|00 : 20 |A finishes washing, A puts their clothes in Machine Z, B starts washing with Machine Y|
|00 : 40 |A finishes drying, B puts their clothes in Machine Z, C starts washing with Machine Y|
|01 : 00 |B finishes drying, C puts their clothes in Machine Z, D starts washing with Machine Y|
|01 : 20 |C finishes drying, D puts their clothes in Machine Z|
|01 : 40 |D finishes drying|
Machine Y and Z represent an operation broken down into stages. This is not as fast as 2 Machine X (2 parallel hardware units) but still faster than 1 basic hardware unit, and it does not take too much space/area. Still don't understand? No worries, Google is best fren.
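In general, with s pipeline stages of t minutes each and n loads, the pipelined total is (s + n - 1) × t: here (2 + 4 - 1) × 20 = 100 minutes, versus n × (s × t) = 4 × 40 = 160 minutes for the single non-pipelined machine.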

-- Illustration on how 4-bit addition can be broken down into 4 stages, taken from [this paper](https://ieeexplore.ieee.org/document/6802427)
:::
Just like multimedia extensions, a vector architecture can increase the number of execution units to increase peak throughput; this is called adding more vector lanes. It adds modularity and thus more design options.
>Adding multiple lanes is a popular technique to improve vector performance as it requires little increase in control complexity and does not require changes to existing machine code.
> -- **Computer Architecture** p. 293

-- Adding more execution unit for even faster execution, **Computer Architecture** p. 294
Below are some of the key mechanisms in vector architecture that I will mention but will not explain because I'm lazy (Google is a friend; a small strip-mining sketch follows the list):
1. Chaining
2. Strip Mining
3. Vector-mask control (handles IF statements inside loops)
4. Gather-scatter
5. Stride and non-unit stride
6. Memory banks to supply memory bandwidth
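To give at least one of these some shape, here is a minimal strip-mining sketch in plain C++ (no real vector ISA is used; `MVL` and the function name are made up for illustration): a loop over n elements is broken into chunks of at most MVL, the maximum vector length, and on a real vector machine each chunk would be handled by a few vector instructions.
```cpp
#include <cstddef>
#include <algorithm>

constexpr std::size_t MVL = 64;  // hypothetical maximum vector length (elements)

// y[i] += a * x[i] for n elements, processed in vector-length-sized strips.
void daxpy_strip_mined(double a, const double* x, double* y, std::size_t n) {
    for (std::size_t start = 0; start < n; start += MVL) {
        std::size_t vl = std::min(MVL, n - start);  // the last strip may be shorter
        // On a real vector machine this inner loop would be a handful of
        // vector instructions operating on vl elements at once.
        for (std::size_t i = start; i < start + vl; ++i) {
            y[i] += a * x[i];
        }
    }
}
```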
:::info
Pipelining increases throughput, but it does not reduce the latency/execution time of a single operation (and usually increases it slightly).
:::
### Multithread
Unlike adding more cores, more threads does not mean more execution units; instead, multiple threads share the execution units in an overlapping fashion so the hardware resources are utilized efficiently.
Variations of multi-threading, more details in *Computer Organization and Design* p. 506:
1. Fine-grained multithreading
2. Coarse-grained multithreading
3. Simultaneous multithreading (SMT)
### Multicore
More cores = more execution units. Multicore processors have an interconnect that lets each core communicate with the other cores and with memory.
>As processors operating in parallel will normally share data, they also need to coordinate when operating on shared data; otherwise, one processor could start working on data before another is finished with it.
>-- **Computer Organization and design** p. 510
The typical design is a *Shared Memory multi-Processor* (SMP), which has a single physical address space across all processors. This comes in two styles:
1. Uniform Memory Access/UMA: the latency to memory does not depend on which core asks for it.
2. Non-Uniform Memory Access/NUMA: some cores' memory accesses are much faster than others'.
This matters because multicore processors have to coordinate when doing things in parallel. This coordination is called **synchronization**.
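As a small illustration of why synchronization matters, here is a C++ sketch (my own example, not from the books): two threads increment a shared counter, and without an atomic or a lock one core can start working on the value before the other is finished with it, so updates get lost.
```cpp
#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> safe_counter{0};  // synchronized: increments never get lost
int unsafe_counter = 0;            // unsynchronized: a data race, results vary

void work() {
    for (int i = 0; i < 100000; ++i) {
        safe_counter.fetch_add(1);  // atomic read-modify-write
        unsafe_counter += 1;        // two cores may read the same stale value
    }
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    printf("atomic:   %d\n", safe_counter.load());  // always 200000
    printf("unsynced: %d\n", unsafe_counter);       // often less than 200000
    return 0;
}
```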
## 2. GPU
### Floating-Point Processing Unit
At first the GPU was designed as a hardware accelerator for graphical stuff. GPUs, through their corresponding drivers, are designed to implement OpenGL, DirectX and other well-defined graphics *application programming interfaces*/APIs and to do graphics processing such as geometry, vertex and pixel processing.
:::info
Graphics APIs are an abstraction layer that bridges programmers and computer architects. Programmers can focus on making programs based on the available functions in the APIs, while hardware designers can focus on designing hardware so that the functions in the API can be executed very fast, at least compared to a CPU.
:::

-- Basic graphics algorithm, **Computer Organization and Design** p. B-10

-- Basic GPU core, **Computer Organization and Design** p. B-10
Historically, computing using the GPU was not a thing until some random people got h0rny looking at the number of floating-point execution units available in GPUs and wanted to use them to speed up computations way faster than CPUs could. The massive number of SIMD lanes in the GPU is the reason people use it to exploit DLP.
>The large amount of floating-point processing power in the GPU processor array is very attractive for solving nongraphics problems.
>-- **Computer Organization and Design** p. B-5

-- Basic Nvidia GPU architecture, **Computer Organization and Design** p. B-11

-- Nvidia Pascal architecture, **Computer Architecture** p. 330
>As GPUs have evolved superior floating-point performance and very high streaming memory bandwidth for real-time graphics, they have attracted highly parallel applications beyond traditional graphics.
>-- **Computer Organization and Design** p. B-17
At first, *General Purpose computation on GPU*/GPGPU was a thing. Ever heard of that term? The difference from today's GPU computing is that GPGPU involved programming the GPU through the graphics API and graphics pipeline to perform nongraphics tasks.
>access to this power was available only by couching an application as a graphics-rendering algorithm, but this GPGPU approach was often awkward and limiting.
>-- **Computer Organization and Design** p. B-17
A short example: calling a function from, let's say, DirectX or OpenGL to run a bunch of data through a computation, maybe a physics simulation or data processing for a mathematical model. This works if the API has the operations the programmer wants to do; if not, then the GPU can't be used to accelerate those tasks.
Then Nvidia released the *Compute Unified Device Architecture*, famously known as CUDA: a scalable parallel programming model and software platform for Nvidia GPUs which allows programmers to bypass the graphics APIs and graphics interfaces of the GPU and simply program in C or C++.
(Honorable mention to OpenCL and its successor SYCL, and HIP, may one of you exceed CUDA one day)
With CUDA, it became possible to use the GPU as both a graphics processor and a computing processor at the same time. So it is not a coincidence that GPUs are a popular choice: they are widely available and a lot of people already own them, which means they are easy to adopt.
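To make that concrete, here is a minimal CUDA C++ sketch of the classic vector add (my own example, compiled with nvcc): the host copies data to device memory, a single kernel launch runs the same code across many threads, and the result is copied back.
```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Each thread handles one element: plain C++ marked __global__.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);

    // Copy inputs from host RAM to device VRAM (see the bandwidth discussion later).
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // One launch, n threads grouped into blocks of 256.
    vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
```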
#### Special Function Unit (SFU)
GPUs also have SFUs, as seen in the architecture diagrams shown before. An SFU is a hardware unit that **computes special functions and interpolates planar attributes**, or at least that's what the book says.
In computing at least, it is used to accelerate operations such as sine, cosine, exponentials, logarithms, reciprocal, reciprocal square root, etc. These functions are approximated in hardware to balance speed, efficiency and accuracy, usually giving around 22–24 bits of precision in the result.

While they are usually implemented with approximation algorithms, CUDA provides both a **slower but more accurate** version and a **faster but less accurate** version.
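For instance, in CUDA device code `sinf()` is the more accurate version while the `__sinf()` intrinsic (or compiling with `-use_fast_math`) takes the faster, hardware-approximated path. A tiny device-side fragment (my own sketch, launched like any other kernel):
```cpp
// Device-side sketch: accurate libm-style call vs. fast hardware-backed intrinsic.
__global__ void sfu_demo(const float* in, float* accurate, float* fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(in[i]);   // slower, more accurate
        fast[i]     = __sinf(in[i]); // faster, approximated in hardware
    }
}
```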

-- List of functions and approximations, **Computer Organization and Design** p. B-43
### How Massive is too massive?
Traditional GPUs rely on large-scale multithreading to hide long memory-access latency since their working sets are huge (hundreds of MBs). So rather than having a huge cache, GPUs prefer having more threads and more processing power.

-- Basic GPU memory configuration, **Computer Organization and Design** p. B-22
This means the focus of GPU memory leans toward bandwidth rather than latency. Hardware designers in general have to pick between high bandwidth and short latency, and for GPUs the pick is high bandwidth.
>...GPUs specifically do not aim to minimize latency to the memory system. High throughput (utilization efficiency) and short latency are fundamentally in conflict.
>-- **Computer Organization and Design** p. B-37
There are workarounds such as adding caches, coalescing memory accesses, compression, and *Direct Memory Access*/DMA, but they only help up to a certain degree. This is worsened by the fact that the gap between processing-unit and memory performance is still wide.

-- Gap between processor and memory improvement, **Computer Architecture** p. 80
For GPUs, adding more threads means a higher bandwidth requirement. Building the most powerful GPU means adding threads until adding more memory bandwidth becomes too expensive and just does not make economic sense to produce.
GPUs also need a fast interconnect, as a lot of data has to be copied from the CPU's RAM to the GPU's VRAM before the GPU can do the calculation.
>...CUDA programs must copy data and results between host memory and device memory.
>-- **Computer Organization and Design** p. B-24
## 3. Systolic Architecture

-- Basic Principle of Systolic Architecture, taken from "Why Systolic Architectures?"
Systolic architecture is an architecture proposed in 1982, while the implementation itself goes back to 1978. It is composed of interconnected cells/*Processing Elements*/PEs, each capable of performing some simple operation.
>By replacing a single processing element with an array of PEs, or cells in the terminology of this article, a higher computation throughput can be achieved without increasing memory bandwidth.
>-- "Why Systolic Architectures?"
This is my understanding of systolic architecture: when data needs to be processed multiple times, instead of going back and forth between a PE and memory, the data is sent directly to another PE.
>This is possible for a wide class of compute-bound computations where multiple operations are performed on each data item in a repetitive manner.
>-- "Why Systolic Architectures?"
The main advantage is that it reduces memory accesses, thus reducing the memory bandwidth requirement, as can be seen from the illustration above.
>Being able to use each input data item a number of times (and thus achieving high computation throughput with only modest memory bandwidth) is just one of the many advantages of the systolic approach. Other advantages, such as modular expansibility, simple and regular data and control flows, use of simple and uniform cells, elimination of global broadcasting, and fan-in and (possibly) fast response time...
>-- "Why Systolic Architectures?"
Since high memory bandwidth and low memory latency are two things that are in conflict with each other, the architect could instead choose lower-latency memory, although most will choose to maintain high bandwidth and add more computational power.
### Systolic Array

-- Systolic array diagram, taken from paper on [HAICA](https://doi.org/10.1007/978-3-031-22677-9_13 "paper's link")

-- Systolic array data flow, **Computer Architecture** p. 562
In a systolic array, data goes in every cycle instead of all at once (pipelined). Every cycle the data travels according to the data-flow diagram above, and each cycle a partial result travels as an input to another cell/arithmetic unit.
Since data is reused as it flows, a systolic array reduces memory accesses and is very efficient. It may take several cycles depending on the number of steps, but it is generally way faster than a common SIMD accelerator at doing matrix multiplication.
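To make the data flow concrete, below is a small C++ simulation of an output-stationary systolic array (a sketch of the general idea, not any particular chip; output stationary is one of the dataflows listed right after this sketch). A streams in from the left and B from the top with a skew; every PE multiplies the operands passing through it, accumulates locally, and forwards the operands to its right and bottom neighbors each cycle.
```cpp
#include <cstdio>

// Output-stationary systolic array computing C = A (M x K) * B (K x N).
// Each PE(i, j) keeps its own accumulator; operands flow right (A) and down (B).
int main() {
    const int M = 2, K = 3, N = 2;
    double A[M][K] = {{1, 2, 3}, {4, 5, 6}};
    double B[K][N] = {{7, 8}, {9, 10}, {11, 12}};

    double acc[M][N] = {};                      // stationary partial sums
    double a_reg[M][N] = {}, b_reg[M][N] = {};  // operand registers per PE

    // Enough cycles for the last operands to reach PE(M-1, N-1).
    const int cycles = M + N + K - 2;
    for (int t = 0; t < cycles; ++t) {
        // Skewed feeding: row i of A and column j of B enter with a delay.
        double a_feed[M], b_feed[N];
        for (int i = 0; i < M; ++i)
            a_feed[i] = (t - i >= 0 && t - i < K) ? A[i][t - i] : 0.0;
        for (int j = 0; j < N; ++j)
            b_feed[j] = (t - j >= 0 && t - j < K) ? B[t - j][j] : 0.0;

        // Update every PE from the *previous* cycle's registers.
        double a_old[M][N], b_old[M][N];
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j) { a_old[i][j] = a_reg[i][j]; b_old[i][j] = b_reg[i][j]; }

        for (int i = 0; i < M; ++i) {
            for (int j = 0; j < N; ++j) {
                double a_in = (j == 0) ? a_feed[i] : a_old[i][j - 1];  // from the left
                double b_in = (i == 0) ? b_feed[j] : b_old[i - 1][j];  // from above
                acc[i][j] += a_in * b_in;  // multiply-accumulate in place
                a_reg[i][j] = a_in;        // forward operands next cycle
                b_reg[i][j] = b_in;
            }
        }
    }

    // Prints 58 64 / 139 154, the same result as an ordinary matrix multiply.
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) printf("%6.1f ", acc[i][j]);
        printf("\n");
    }
    return 0;
}
```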
Below are examples of optimizations to reduce data movements:
1. Weight stationary
2. Output stationary
3. Input stationary
Note that a systolic array is good at matrix multiplication, so other operations have to be tweaked; for example, multiplying with a vector might require the use of an identity matrix (this is my opinion, please let me know if I'm wrong).
### An attempt to explain tensor core
:::info
Some say that accelerators like the Tensor Core are based on the systolic array architecture, like [this paper](https://arxiv.org/abs/2502.16851v2) does, although I checked the source it cited and found no mention of the Tensor Core being a systolic array. So take it with a grain of salt.
:::

-- Nvidia Volta tensor core diagram, taken from [developer.nvidia.com](https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/)
The tensor core, at least in Volta, is divided into 2 stages: product and sum. It is possible that the multiplication part uses a kind of systolic array arrangement while the accumulator part uses a bunch of SIMD units.
Will update this part if I find more information.
### An attempt to explain Intel AMX/XMX

-- Diagram for Intel XMX
XMX might be a different thing from a full systolic array. It is probably still based on systolic architecture, as it is quite similar to a paper Intel published before.

-- Diagram of Intel *Dot Product Accumulate Systolic*/DPAS
DPAS is an element-wise multiply-add-and-accumulate operation over multiple elements in a systolic pipeline with low-precision (<= 8 bits) inputs.
From my understanding, each PE consists of SIMD multiplication and addition execution units, which are then chained in a 1D systolic arrangement.
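Semantically, the description above boils down to something like the following C++ reference sketch (purely illustrative, not Intel's implementation): multiply pairs of low-precision inputs, sum them, and add the result into a wider accumulator, with such steps chained in a systolic pipeline.
```cpp
#include <cstdint>

// Reference semantics of one dot-product-accumulate step: int8 inputs,
// int32 accumulator, `depth` element pairs per operation.
int32_t dpas_step(const int8_t* a, const int8_t* b, int depth, int32_t acc) {
    for (int k = 0; k < depth; ++k)
        acc += static_cast<int32_t>(a[k]) * static_cast<int32_t>(b[k]);
    return acc;  // in a systolic pipeline, this flows into the next stage
}
```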
Will update this part if I find more information.
## Sources
Hennessy, J. L., & Patterson, D. A. (2019). Computer architecture: A quantitative approach (6th ed.). Morgan Kaufmann.
Patterson, D. A., & Hennessy, J. L. (2018). Computer organization and design (RISC-V ed.). Morgan Kaufmann.
Kung, H. T. (1982). Why systolic architectures? Computer, 15(1), 37-46. https://doi.org/10.1109/MC.1982.1653825