# Bunch of Accelerators Card/BoAC, An Over-Simplified Retrospective

:::info
:bulb: This note will mostly talk about the rise of accelerated computing and a bit about GPU acceleration amidst the AI boom.
:::

I was watching this year's (2025) GPU launch events when I noticed how GPUs these days are full of accelerators, are they not? Take, for example, RDNA4 from Radeon and Blackwell from Nvidia.

![RDNA4 overview](https://hackmd.io/_uploads/S1UFVudwgl.jpg =80%x)
-- AMD Radeon RDNA4 slide

![blackwell](https://hackmd.io/_uploads/SkskBOdPge.png =80%x)
-- Nvidia Blackwell slide

Of course, the most important part is the part that defines the hardware as a *Graphics Processing Unit*. A car is called a car because it functions as a vehicle; other functions are there either to improve *quality of life*/*user experience* or as pure marketing gimmicks. But apart from the parts used to process graphics, there are components that do not serve as a *graphics processor*, for example:

:black_square_button: AI Accelerator/Tensor Cores
:black_square_button: Ray Tracing Accelerator/RT Cores
:black_square_button: Video Encoder/Media Engine
:black_square_button: Video Decoder/Media Engine

:::success
:speaking_head_in_silhouette: I put the video encoder and decoder separately because some GPUs only come with one and not the other.
:::

![state of 2025 hardware](https://hackmd.io/_uploads/BkTkZ4zPgg.png =40%x)
-- GPUs in 2025 hardly come without an AI accelerator

Some people call these accelerators by their technical term, *Domain-Specific Accelerator* (DSA): hardware designed to accelerate a task in a certain domain/area. I will use the terms DSA and accelerator interchangeably in this note :pray:.

Well, in my opinion, GPUs themselves qualify to be called accelerators, mainly because they are made to accelerate graphics processing. A CPU can perform rendering, but not as fast as the latest graphics card.

>GPUs are accelerators that supplement a CPU, so they do not need to be able to perform all the tasks of a CPU.
>-- **Computer Organization and Design** p. 514

I then asked myself: does the increasing number of accelerators mean that a GPU is just a bunch of accelerators in the end? This question sent me down the rabbit hole (a shallow one, that is) of why computer architecture today is the way it is, how it is done, and what lies behind the latest breakthroughs. Now it is time for me to tell the tale, starting from chapter 1 (feel free to skip to the summary for the instant answer).

## 1. Why use GPU for computation?

Would you look at that, there is a book that can answer that:

> For a few hundred dollars, anyone can buy a GPU today with hundreds of parallel floating-point units, which makes high-performance computing more accessible.
> -- **Computer Organization and Design** p. 514

People wanted to do computation cheaply, and GPUs enable that. *the end*

Jokes aside, here's the real chapter 1. (hope that's a good icebreaker :kiss:)

## 1. AI Boom, The Demand for Faster Hardware

GPUs: "Yup, that's me. You're probably wondering how I ended up in this situation. Dun, dun, dun..."

### AI boom and demand for computation

>Artificial intelligence (AI) has made a dramatic comeback since the turn of the century. Instead of building artificial intelligence as a large set of logical rules, the focus switched to machine learning from example data as the path to artificial intelligence. The amount of data needed to learn was much greater than thought.
>-- **Computer Architecture** p. 546

>We also underestimated the amount of computation needed to learn from the massive data, ...
>-- **Computer Architecture** p. 546

The boom in AI in recent years has been tremendous. The hype can be seen in the growth of the number of papers, as shown in the graph below:

![image](https://hackmd.io/_uploads/r1tVc5nUxx.png =60%x)
-- Number of papers on ML and AI, taken from [arxiv](https://arxiv.org/pdf/2210.00881 'link for the paper')

And it seems it is not only the quantity, but also the quality and complexity that are increasing. Here's data from Nature, one of the most prestigious journals in the world.

![parameters](https://hackmd.io/_uploads/BkBun9nUll.png =80%x)
-- Number of parameters, i.e., weights, in recent neural networks. Taken from [nature](https://www.nature.com/articles/s41598-021-82543-3)

As mentioned by [nature](https://www.nature.com/articles/s41598-021-82543-3), larger models tend to require more computation power. The increase in model size and complexity is linked to the increasingly complex problems AI is expected to solve.

>... efforts have been made to reduce DNN sizes, but there remains an exponential growth in model sizes to solve increasingly complex problems with higher accuracy.
>-- [nature](https://www.nature.com/articles/s41598-021-82543-3)

So in general, there is this hype around AI that has increased the demand for computation, both in quantity and in complexity. The simplest solution is basically to have more computation power, right?

:::success
The answer might not be as simple as buying more hardware, like a certain someone with a cool-looking leather jacket once said: "The more you buy, the more you save".
:::

### AI is Math (optional)

Click below for explanations of terms that may be hard to understand.

:::spoiler
![image](https://hackmd.io/_uploads/Hkq57hVDxg.png)
-- Vocabulary list, **Computer Architecture** p. 545
:::

So, I will take the DNN as the first example since it is the most popular one. Below are diagrams of typical types of neural networks; let's see what similarities they have.

![image](https://hackmd.io/_uploads/SkZZuWHwge.png =60%x)
-- The most basic DNN diagram, **Computer Architecture** p. 549

![image](https://hackmd.io/_uploads/BJV6YZrPel.png =60%x)
-- The most basic CNN diagram, **Computer Architecture** p. 551

![image](https://hackmd.io/_uploads/r1xMcWSwlx.png =60%x)
-- The most basic LSTM diagram (the most popular RNN), **Computer Architecture** p. 555

Oh boy, there are plenty of matrix operations involved.

>Even this quick overview suggests that DSAs for DNNs will need to perform at least these matrix-oriented operations well: vector-matrix multiply, matrix-matrix multiply, and stencil computations.
>-- **Computer Architecture** p. 556

Apart from these "outdated" AI algorithms, I will show a bit of the transformer, which is the heart of the latest AI (LLMs and LRMs).

![image](https://hackmd.io/_uploads/r1RaXzHwxe.png =50%x)
-- Diagram of Transformers, taken from [arXiv:1706.03762](https://arxiv.org/abs/1706.03762 "Transformers paper link")

A transformer consists of an encoder (left) and a decoder (right). Some models use only the encoder, some only the decoder. The point is, they consist of N stacks (the paper uses 6 stacks) of encoder and/or decoder layers.

![image](https://hackmd.io/_uploads/BJnLEfBwlg.png =30%x)
-- Diagram of Multi-Head Attention, taken from [arXiv:1706.03762](https://arxiv.org/abs/1706.03762 "Transformers paper link")

Each stack has at least one Multi-Head Attention block, which consists of *h* attention layers running in parallel.

![image](https://hackmd.io/_uploads/Sy05VzBwex.png =30%x)
-- Diagram of the attention layer, taken from [arXiv:1706.03762](https://arxiv.org/abs/1706.03762 "Transformers paper link")

There is more than one matrix multiplication (MatMul) operation in each attention layer, and matrix multiplication is quite a demanding computation. (People familiar with GEMM know how intensive this workload is.)

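To make the "it is mostly matrix math" point concrete, here is a minimal sketch of one attention head written as plain C++ loops. It is purely illustrative (single head, no masking, no batching, no learned projections); the sizes and names (`seq_len`, `d_k`) are my own placeholders, not something taken from the paper.

```cpp
// Minimal single-head scaled dot-product attention as plain loops.
// Matrices are stored row-major; Q, K, V are seq_len x d_k.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Mat = std::vector<float>;  // row-major storage

Mat attention(const Mat& Q, const Mat& K, const Mat& V, int seq_len, int d_k) {
    Mat scores(seq_len * seq_len), out(seq_len * d_k, 0.0f);

    // MatMul #1: Q (seq_len x d_k) times K^T (d_k x seq_len), scaled by 1/sqrt(d_k)
    for (int i = 0; i < seq_len; ++i)
        for (int j = 0; j < seq_len; ++j) {
            float dot = 0.0f;
            for (int k = 0; k < d_k; ++k) dot += Q[i * d_k + k] * K[j * d_k + k];
            scores[i * seq_len + j] = dot / std::sqrt((float)d_k);
        }

    // Row-wise softmax (the only non-matmul step in this layer)
    for (int i = 0; i < seq_len; ++i) {
        float maxv = scores[i * seq_len], sum = 0.0f;
        for (int j = 1; j < seq_len; ++j) maxv = std::max(maxv, scores[i * seq_len + j]);
        for (int j = 0; j < seq_len; ++j) {
            scores[i * seq_len + j] = std::exp(scores[i * seq_len + j] - maxv);
            sum += scores[i * seq_len + j];
        }
        for (int j = 0; j < seq_len; ++j) scores[i * seq_len + j] /= sum;
    }

    // MatMul #2: softmax(scores) (seq_len x seq_len) times V (seq_len x d_k)
    for (int i = 0; i < seq_len; ++i)
        for (int j = 0; j < d_k; ++j)
            for (int k = 0; k < seq_len; ++k)
                out[i * d_k + j] += scores[i * seq_len + k] * V[k * d_k + j];

    return out;
}

int main() {
    const int seq_len = 4, d_k = 8;  // tiny made-up sizes
    Mat Q(seq_len * d_k, 0.1f), K(seq_len * d_k, 0.2f), V(seq_len * d_k, 0.3f);
    Mat out = attention(Q, K, V, seq_len, d_k);
    std::printf("out[0] = %f\n", out[0]);
}
```

A Multi-Head Attention block simply runs *h* copies of this with different learned projections and concatenates the results, so the work is dominated by many independent, identically shaped matrix multiplications: exactly the kind of workload that parallel hardware loves.
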
This means those AI girlfriends/boyfriends, AI waifus/husbandos, are actually just computers doing what they do best: math.

### Chapter 1 : Conclusion

![lmao](https://hackmd.io/_uploads/BkO6_ZHDlx.png =50%x)
-- "OOF" -- author

I can at least say that, in theory, these AIs are just a bunch of math operations (mostly matrix operations). The term for this, I believe, is **Data Crunching**, or as some call it, **Number Crunching**, and that is what running AI is. (Of course, each algorithm and model varies a bit; some are more memory-intensive than others, but the general idea is the same.)

Thus the surge of high-performance microprocessors and chips in recent years can, at least in part, be attributed to the rise of the AI hype. In economics, "supply keeping up with demand" is a normal phenomenon. What's abnormal is that this surge in supply comes with a lot of cringe attached to a hyped market that keeps demand high. For example, a trend where any hardware that can do ~~AI~~ computation is now called "AI Hardware", "AI processor", "AI PC", "AI Chips", AI This, AI That, bruh.

![Pepe_No_AI](https://hackmd.io/_uploads/BJs5yWaUlx.png =5%x)

Tired of reading the word "AI" yet? :rolling_on_the_floor_laughing:

Anyway, since this is a technical note, the next chapter is about how hardware deals with data-crunching problems and how that leads to using GPUs.

## 2. Massive Parallel Hardware

### Allegory of the Farm

Imagine a cotton farm with 4 tiles and 1 worker :farmer:.

|:farmer:|Tile|
|--|--|
|Tile|Tile|

It takes 2 hours of work for 1 worker to finish harvesting a tile. To speed up the process, there are 2 choices:

- :medal: Make the worker work faster :medal:
The worker can only be pushed to work 2x faster; any more than that and he/she would die of exhaustion. This means the farm will be harvested in 4 hours.
- :medal: Hire more workers :medal:
Having 4 workers means 4x the speed. This means the farm will be harvested in 2 hours.

|:farmer:|:farmer:|
|--|--|
|:farmer:|:farmer:|

:::info
I know this example might be very simple but bear with it.
:::

There is a limit to pushing the worker, so the fastest way to finish harvesting the farm is to hire 4 workers. These 4 workers are working on different tiles and are independent of each other, thus they work in **parallel**. This represents a real-life situation:

- The worker represents a single functional unit/execution unit, e.g., an adder/multiplier/divider
- The tile represents data that needs processing

There is a limit to squeezing the performance of a single execution unit (search for **the end of Dennard scaling** and **the slowdown of Moore's Law** for more context). At the time of writing this article, the fastest and most energy-efficient way is to add several efficient circuits and run them in parallel, or in other words, take advantage of **parallelism**.

>Replacing large inefficient processors with many smaller, efficient processors can deliver better performance per joule both in the large and in the small, if software can efficiently use them.
>-- **Computer Organization and Design** p. 492

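Translated into code, the farm allegory is just a loop whose iterations do not depend on each other. A minimal sketch (the function and array names are made up; the OpenMP pragma is one common way to ask for the "hire more workers" option, and it is simply ignored if OpenMP is not enabled):

```cpp
#include <vector>

// Each element is a "tile": the same operation is applied to every element,
// and no iteration depends on another, so the work can be split across
// SIMD lanes, CPU threads, or GPU threads.
void harvest(const std::vector<float>& a, const std::vector<float>& b,
             std::vector<float>& out) {
    #pragma omp parallel for  // "hire more workers": run iterations in parallel
    for (long i = 0; i < static_cast<long>(out.size()); ++i)
        out[i] = 2.0f * a[i] + b[i];  // same operation, different data
}
```
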
This type of parallelism, where a single command/instruction is given to multiple data items, is called *Data-Level Parallelism*/**DLP**, and it is what people usually refer to as being/running "parallel". Another type of parallelism is called *Task-Level Parallelism*/TLP, which I will not discuss in this note. More details in *Computer Architecture* p. 10 (available below, optional):

:::spoiler
Taken from *Computer Architecture* p. 10:

**Classes of Parallelism and Parallel Architectures**

Parallelism at multiple levels is now the driving force of computer design across all four classes of computers, with energy and cost being the primary constraints. There are basically two kinds of parallelism in applications:
1. *Data-level parallelism* (DLP) arises because there are many data items that can be operated on at the same time.
2. *Task-level parallelism* (TLP) arises because tasks of work are created that can operate independently and largely in parallel.

Computer hardware in turn can exploit these two kinds of application parallelism in four major ways:
1. *Instruction-level parallelism* exploits data-level parallelism at modest levels with compiler help using ideas like pipelining and at medium levels using ideas like speculative execution.
2. *Vector architectures, graphic processor units (GPUs), and multimedia instruction sets* exploit data-level parallelism by applying a single instruction to a collection of data in parallel.
3. *Thread-level parallelism* exploits either data-level parallelism or task-level parallelism in a tightly coupled hardware model that allows for interaction between parallel threads.
4. *Request-level parallelism* exploits parallelism among largely decoupled tasks specified by the programmer or the operating system.

For this note, the focus will be *Vector architectures, graphic processor units (GPUs), and multimedia instruction sets*.
:::

### Exploiting Data-Level Parallelism/DLP (optional)

**Applying a single instruction to a collection of data and executing it in parallel**: that is the point of taking advantage of DLP. This can be done using the concept of *Single Instruction Multiple Data*/SIMD, a category from **Flynn's Taxonomy**.

Click below for a detailed explanation of Flynn's Taxonomy (optional):

:::spoiler
In 1966, someone named Flynn studied parallel hardware and published a categorization of parallel hardware based on the number of instruction streams vs. data streams:

![image](https://hackmd.io/_uploads/SJAsHpiUll.png)
-- **Computer Organization and Design** p. 499

Taken from **Computer Architecture** p. 11:

When Flynn (1966) studied the parallel computing efforts in the 1960s, he found a simple classification whose abbreviations we still use today. They target data-level parallelism and task-level parallelism. He looked at the parallelism in the instruction and data streams called for by the instructions at the most constrained component of the multiprocessor and placed all computers in one of four categories:
1. **Single instruction stream, single data stream (SISD)**—This category is the uniprocessor. The programmer thinks of it as the standard sequential computer, but it can exploit ILP. Chapter 3 covers SISD architectures that use ILP techniques such as superscalar and speculative execution.
2. **Single instruction stream, multiple data streams (SIMD)**—The same instruction is executed by multiple processors using different data streams.
SIMD computers exploit data-level parallelism by applying the same operations to multiple items of data in parallel. Each processor has its own data memory (hence, the MD of SIMD), but there is a single instruction memory and control processor, which fetches and dispatches instructions. Chapter 4 covers DLP and three different architectures that exploit it: vector architectures, multimedia extensions to standard instruction sets, and GPUs.
3. **Multiple instruction streams, single data stream (MISD)**—No commercial multiprocessor of this type has been built to date, but it rounds out this simple classification.
4. **Multiple instruction streams, multiple data streams (MIMD)**—Each processor fetches its own instructions and operates on its own data, and it targets task-level parallelism. In general, MIMD is more flexible than SIMD and thus more generally applicable, but it is inherently more expensive than SIMD. For example, MIMD computers can also exploit data-level parallelism, although the overhead is likely to be higher than would be seen in an SIMD computer. This overhead means that grain size must be sufficiently large to exploit the parallelism efficiently.

I cut the paragraph on MIMD because it is too long; see the book for more details. However, for this note, this is more than enough.
:::

Exploiting DLP is a trick that is used when dealing with data crunching.

:::warning
This is a long and complicated topic, so I will not be including most of the technical information in this note. Read my note [A brief note on SIMD](https://hackmd.io/@Grim5th/Sy8bA0Swxx) for more technical information.
:::

### Data Crunching in CPUs

To execute on multiple data items using a single instruction, x86 processors usually have **Multimedia Extensions** (MMX, SSE and AVX), while RISC-V and ARM can have either SIMD extensions (mimicking Multimedia Extensions) or a **Vector Architecture**. They are quite different in how they work, but they both **process a bunch of data with just a single instruction**, and they both have large registers to accommodate it.

>SIMD works best when dealing with arrays in for loops. Hence, for parallelism to work in SIMD, there must be a great deal of identically structured data, which is called data-level parallelism.
>-- **Computer Organization and Design** p. 500

#### Brief note on Multimedia Extensions (optional)

![image](https://hackmd.io/_uploads/HyTDVT0cxx.png =60%x)
-- Illustration of Multimedia Extensions, taken from [dev.to](https://dev.to/mstbardia/simd-a-parallel-processing-at-hardware-level-in-c-42p4)

Multimedia Extensions and other SIMD extensions, like the one in Cerebras's WSE-3 (RISC-V), focus on having multiple execution units so that multiple data items can be processed at once. In a way this improves throughput, though it will not improve latency. It also reduces instruction bandwidth and code size, since all it needs is one instruction. For example, AVX512 has 512-bit registers, so it can process 16 FP32 numbers or 32 FP16/BF16 numbers all at once with only a single instruction.

>... reduction in instructions fetched and executed saves energy.
>-- **Computer Organization and Design** p. 502

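As a rough picture of what a multimedia-extension instruction actually does, here is the same kind of element-wise loop written with AVX-512 intrinsics. This is only a sketch: it assumes an AVX-512-capable CPU and that `n` is a multiple of 16, and it skips the leftover-element handling real code would need.

```cpp
#include <immintrin.h>

// Adds two float arrays 16 elements at a time: one instruction, 16 data items.
void add_avx512(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 16) {        // assumes n % 16 == 0
        __m512 va = _mm512_loadu_ps(a + i);  // load 16 FP32 values
        __m512 vb = _mm512_loadu_ps(b + i);
        __m512 vc = _mm512_add_ps(va, vb);   // one SIMD add = 16 scalar adds
        _mm512_storeu_ps(out + i, vc);       // store 16 results
    }
}
```

One `_mm512_add_ps` replaces 16 scalar adds, which is where the instruction-bandwidth and energy savings quoted above come from.
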
#### Brief note on Vector architecture (optional)

In contrast with Multimedia Extensions and what some call **SIMD**, vector architectures focus on using **pipelined execution units**. Pipelining increases throughput but will also increase latency. Then what's the advantage over "SIMD"? Pipelining saves chip area. A single pipelined execution unit might have the same throughput as two non-pipelined execution units while requiring less chip area. Although the focus is on a pipelined execution unit, it is also possible to stack a bunch of these pipelined execution units to achieve higher throughput. (Note that AVX is a multimedia extension and not a vector architecture despite its naming.)

![image](https://hackmd.io/_uploads/ByWzOTA9gg.png =60%x)
-- Stacking vector execution units, **Computer Organization and Design** p. 504

Explanation of pipelining (optional):

:::spoiler
**Pipelining for dummies (me)**

Machine X is a washing + drying machine that takes 1 unit of area. It finishes in 40 minutes.
Steps in Machine X can be broken down into Machine Y and Machine Z:
Machine Y is a washing machine that takes 0.7 unit of area. It finishes in 20 minutes.
Machine Z is a drying machine that takes 0.7 unit of area. It finishes in 20 minutes.

4 people, A, B, C, D, want to do laundry.
Using one Machine X (1 basic hardware), it will take 160 minutes to finish and take 1 unit of area.
Using two Machine X (2 parallel hardware/SIMD), it will take 80 minutes to finish and take 2 units of area.
Using Machine Y and Machine Z (Pipelining/Vector), it will take 100 minutes to finish and take 1.4 units of area.

How? Using a combination of Machine Y and Z, they take turns:

|Time (hr:min)| Condition |
|-|-|
|00 : 00 |A washes their clothes|
|00 : 20 |A finishes washing, A puts their clothes in Machine Z, B starts washing with Machine Y|
|00 : 40 |A finishes drying, B puts their clothes in Machine Z, C starts washing with Machine Y|
|01 : 00 |B finishes drying, C puts their clothes in Machine Z, D starts washing with Machine Y|
|01 : 20 |C finishes drying, D puts their clothes in Machine Z|
|01 : 40 |D finishes drying|

Machines Y and Z represent an operation broken down into stages. Not as fast as 2 Machine X (2 parallel hardware), but still faster than 1 basic machine, and it does not take up too much space/area.

Still don't understand? No worries, Google is best fren.

![pipelining](https://hackmd.io/_uploads/ByoQQvGDxg.jpg)
-- Illustration of how 4-bit addition can be broken down into 4 stages, taken from [this paper](https://ieeexplore.ieee.org/document/6802427)
:::

#### CPU Multithread and Multicore

![image](https://hackmd.io/_uploads/rJoUCuIixl.png =40%x)

CPUs also employ multithreaded and multicore architectures, though not as extensively as GPUs; much of a CPU's area has to be used as cache, as can be seen below, where the cache takes more area than the cores themselves.

![zen3 dieshot compressed](https://hackmd.io/_uploads/r1YqWkIwee.jpg =80%x)
-- Zen3 die shot

More threads = more execution units = more throughput. At least that's how it is theoretically.

### Integrated Accelerators

:::danger
This sub-chapter is an addition based on my own observation and is not mentioned by any of my sources as far as I know. There might be some bias in this sub-chapter, but I will try to stay credible in the information and data that I show.
:::

![image](https://hackmd.io/_uploads/SkluIKLieg.png =60%x)
-- Illustration of an AMD APU from 2013, taken from [tomshardware](https://www.tomshardware.com/news/AMD-HSA-hUMA-APU,22324.html)

Putting an accelerator in the same package/die as the processor is not a new trend, at least in my opinion. I mean, we have had processors with integrated GPUs for years now, assuming that a GPU counts as an accelerator (more on that in chapter 3). So, putting an accelerator in a CPU might not be newsworthy; the kind of accelerator being put in the processor might be.
At least market-hype-wise.

#### Intel AMX

![intel AMX](https://hackmd.io/_uploads/B1DdfcIoll.png =70%x)
-- Slide on Intel AMX, taken from [servethehome](https://www.servethehome.com/intel-golden-cove-performance-x86-core-detailed/)

Before the Neural Processing Unit (NPU) became a trend, Intel put Advanced Matrix Extensions (AMX) in their processors, especially the enterprise products. It is an execution unit used to accelerate matrix operations. The focus is on accelerating math operations, especially General Matrix Multiply (GEMM), the kind of operation that is also used to run AI, as I mentioned in chapter 1. Compared to AVX512, it can do more operations per cycle, albeit only for matrix operations. More about AMX on [Intel's website](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/what-is-intel-amx.html).

#### NPUs

![Intel Meteorlake](https://hackmd.io/_uploads/SyhyjFUjxl.jpg =70%x)
-- Intel Meteorlake, taken from [Chips and Cheese](https://chipsandcheese.com/p/intel-meteor-lakes-npu)

![AMD Strix Point](https://hackmd.io/_uploads/r1WFefDixe.png =70%x)
-- AMD Strix Point, taken from [guru3d](https://www.guru3d.com/story/amd-introduces-ryzen-ai-300-pro-series-with-enhanced-cpu-and-gpu-capabilities/)

The Neural Processing Unit is not a new thing to put in processors. Even back in 2018, Qualcomm's Snapdragon 670 (yes, the one used in smartphones) got a 3 TOPS NPU or something. But lately, x86 processor companies have started putting these so-called AI Accelerators/NPUs on the same chip as the CPU cores and made it look as if it were a new technology; well, it is not. What's new is the trend and the hype, lately fueled by the marketing of Microsoft's (allegedly failed) Copilot with the taglines "AI PC" and "AI Processors".

### Chapter 2 : Conclusion

As mentioned in chapter 1, running AI requires plenty of data crunching. The techniques mentioned above are the general ideas of how CPUs can be designed to speed up that data crunching: by having multiple execution units like SIMD extensions, by having a vector architecture, or even by resorting to putting accelerators directly on the same die as the CPU itself. The core idea is still the same: achieve high throughput to help with processing a bunch of data, i.e., data crunching.

I do realize that there are not a lot of techniques that focus on improving latency. I suppose that in most cases the latency is acceptable, so keeping latency bounded is enough. Things feel fast because the average number of operations completed per unit of time is higher.

Click below for a brief explanation of latency vs throughput (optional):

:::spoiler
>Response time also called execution time. The total time required for the computer to complete a task
>-- **Computer Organization and Design** p. 29

>Throughput also called bandwidth. Another measure of performance, it is the number of tasks completed per unit time
>-- **Computer Organization and Design** p. 30

In other words, **execution time** is the time between the start and completion of a task, and **throughput** is the total amount of work done in a given time.
:::

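A small made-up example of the difference: suppose one execution unit needs 4 ns per operation, and we place 16 of them side by side.

$$
\text{latency} = 4\ \text{ns per operation},\qquad
\text{throughput} = \frac{16\ \text{operations}}{4\ \text{ns}} = 4\ \text{operations per ns}
$$

Each individual operation still takes the full 4 ns, so latency is unchanged, but 16x more results come out per unit of time. That is the sense in which parallel hardware "feels" fast, and it is also the point of the extreme example below.
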
:::danger
1 pregnant woman can deliver a baby in 9 months, but 9 pregnant women are not going to deliver a baby in 1 month, even though on average that is 1 baby per month.
:::

I hope this extreme example helps in a way. Anyway, when even CPU improvements are not enough, it is time to bring in some dedicated hardware, and that is what I will be talking about in the next chapter.

## 3. Dedicated Accelerator Cards

### GPUs

Now for the main topic: the GPU. This will be quite long, so buckle up.

Historically, computing on GPUs was not a thing. That is, until some people got h0rny after looking at the high number of floating-point execution units available in GPUs and decided to use them for computation, gaining higher performance than a CPU could deliver.

![image](https://hackmd.io/_uploads/Syl-y0Doll.png =80%x)
-- Basic graphics algorithm, **Computer Organization and Design** p. B-10

![image](https://hackmd.io/_uploads/BJK17Z_igx.png =80%x)
-- Graphics algorithm mapped into hardware, **Computer Organization and Design** p. B-10

Traditionally, the GPU was added to accelerate the parts of the graphics processing algorithm that take too much time (which I suppose is all of them?). It was designed to implement OpenGL, DirectX and other well-defined graphics APIs, doing graphics work such as geometry, vertex and pixel processing. (Read about GPGPU for more information.)

If a GPU has more floating-point execution units than a CPU, maybe it can do computations faster than a CPU could. Well, indeed it can; the massive number of SIMD lanes in a GPU is the reason people use it for computation, for example running AI. It helps even more that Nvidia has a math library called cuBLAS that provides ready-to-use GPU-accelerated math functions. It has the fewest problems compared to other GPU BLAS libraries and is the most widely adopted by AI frameworks, which might explain why Nvidia has the best compatibility with most AI tools.

![image](https://hackmd.io/_uploads/SyNJQ3Nvlg.png =80%x)
-- Nvidia Pascal architecture, **Computer Architecture** p. 330

GPUs rely on large multithreaded processing units to do the execution. The huge number of threads means a huge number of execution units; thus GPUs sustain a high average number of executions per cycle and use those threads to hide memory latency.

>...between the time of a memory request and the time that data arrive, the GPU executes hundreds or thousands of threads that are independent of that request.
>-- **Computer Organization and Design** p. 514

Unlike a CPU, which has a memory hierarchy to help with memory latency, a GPU's memory (usually called VRAM) tends to focus on bandwidth rather than latency. This means the chip area that would otherwise be used as cache can instead be used to add more threads, hence more performance per area.

>...GPUs specifically do not aim to minimize latency to the memory system. High throughput (utilization efficiency) and short latency are fundamentally in conflict.
>-- **Computer Organization and Design** p. B-37

Newer GPUs usually do come with a memory hierarchy, to reduce the pressure on memory bandwidth or to help with loads that require low latency.

>...they are thought of as either bandwidth filters to reduce demands on GPU Memory or as accelerators for the few variables whose latency cannot be hidden by multithreading.
>-- **Computer Organization and Design** p. 518

GPUs also need a fast interconnect, as a lot of data has to be copied from the CPU's RAM to the GPU's VRAM before the GPU can do the calculation. The way most CUDA tutorials teach it: data in storage is read by the CPU and placed in main memory (RAM), then copied to VRAM for the GPU to access and compute on, and the result is sent back to the CPU (probably into RAM or cache). Hence it is rare to see a system with more VRAM than RAM.

>...CUDA programs must copy data and results between host memory and device memory.
>-- **Computer Organization and Design** p. B-24

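As a concrete (and heavily simplified) sketch of that copy-in, compute, copy-out flow, here is roughly what the host side of a GPU matrix multiply looks like using the CUDA runtime API and cuBLAS. Error checking is omitted and the matrix size is made up; treat it as an outline of the data movement, not production code.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 1024;  // made-up square matrix size
    std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);
    const size_t bytes = n * n * sizeof(float);

    // 1. Allocate VRAM and copy the inputs from host RAM to device memory.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    // 2. Let cuBLAS run the GEMM on the GPU: C = 1.0 * A * B + 0.0 * C.
    //    (cuBLAS assumes column-major storage; for this illustration it does not matter.)
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    // 3. Copy the result back from VRAM to host RAM.
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Steps 1 and 3 are pure data movement over the interconnect (typically PCIe), which is exactly why the interconnect and VRAM bandwidth come up so often in GPU discussions.
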
The rule of thumb is that adding GPU threads has to be matched with more memory bandwidth to avoid a memory bottleneck. The problem with that is the price of High Bandwidth Memory (HBM).

![Per-Module-Cost-Breakdown-RISCV](https://hackmd.io/_uploads/HyU_m2rPle.jpg =70%x)
-- Breakdown of module cost, taken from [semiwiki](https://semiwiki.com/artificial-intelligence/355281-feeding-the-beast-the-real-cost-of-speculative-execution-in-ai-data-centers/)

See the graph above: adding HBM is expensive as hell. So there is a limit on adding more GPU threads, at least budget-wise. Even if the budget were unlimited, there is still the energy budget and the "does the chip have enough perimeter to route enough lanes to memory?" kind of problem.

#### Special Function Unit (SFU)

GPUs also have SFUs, as seen in the architecture diagram shown before. An SFU is a hardware unit that **computes special functions and interpolates planar attributes**; well, that's what the book says. For computing at least, it is used to accelerate operations such as sine, cosine, exponentials, logarithms, reciprocals, reciprocal square roots, etc. These functions are approximated in hardware to balance speed, efficiency and accuracy, typically giving around 22–24 bits of precision in the result.

![quick math](https://hackmd.io/_uploads/BJ9-aHUPgl.png =35%x)

While these are approximation algorithms, CUDA provides both a **slower but more accurate** version and a **faster but less accurate** version (e.g., `sinf()` versus the hardware intrinsic `__sinf()`).

![image](https://hackmd.io/_uploads/SkOwV0wsxl.png =80%x)
-- List of functions and approximations, **Computer Organization and Design** p. B-43

### Domain Specific Accelerator

:::warning
The book **Computer Architecture** chapter 7 talks about *Domain-Specific Architecture*/DSA. That is a bit different from a Domain-Specific Accelerator, which I will just call an accelerator from now on to avoid confusion. When talking about a DSA, the focus is on the architecture and design, while when talking about an accelerator the focus is on the physical hardware unit. I hope this clears up most of the confusion for those who have read the book.
:::

#### Integrated in GPU

![image](https://hackmd.io/_uploads/HyfRKCCqll.png =80%x)
-- RDNA3 AI Accelerator, AMD Launch Event November 2022 on [youtube](https://www.youtube.com/watch?v=hhwd6UgGVk4)

I will quote this from the *Computer Architecture* book, then explain a bit about accelerators.

>Thus the design of GPUs may make more sense when architects ask, given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
>-- **Computer Architecture** p. 310

One answer to that question is to add an accelerator. Accelerators are fast and efficient at the thing(s) they are designed to do. Thus adding more accelerators means that the GPU is going to perform well at more things, as I will quote from the *Computer Architecture* book.

>... domain-specific processors that do only a narrow range of tasks, but they do them extremely well.
>-- **Computer Architecture** p. 541

Considering Amdahl's law, it is okay that accelerators are only good at a few things. They are only used to accelerate the part that is most significant/takes the most time.

> ... only a small, compute-intensive portion of the application needs to run on the DSA in some domains...
>-- **Computer Architecture** p. 544

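Amdahl's law makes that argument concrete. If a fraction $f$ of the runtime is spent in the part the accelerator handles, and the accelerator speeds that part up by a factor $s$, the overall speedup is:

$$
\text{Speedup}_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{s}}
$$

With made-up numbers, say the matrix math is $f = 0.9$ of the runtime and the accelerator makes it $s = 10$ times faster: the overall speedup is $1 / (0.1 + 0.09) \approx 5.3\times$, even though the accelerator does nothing for the remaining 10%. An accelerator that is only good at the dominant kernel is still a big win.
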
Just like GPUs are designed to accelerate graphics processing, there are architectures designed to accelerate the computation mentioned in chapter 1.

>Thus, just as the field switched from uniprocessors to multiprocessors in the past decade out of necessity, desperation is the reason architects are now working on DSAs.
>-- **Computer Architecture** p. 541

Adding these so-called accelerators to the GPU itself only makes sense after seeing the demand for GPU computing. It would be preposterous if an accelerator were added without actually giving a benefit, wouldn't it?

#### Custom Accelerator Cards

Some might think that GPUs are not good enough, or they might not want to be dependent on GPU companies. Whatever the reason, they made a custom chip, a dedicated accelerator. One of the earliest (and most famous) successes is [Google's TPU](https://arxiv.org/abs/1704.04760).

![google TPU](https://hackmd.io/_uploads/rJ22sX_oxl.png =80%x)
-- Illustration of the TPU and the architecture of its Matrix Multiplication Unit, taken from the [TPU paper](https://arxiv.org/abs/1704.04760)

It uses a systolic array to speed up matrix multiplication. It is an example of a design aimed at higher power efficiency (performance/watt) than GPUs for inference, although outperforming GPUs outright is also a welcome bonus.

Another example would be [Tenstorrent](https://tenstorrent.com/en). Its design is aimed at affordability by avoiding the need for HBM, which, as I mentioned before, is very, very expensive.

![tenstorrent diagram](https://hackmd.io/_uploads/Byf8heFoll.png =80%x)

It uses an *On-Chip Network*/OCN, or *Network-on-Chip*/NoC, architecture. It also utilizes a memory hierarchy to further reduce miss penalties and improve data locality.

:pencil2: NoC
:::spoiler
In general, NoC is an architecture that uses a network topology to connect the various components in a chip.
:::

Unlike the TPU, which Google developed mainly for in-house use, Tenstorrent's accelerators are the main commodity the company sells.

There are plenty of other examples out there, such as Huawei Ascend, Meta MTIA, Amazon Trainium/Inferentia, Microsoft Maia, Intel Gaudi, etc. Each has its own purpose: maybe just following the hype, serving as marketing material, or even an unusual reason like being banned by a rival country. There are too many to explain one by one, each with its own quirks, so feel free to explore.

For those who want to design an in-house accelerator but do not want to go through the pain of making their own chip, FPGAs are another alternative.

### FPGA

![Xilinx Alveo U280 FPGA](https://hackmd.io/_uploads/BkAOObKixx.png)
-- Xilinx Alveo U280 FPGA

>The non-recurring engineering (NRE) costs of a custom chip and supporting software are amortized over the number of chips manufactured, so it is unlikely to make economic sense if you need only 1000 chips.
>One way to accommodate smaller volume applications is to use reconfigurable chips such as FPGAs because they have lower NRE than custom chips and because several different applications may be able to reuse the same reconfigurable hardware to amortize its costs (see Section 7.5). However, since the hardware is less efficient than custom chips, the gains from FPGAs are more modest.
>-- **Computer Architecture** p. 542

Boy, that was a long quote. But as it says, in cases like the prototyping stage, where the quantity is usually not that high, making a custom chip is more expensive than it is worth.

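To see why volume matters, here is a toy break-even calculation with numbers I made up purely for illustration: suppose a custom chip costs \$2M in NRE but saves \$500 per unit compared with an FPGA doing the same job.

$$
\text{break-even volume} = \frac{\text{NRE}}{\text{savings per chip}} = \frac{\$2{,}000{,}000}{\$500/\text{chip}} = 4000\ \text{chips}
$$

Below roughly 4000 units, the FPGA wins on total cost despite being less efficient per chip; above that, the custom chip starts paying for itself.
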
One example of FPGAs being used as accelerators at scale is **Microsoft Catapult**.

![Microsoft Catapult](https://hackmd.io/_uploads/ByOrtS53lg.png =65%x)
-- Image of the Catapult V2 board, **Computer Architecture** p. 568

A bunch of these FPGAs were set up, each with its own function and circuit, in this configuration:
- One FPGA does Feature Extraction.
- Two FPGAs do Free-Form Expressions.
- One FPGA does a compression stage that increases scoring-engine efficiency.
- Three FPGAs do Machine-Learned Scoring.
- The remaining FPGA is a spare used to tolerate faults.

Besides implementing custom circuits, FPGAs also have Digital Signal Processors (DSPs) to help with computations such as FFT, convolution, addition, etc. In the last 5 years, Xilinx has even added AI Engines (AIE) alongside their DSPs to help with AI-related computations.

The way to look at it is **using the FPGA as an accelerator**. It might feel uncommon, but it was a thing before the GPU and AI hype happened. I still remember that after Intel's acquisition of Altera, Intel tried to market FPGAs as [accelerator cards](https://www.intel.com/content/www/us/en/content-details/649721/intel-fpga-programmable-acceleration-card-d5005-product-brief.html). It was not as popular as GPUs, I believe due to it being focused on the enterprise rather than the consumer market, as FPGAs can cost a fortune.

### Chapter 3 : Conclusion

GPUs are a good off-the-shelf solution (although recently the price is more like "oof-the-shelf") to obtain plenty of execution units (mainly floating-point units). Although **their main purpose is to do graphics processing**, they can be used for other computation as well, even more so recently as AI accelerators are embedded inside the GPU itself.

FPGAs are a good off-the-shelf solution to run a custom design/architecture/system without having to deal with the pain of making a custom chip, especially for small-scale/low-quantity implementations. The pain of designing the architecture is still there, though.

Making or buying a custom specialized chip is a good solution if the system is going to run a specific load, like running AI. Keep in mind that making a custom chip comes with its own set of problems; buying a pre-made custom chip will reduce that headache, for a price.

Be it a GPU, an FPGA or a custom chip, dedicated hardware is preferred when it delivers better performance than an integrated solution, saving money, time or both. (So if it does neither, please don't bother buying an accelerator in the first place.) As one of my friends said, dedicated hardware allows for more room compared to an integrated implementation: it can have a better energy budget, potentially better performance, a better cooler, more suitable memory, etc. Dedicated hardware also has its drawbacks, such as the need for an additional hardware interface (commonly PCIe), increased latency compared to an integrated implementation, more space required, incompatibility with certain hardware, etc.

But no matter the implementation, both share a common problem: the need for a programming language/software API (like CUDA or SYCL) to tap the hardware's capabilities. Without a proper API, the hardware's peak performance will never translate into real performance.

>A long-standing fallacy is assuming that your new computer is so attractive that programmers will rewrite their code just for your hardware.
> -- **Computer Architecture** p. 544

Now, after all that headache of trying to understand what a GPU is (like, really, what is a GPU anyway?) and how it ended up so closely related to AI,
let's wrap this up and come back to the question of "should I call a GPU a Bunch of Accelerators Card (BoAC)?".

## :pencil: Summary

GPU be like:

![image](https://media1.tenor.com/m/UyGcuf2CxkoAAAAd/homelander-im-better.gif =60%x)

Although adding more accelerators means that GPUs are going to be faster and better at more things, a GPU still needs a CPU to operate, as it is limited in the things it can do.

>GPUs are accelerators that supplement a CPU, so they do not need to be able to perform all the tasks of a CPU.
>-- **Computer Organization and Design** p. 514

The main purpose of a GPU is to do graphics operations, although lately that differs a bit due to the increasing number of accelerators in GPUs. Some are quality-of-life improvements, for example video encoders and decoders that speed up rendering or streaming. But even if an additional function is a welcome one, is it worth the increased price just to support, for example, these so-called "AI accelerators"? While some people, including myself, might not like it, they are out there, whether as real QoL improvements or as forced justifications. And while I don't mind some features (like the DLSS4 transformer model, FSR4 or XeSS), the frame-generation kind of thing is not a welcome one, at least for me. Others may or may not agree on that.

>... the primary ancestors of GPUs are graphics accelerators, as doing graphics well is the reason why GPUs exist.
>-- **Computer Architecture** p. 310

As much as I want to call it just a Bunch of Accelerators Card (BoAC), I don't think I can stop calling it a **GPU**. Just because a car can be used to charge a phone doesn't mean it stops being called a car, now does it?

:::info
Personally, I will still call them GPUs, at least until they stop being used as graphics processors.
:::

## :book: Sources

Hennessy, J. L., & Patterson, D. A. (2019). *Computer Architecture: A Quantitative Approach* (6th ed.). Morgan Kaufmann.

Patterson, D. A., & Hennessy, J. L. (2018). *Computer Organization and Design* (RISC-V ed.). Morgan Kaufmann.

Xia, Z., Hariyama, M., & Kameyama, M. (2015). Asynchronous Domino Logic Pipeline Design Based on Constructed Critical Data Path. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 23(4), 619-630. doi: 10.1109/TVLSI.2014.2314685

Chen, Z., Zheng, F., Guo, F., Yu, Q., & Chen, Z. (2023). Haica: A High Performance Computing & Artificial Intelligence Fused Computing Architecture. In Meng, W., Lu, R., Min, G., & Vaidya, J. (Eds.), *Algorithms and Architectures for Parallel Processing. ICA3PP 2022*. Lecture Notes in Computer Science, vol 13777. Springer, Cham. https://doi.org/10.1007/978-3-031-22677-9_13

Some of the memes either come directly from or are influenced by https://programmerhumor.io/memes/parallel-computing and https://programmerhumor.io/ai-memes. They're cool, please check them out :pray: