# The Mozart Reuse Exposed Dataflow Processor for AI and Beyond

###### tags: `Accelerator`

[TOC]

## Basic Info
- Time: [time=20220915]
- Paper Title and Link
    - [The Mozart Reuse Exposed Dataflow Processor for AI and Beyond](https://research.cs.wisc.edu/vertical/papers/2022/isca22-mozart.pdf#cite.isca2017%3Astream-dataflow)
    - [Video Link](https://www.youtube.com/watch?v=1uvda5U1gyw)

## Motivation and Problem Definition
::: info
TODO
- Describe the motivation behind the paper and the problem that this paper tries to solve
:::

### Problem Definition
The paper targets the bottlenecks that appear when GPUs run AI/ML workloads: data movement, data reuse, and low utilization/performance.
- **Data movement**
    - Demands ever higher bandwidth
- **Data reuse**
    - Avoid duplicate memory accesses when different cores read the same data
    - A broadcast mechanism solves this: request once and send to all tiles
- **Low utilization/performance** on current CPUs, GPUs, and even some static ASIC accelerators
    - CPUs and GPUs are not primitively fitted to AI/ML workloads, so utilization and performance suffer
    - Instead, treat data reuse and utilization as first-class concerns when designing the architecture

So, they propose a new architecture that exposes
1. Reuse present at other levels of algorithms, beyond batch-level parallelism
2. Abstractions close to ML algorithms, but not tied to deep-learning (DL) models

>1. The architecture’s goal is to achieve near-CPU-like flexibility while having ASIC-like efficiency for a large class of data-intensive workloads.
>2. As we looked at the needs of applications, their evolving behavior, and the low utilization/performance of GPUs, it became clear that exposing reuse present at other levels of algorithms could overcome the need for batch-level parallelism.
>3. 
>Second, exposing abstractions close to ML algorithms — but not tied to deep-learning (DL) models — would allow architecture/application fit even as the applications rapidly evolved. Data reuse is at the heart of ML/DL as data (weights/samples) is transmuted during training or inference.
>[name=Paper author]

## Solution Proposal and Result
::: info
TODO
- Simple version
    - You can summarize the proposed solutions and the evaluation results.
- Long version
    - If you want to keep notes as you read the paper, you can keep your notes in this section. But the most important thing is to help others understand what the paper has done.
:::

Tony Nowatzki

### Performance Evaluation
<center> <img src="https://i.imgur.com/X8njKl2.png" width=400/> </center>

### Key contributions
>- A behavior-oriented ISA design philosophy is critical for staying relevant in the face of rapid DNN algorithm evolution.

- They build a compiler for the behavior-oriented code; its IR is rich in data information.

>- The added programmer burden for accelerator architectures makes accessible performance specs increasingly important.

- They try to minimize the impact on the programmer.

>- There are surprisingly many dimensions of dataflow machines left to innovate on, including basic microarchitecture, hybrid cache/scratchpads with prefetching, interconnection networks, scalable spatial scheduling algorithms, and even just datapath modules.
>- Accurate early (pre-RTL) power estimates for novel microarchitectures could help reduce design time.
>- The software lift needed on non-novel pieces is quite high, thus the ease of software development is becoming a primary driver for future hardware.

- In other words, they argue for agile development methods, such as hardware generators.

## Key Takeaways and your comments
:::info
They proposed an architecture built around a "reuse exposed dataflow" primitive, viewed from two perspectives:
1. Execution model
2. 
ISA Specification
:::

### Execution Model
They view AI/ML workloads, or any workloads, through a dataflow-graph (DFG) abstraction. The DFG represents the data dependencies and computation instructions. Key terms:
1. Port: the interface between streams and the dataflow.
    - A stream connects to a port, and the port feeds one or several DFG instructions. After the DFG instructions finish their computation, they write the result to an *output port*.
2. DFG instructions

<center> <img src="https://i.imgur.com/JgWxqFo.png" width=300/> </center>

With these instructions and abstractions defined, the compiler can then lower a program into a DFG-based IR.

<center> <img src="https://i.imgur.com/NK94GCQ.png" width=300/> </center>

### ISA Specification
Data reuse and data movement are treated as primitives in the ISA specification. Two instructions implement the broadcast mechanism:
- Podcast: read data once on behalf of all tiles
- Prethrow: all tiles write data to a specific address

These two instructions can also act as synchronization or consistency-ordering points. The pitfall is that while they coordinate the tiles, they are also less flexible. They save **address generation overhead**, **cache accesses**, and **network traffic**.

- Config: it is also important to use the `CONFIG` commands to configure the CSCA (circuit-switched compute array)

The ISA as a whole is heterogeneous.

### Performance Specification
![](https://i.imgur.com/ZnaKEtx.png)

For different situations and algorithms, programmers can consult the performance specification to tune their programs based on stated assumptions and performance hints.
>Multiplying the DFG I/O by reuse ratio (streams abstractions make this easier to compute) gives the required bandwidth at each memory/network level — immediately revealing the bandwidth bottlenecks.
>[name=Author]

### Micro-Architecture
#### Overview
The image below shows the overall SoC design.
![](https://i.imgur.com/BmlD2mc.png)
- 4x16 two-dimensional mesh
- Packet-switching protocol
- 16 DFI-level channel controllers, with 8 channel controllers in each hemisphere, operating at a maximum bandwidth of 2000 Mbps
- EMM (edge memory manager)
    - HBM to AXI

#### HBM controller
An HBM controller with dual channels connects to the NIC.
![](https://i.imgur.com/HkEKmUM.png)
- The orange block is the edge memory manager (HBM to AXI)

#### Pod0 & Pod1
The pods are connected in a network-based topology.
![](https://i.imgur.com/qB9Iti2.png)

#### 2D control mesh
There is also a control mesh.
![](https://i.imgur.com/seaJDJG.png)

#### Tile architecture
![](https://i.imgur.com/kIr4PHG.png)
- Tile controller
    - Handles requests from the host
- Network controller
<center> <img src="https://i.imgur.com/RXLG357.png" width=200/> </center>

>The interesting/non-conventional pieces are hardware support structures for performing the podcast and prethrow operations.
>[name=Author]

- Uncore
<center> <img src="https://i.imgur.com/Fz7ElQM.png" width=200/> </center>

>The uncore is a collection of modules responsible for holding on-tile L2-cache storage, coordinating local and remote tile cache accesses, and synchronization.
>**They use the HRF consistency model, with programmers expected to use atomics to achieve cross-tile coherence and writing well-behaved code.**
>The uncore also includes the control state machines needed for podcast and prethrow operations.
>Each cache bank further features read and write transaction handling registers (TSHRs), enabling up to 16/16 in-flight memory reads/writes in parallel.

#### Softbrain
It consists of a circuit-switched compute array, which has similarities to a conventional coarse-grained reconfigurable architecture (CGRA).
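Stepping back to the podcast mechanism above, the claimed savings in address generation, cache accesses, and network traffic can be illustrated with a toy cost model. This is only a sketch under my own assumed message counts (one request plus one reply per transfer); the function names and counters are illustrative, not part of the Mozart ISA or hardware.

```python
# Toy cost model for "request once, send to all tiles" (podcast-style
# broadcast) versus every tile fetching the shared data on its own.
# All counters are illustrative assumptions, not Mozart measurements.

def per_tile_load(n_tiles: int) -> dict:
    """Baseline: every tile independently fetches the same shared data."""
    return {
        "address_generations": n_tiles,   # each tile computes the address itself
        "cache_accesses": n_tiles,        # the shared data is read once per tile
        "network_messages": 2 * n_tiles,  # a request and a reply per tile
    }

def podcast_load(n_tiles: int) -> dict:
    """Broadcast: one request, then the data is delivered to all tiles."""
    return {
        "address_generations": 1,         # a single request is formed
        "cache_accesses": 1,              # the shared data is read only once
        "network_messages": 1 + n_tiles,  # one request, one delivery per tile
    }

if __name__ == "__main__":
    n = 16  # e.g. the tiles of one pod
    base, bcast = per_tile_load(n), podcast_load(n)
    for metric in base:
        print(f"{metric}: {base[metric]} -> {bcast[metric]}")
```

Under these assumptions, with 16 tiles the broadcast cuts cache accesses from 16 to 1 and nearly halves network messages, which is the kind of saving the podcast/prethrow pair is meant to expose.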
<center> <img src="https://i.imgur.com/sYnmm4c.png" width=400/> </center>

#### Stream Engines
There are three stream engines:
- MSE (memory stream engine)
- SSE (scratchpad stream engine)
- DRE (data recurrence engine)

Each consists of read/write controllers and a command allocator. As the figure below shows, it can serve multiple requests in parallel.
<center> <img src="https://i.imgur.com/Y4NjR6i.png" width=500/> </center>

The address generator is command-based.
![](https://i.imgur.com/AfrzTMD.png)

#### Balance Unit
<center> <img src="https://i.imgur.com/HfSOcxj.png" width=400/> </center>

>A central arbiter and crossbar is responsible for switching streams via a round-robin mechanism to hide the latency of vector ports' fill-drain operation and also to allow multiple streams to be assigned to the MSE, SSE, and DRE.

#### Operator Scheduler
<center> <img src="https://i.imgur.com/8dyvtiF.png" width=400/> </center>

### Evaluation
#### PE utilization
The PE utilization is very low.
<center> <img src="https://i.imgur.com/trg4RV5.png" width=400/> </center>

#### Power, Area
<center> <img src="https://i.imgur.com/kER3owx.png" width=400/> </center>

## Comment
>[name=戴宏諺]
>I think they successfully integrated the `data reuse` and `utilization/perf` problems into their design philosophy. They define the execution model as a DFG, which I think is a good point: abstract the program not only as a bunch of instructions to optimize at the instruction level, but improve the whole workload against its bottlenecks, namely memory access, network traffic, and the utilization of each PE. They also use the **stream engine** to maximize dataflow, a concept like transforming an instruction-flow program into a dataflow design. That is, the flow of data is the first priority.
>As for disadvantages, I think the utilization problem is still not solved well. The authors run several SOTA (state-of-the-art) models, and each uses the chip inefficiently.
>I think this is because the CSCA is the bottleneck. The mapping problem is NP-hard, so when the innermost statement grows large enough, the CSCA cannot easily map all of the statements onto the hardware. It then has to configure the CSCA multiple times to complete the mapping (I think they handle this by splitting the statement into several pieces). They also report that the config/setup time is long (40 to 200 cycles), which is roughly 10 ns/cycle * 200 cycles = 2000 ns = 2 microseconds. In those 200 cycles a CPU could execute up to 200 instructions, while the CSCA is doing nothing but configuration.
>Still, I think the philosophy is good: use the DFG to abstract the problem. The mapping problem remains practically unsolvable today (maybe quantum computing can solve it).

<center> <img src="https://i.imgur.com/f2ETIQY.png" width=40%/> <img src="https://i.imgur.com/bTR7qAn.png" width=40%/> </center>

## Presentation video link
::: info
TODO
- please record a video to present the paper and share your thoughts
:::
- [Video Link](https://youtu.be/xbQ_6vzItBA)
    - Recommend using 1.25x playback speed

## References
::: info
TODO
- If you have read any references, you can put the link here. Teacher will check your references to see if
:::
- https://www.linleygroup.com/uploads/simple-machines-wp.pdf
- https://cloud.google.com/blog/products/ai-machine-learning/cloud-tpu-v4-mlperf-2-0-results