# Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads
###### tags: `Accelerators`
###### paper origin: ISCA2020
###### paper:[link](https://dl.acm.org/doi/abs/10.1109/ISCA45697.2020.00023)
###### video:[link1](https://www.youtube.com/watch?v=uJbTEXFAsLs) [link2](https://www.youtube.com/watch?v=nBvvurYy1kk)
###### slides:[link](https://mlhardware.github.io/2020/groq.pdf)
## 1. Introduction
* Motivation
* Computationally intensive deep learning algorithms continue to grow in size and complexity, presenting serious scalability, performance, and usability challenges for traditional CPU and GPU architectures.
* Problems
* Although conventional architectures provide abundant ALUs, the hardware complexity of their microarchitectures makes it difficult to reason about runtime stalls.
* While microarchitectural enhancements such as caches, branch predictors, and prefetchers help improve performance, they do not bound worst-case performance.
* Solution
* Rethink the conventional chip-multiprocessor organization, resulting in a new architecture centered on the tensor abstraction: the *Tensor Streaming Processor (TSP)*.
## 2. TSP overview
### Functional Slicing

* Conventional chip multiprocessor (CMP): Fig. 1(a)
* each tile is an independent core, interconnected via the on-chip network to exchange data with other cores.
* each core has the usual 5 pipeline stages **IF, ID, EX, MEM, WB**, writing results back to the GPRs
* each tile is a heterogeneous collection of functional units, but the chip is globally homogeneous (every tile is identical)
* Tensor Streaming Processor (TSP): Fig. 1(b)
* the TSP reorganizes the homogeneous two-dimensional mesh of cores into a *functionally sliced* microarchitecture.
* 5 types of functional units
* instruction control and dispatch (ICU)
* memory (MEM)
* integer arithmetic (INT)
* floating-point arithmetic (FPU)
* network (NET)

* This organization naturally decomposes instruction flow along the vertical dimension and data flow along the horizontal dimension as data passes over the different function types.
### Parallel lanes and streams
* Data parallelism for each slice's SIMD execution is provided via a programming abstraction called *parallel lanes*.
* *Instructions* flow Northward from the ICUs to the functional slices, while data flows East and West between functional slices.

* Functional slices interact with streams of data in a *producer-consumer* fashion: they *consume* operands from streams and *produce* results onto streams.
* Streams are implemented in hardware by a chip-wide *stream register file (SR)*; they transport operands and results between slices.
## 3. Architecture Overview

* 320-lane programming abstraction
* each tile in the on-chip mesh operates on a 16-lane unit called a *superlane*
* a *superlane* represents the architecture's minimum vector length of 16 elements
* the vertical composition of 20 tiles to form a functional slice produces a maximum vector length of 20×16 = 320 elements
* 144 independent instruction queues (ICUs)
* each can issue one or more instructions per cycle
* compiler has explicit control of the *program order* in each instruction queue
* 64 logical streams per lane
* for moving operands or results on-chip
* 32 streams flow Eastward and 32 streams flow Westward
* 220 MiBytes of globally shared SRAM
* 32 bytes per lane of stream bandwidth
* instruction control unit (ICU)
* instruction fetch
* inter-slice synchronization
* vector execution module (VXM)
* consists of a 4x4 mesh of ALUs in each lane for point-wise arithmetic operations
* matrix execution module (MXM)
* consists of four independent 2D MACC (multiply-accumulate) arrays that operate on int8 or fp16 data types
* switch execution module (SXM)
* for intra-superlane and inter-lane switching by rearranging elements of vectors.
* memory module (MEM)
* composed of 44 parallel slices of SRAM per hemisphere
* each slice provides 13 bits of physical addressing of 16-byte memory words, with each byte mapping to a lane, for a total of 220 MiBytes of on-chip SRAM (see the sketch after this list)
* Chip-to-chip (C2C) module
* exchange vectors between a pair of chips.
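
As a quick arithmetic check of the headline numbers above, the sketch below recomputes the maximum vector length and the on-chip SRAM capacity from the tile, slice, and addressing figures in this list (how the 13-bit address space and the 16-byte words are shared across the 20 superlanes is my reading of the bullets):

```python
# Sanity-check the headline numbers from this section.

TILES_PER_SLICE = 20        # vertical stack of tiles forming one functional slice
LANES_PER_SUPERLANE = 16    # one tile operates on a 16-lane superlane

# Maximum vector length: 20 tiles x 16 lanes per tile
max_vector_len = TILES_PER_SLICE * LANES_PER_SUPERLANE
assert max_vector_len == 320

# On-chip SRAM: 2 hemispheres x 44 MEM slices; each slice has a 13-bit
# address space (2**13 addresses), and each address holds a 16-byte word
# per superlane across the slice's 20 superlanes.
MEM_SLICES = 2 * 44
ADDRESSES_PER_SLICE = 2 ** 13
BYTES_PER_ADDRESS = 16 * TILES_PER_SLICE    # 16 bytes x 20 superlanes = 320 bytes

sram_bytes = MEM_SLICES * ADDRESSES_PER_SLICE * BYTES_PER_ADDRESS
assert sram_bytes == 220 * 2 ** 20                      # 220 MiBytes in total
assert sram_bytes // MEM_SLICES == int(2.5 * 2 ** 20)   # 2.5 MiBytes per slice
print("max vector length:", max_vector_len, "SRAM MiB:", sram_bytes / 2 ** 20)
```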
### Parallel stream programming model

* TSP's programming model is a producer-consumer model where each functional slice acts as a consumer and a producer of one or more streams
* a vector read from main memory is given a stream identifier (0…31) and a direction: *eastward* or *westward*
* once the vector is read into a stream register, it becomes a *stream*
* the stream flows in the given direction, passing through functional slices (see the sketch below)
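
A minimal Python sketch of this producer-consumer stream abstraction (illustrative stand-ins only, not Groq's actual API; the class, field, and function names are hypothetical):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Stream:
    """A vector in flight on the stream register file (illustrative only)."""
    stream_id: int     # logical stream identifier, 0..31
    direction: str     # "eastward" or "westward"
    data: List[int]    # up to 320 elements

def mem_read(vector: List[int], stream_id: int, direction: str) -> Stream:
    # A MEM slice acts as a producer: the vector is placed on a stream
    # register and flows in the chosen direction past other slices.
    assert 0 <= stream_id <= 31 and direction in ("eastward", "westward")
    return Stream(stream_id, direction, vector)

def vxm_add(a: Stream, b: Stream, out_id: int) -> Stream:
    # A VXM slice acts as a consumer of two streams and a producer of one.
    return Stream(out_id, a.direction, [x + y for x, y in zip(a.data, b.data)])

x = mem_read([1, 2, 3], stream_id=4, direction="eastward")
y = mem_read([10, 20, 30], stream_id=5, direction="eastward")
print(vxm_add(x, y, out_id=6).data)   # [11, 22, 33]
```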
### Memory model
* Memory is partitioned into two hemispheres, each having 44 slices
* Each MEM slice comprises 20 tiles arranged in a vertical stack, yielding a 2.5 MiByte per-slice capacity.
* Slices of memory are partitioned into 16-byte words, each spread across a superlane, with each byte of a word occupying a lane of an input *channel* or an output *feature*.
### Staggered instruction execution
* An instruction executes as a SIMD operation on stream-supplied operand vectors (up to 320 elements).
* The 320-element SIMD instruction is pipelined across the vertical stack of tiles in the slice.
* At its scheduled time, the instruction is issued to the bottom-most tile of the slice. In the subsequent cycle it propagates to the next tile northward, which in turn executes it on the next 16-element superlane of the operand vectors.
* The data for successive 16-element superlanes lag by one cycle, accommodating the pipelined execution of an MXM instruction (see the sketch below).
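
A small, non-cycle-accurate simulation of the staggered issue described above (the `staggered_add` helper and its bookkeeping are hypothetical; only the 20-tile / 16-element-superlane structure comes from the text):

```python
# Staggered SIMD execution: the instruction enters the bottom tile at its
# scheduled cycle and moves one tile northward per cycle; each tile handles
# one 16-element superlane of the (up to 320-element) operand vectors.

TILES = 20
SUPERLANE = 16

def staggered_add(a, b, issue_cycle=0):
    result = [0] * len(a)
    timeline = []                      # (cycle, tile) pairs, for illustration
    for tile in range(TILES):
        lo, hi = tile * SUPERLANE, (tile + 1) * SUPERLANE
        if lo >= len(a):
            break
        cycle = issue_cycle + tile     # each superlane lags the previous by one cycle
        result[lo:hi] = [x + y for x, y in zip(a[lo:hi], b[lo:hi])]
        timeline.append((cycle, tile))
    return result, timeline

vec = list(range(320))
out, tl = staggered_add(vec, vec)
print(out[:4], tl[0], tl[-1])          # [0, 2, 4, 6] (0, 0) (19, 19)
```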
### Chaining functional slices
* Each functional slice has a predefined set of instructions (Read, Write, Add, Mul, etc) that define its supported operations.
* A more complex sequence of operations is composed of one or more slices coordinating in a producer-consumer manner. Each functional slice can choose the direction of its result stream (a stream can be logically "turned around").
* A composite function, F, is an amalgam of several functional slices chained together, as in the sketch below.
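
A sketch of how a composite function F could be expressed as chained producer-consumer slices (the helper names are made up; plain Python functions stand in for MXM/VXM slices, and the logical "turn-around" is only noted in a comment):

```python
# Compose F = relu(W @ x + b) from chained "slices": MEM produces operand
# streams, MXM and VXM consume and produce streams, and MEM finally
# consumes the result stream.  Plain Python stand-ins only.

def matmul_slice(weights, x):           # MXM: consumes weight/activation streams
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def add_slice(a, b):                    # VXM: point-wise add
    return [u + v for u, v in zip(a, b)]

def relu_slice(a):                      # VXM: point-wise non-linearity
    return [max(0, v) for v in a]

def composite_F(weights, bias, x):
    # Each stage's output stream is the next stage's input stream.
    # On chip, the final result stream can be logically "turned around"
    # (east <-> west) so it flows back toward the MEM slice that stores it.
    y = matmul_slice(weights, x)
    y = add_slice(y, bias)
    return relu_slice(y)

print(composite_F([[1, 2], [3, 4]], [0, -10], [1, 1]))   # [3, 0]
```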
### Scalable vectors
* The number of elements in each vector can vary from 16 elements all the way to 320 elements.
* The ISA provides instructions to configure each tile for a low-power mode, effectively powering down any unused superlanes to reduce power consumption (see the sketch below).
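
A sketch of the superlane-granular vector-length configuration implied above (the helper is hypothetical; only the 16-element superlane granularity and the 20-superlane maximum come from the text):

```python
import math

SUPERLANE = 16       # architectural minimum vector length
MAX_SUPERLANES = 20  # 20 tiles per functional slice

def superlane_config(vector_len):
    """Return (active, powered_down) superlane counts for a given vector length."""
    assert 1 <= vector_len <= SUPERLANE * MAX_SUPERLANES
    active = math.ceil(vector_len / SUPERLANE)
    return active, MAX_SUPERLANES - active

print(superlane_config(16))    # (1, 19)  minimum vector length
print(superlane_config(100))   # (7, 13)  vector lengths round up to whole superlanes
print(superlane_config(320))   # (20, 0)  maximum vector length
```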
## 4. Instruction Set

* The TSP's ISA exposes temporal information about each instruction to allow the compiler precise control of each instruction's dispatch time.
* d~func~: **functional delay**
* The d~func~ timing parameter allows the compiler to reason about when the output of an instruction will be available on the stream registers.
* d~skew~: **instruction-operand skew**
* The d~skew~ parameter on each instruction informs the compiler how to schedule the operand arrival times with the instruction dispatch time in order to get them to properly *intersect* in time and space.
* The compiler is solving a two-dimensional scheduling of instructions and data in both time and space.
* The *execution time* of an instruction: T~exec~ = N + d~func~ + δ(j,i)
* (N: the number of tiles in the functional slice, i.e., the cycles spent staggering the instruction up the slice; δ(j,i): the distance between SR~i~ and SR~j~ in cycles)
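
A worked example under the timing expression above (the numbers and the helper are made up; only the roles of N, d~func~, and δ(j,i) come from the text, and d~skew~ is kept out of the result-time calculation):

```python
# Compiler-side bookkeeping under the expression above: the result appears
# on the destination stream register N + d_func + delta(j, i) cycles after
# the instruction is dispatched.  (d_skew is handled separately: it tells
# the compiler how to line up operand arrival with the dispatch cycle.)

def result_ready_cycle(dispatch_cycle, n_tiles, d_func, delta_ji):
    return dispatch_cycle + n_tiles + d_func + delta_ji

# Example: an instruction dispatched at cycle 100 on a 20-tile slice with
# d_func = 1, whose result travels 5 stream-register hops to its consumer.
print(result_ready_cycle(100, 20, 1, 5))   # 126
```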
## 5. ResNet-50
### ResNet50 performance
* An inference query completes in < 43 µs, yielding a throughput of 20.4K images per second (IPS), a **2.5×** speedup relative to the Google TPU v3.
* The TSP has an inference latency of only 49 µs for a single image, nearly a **5×** reduction in end-to-end latency compared to Intel/Habana's Goya inference chip (see the check below).
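
A quick consistency check, assuming that at batch size 1 with deterministic execution the throughput is simply the reciprocal of the per-image latency:

```python
latency_s = 49e-6                 # reported single-image (batch-1) latency
throughput_ips = 1.0 / latency_s
print(round(throughput_ips))      # 20408 -> about 20.4K images per second
```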
### Deterministic Performance
* The TSP's hardware eliminates arbiters and other reactive elements in the data path, making performance deterministic and precisely predictable from run-to-run execution.
* Based on the current ResNet-50 implementation, the projected ResNet-101 throughput is 14.3K IPS and the projected ResNet-152 throughput is 10.7K IPS.