# Wire-Aware Architecture and Dataflow for CNN Accelerators
## Introduction
###### tags: `Accelerators`
###### paper origin: MICRO '52
###### paper: [link](https://dl.acm.org/doi/10.1145/3352460.3358316)
### Motivation
* Many accelerators expend significant energy fetching operands from various levels of the memory hierarchy.
* Eyeriss requires non-trivial storage for scratchpads and registers per PE to maximize reuse.
* Many accelerators access large monolithic buffers/caches as the next level of their hierarchy.
* Eyeriss has a 108 KB global buffer
* Google TPUv1 has a 24 MB input buffer
### Problem
* Data movement is orders of magnitude more expensive than the cost of compute.
* At 28nm, a 64-bit fp multiply-add consumes 20 pJ
* transmitting the corresponding operand bits across the chip length consumes 15x more
* accessing a 1MB cache consumes 50x more
* fetching those bits from off-chip LPDDR consumes 500x more
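The ratios above can be sanity-checked with simple arithmetic (the multipliers are from the list; variable names are purely illustrative):

```python
# Energy costs at 28 nm, relative to a 20 pJ 64-bit FP multiply-add,
# using the multipliers quoted above. All values in picojoules.
MAC_PJ = 20.0
cross_chip_pj = 15 * MAC_PJ   # operand bits traversing the chip length
cache_1mb_pj = 50 * MAC_PJ    # accessing a 1 MB cache
lpddr_pj = 500 * MAC_PJ       # fetching those bits from off-chip LPDDR
```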
### Proposed Solution
* We create a new wire-aware accelerator, WAX, that implements a deep and distributed memory hierarchy to favor short wires.
* We introduce a novel family of dataflows that perform a large slice of computation with high reuse and with data movement largely confined within a tile.
## Architecture
### A Wire-Aware Accelerator (WAX)

* Conventional large caches are typically partitioned into several subarrays, connected with an H-Tree network.
* W Register: maintain weights.
* P Register: maintain partial sums.
* A Register: maintain activations, with shifting capabilities.
* This design has two key features:
1. Reuse and systolic dataflow are achieved with a shift register, which ensures that operands move over very short wires.
2. The next level of the hierarchy is an adjacent subarray (of size, say, 8 KB), which is much cheaper to access than the large buffers of the TPU or Eyeriss.
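As a rough behavioral sketch (not the authors' RTL; the row width, data, and function names are illustrative assumptions), the shift-and-MAC pattern of the W/A/P registers looks like:

```python
# Behavioral sketch of the per-subarray register datapath: a W register
# for weights, a shifting A register for activations, and a P register
# for partial sums. Row width and values are illustrative.

ROW = 8  # illustrative row width; WAX rows are wider

def mac_step(a_reg, w_reg, p_reg):
    """One cycle: pairwise multiply A and W, accumulate into P."""
    return [p + a * w for p, a, w in zip(p_reg, a_reg, w_reg)]

def shift_right(a_reg, fill=0):
    """A-register shift: operands move one entry over short wires."""
    return [fill] + a_reg[:-1]

# Three cycles of compute with one right-shift between them.
a = list(range(1, ROW + 1))
w = [1] * ROW
p = [0] * ROW
for _ in range(3):
    p = mac_step(a, w, p)
    a = shift_right(a)
```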
### Efficient Dataflow for WAX (WAXFlow 1)

#### Placing data
* We first fill the subarray with one row of input feature maps (R0).
* We then place the first element of 32 kernels in row R2. Similarly, other elements of the kernels are placed in other rows of the subarray.
* Some rows of the subarray are used for partial sums.
#### Computation of the first slice
1. The first row of input feature maps (R0) is read into the activation register A, and the first row of kernel weights (R2) is read into the weight register W.
2. The pair-wise multiplications of the A and W registers produce partial sums for the first green-shaded diagonal of the output feature maps. These are written into row R128 of the subarray.
3. The activation register then performs a right-shift.
4. Another pair-wise multiplication of A and W is performed to yield the next right-shifted diagonal of partial sums. (This repeats for a total of 3 times, yielding partial sums for the entire top slice of the output feature maps, saved in rows R128-159.)
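The four steps above can be sketched as a short loop. This is a simplified model (the subarray is a dict of rows; writing each shifted diagonal to its own psum row is my simplification of the R128-159 layout; the data is made up):

```python
# Minimal sketch of the WAXFlow-1 first-slice loop: load activation and
# weight rows into registers, multiply pairwise, write psums into the
# subarray, shift the A register, and repeat three times.

WIDTH = 32  # row width / number of kernels, per the text

def waxflow1_first_slice(subarray):
    a = list(subarray["R0"])   # activation row -> A register
    w = list(subarray["R2"])   # weight row     -> W register
    for step in range(3):      # three shifted diagonals of the top slice
        psums = [ai * wi for ai, wi in zip(a, w)]
        subarray[f"R{128 + step}"] = psums  # psum row written each cycle
        a = [0] + a[:-1]       # right-shift the A register
    return subarray

sa = {"R0": list(range(WIDTH)), "R2": [2] * WIDTH}
sa = waxflow1_first_slice(sa)
```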
#### Computation of the next slice
1. A new row of kernel weights (R3) is read into the W register.
2. The computations performed in this next slice continue to accumulate into the same green-shaded partial sums computed in the first slice.
#### Summary
* Reuse a row of kernel weights for 32 consecutive cycles.
* A row of input activations is reused for 96 consecutive cycles before it is discarded.
* Each partial sum is revisited once every 32 cycles.
#### Drawback
* Partial sums are accessed from the subarray every cycle, causing a significant energy overhead.
### WAXFlow-2

#### Placing data
* Each row of the subarray is split into P partitions. Each partition has input feature maps corresponding to different channels.
* We find that energy is minimized with **P=4**.
* The first row of activations, R0, contains the first 8 ifmap elements from each of four channels.
* The first filter row, R2, is also partitioned into four channels.
#### Computation of the first cycle
* After the pair-wise multiplications of R0 and R2 in the first cycle, the results of the 0th, 8th, 16th, and 24th multipliers are added together.
* The 1st, 9th, 17th, and 25th multiplier results are added, yielding the next element of the ofmap.
* This cycle produces partial sums for the eight diagonal elements of the top slice, which are saved in the P register.
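The cross-partition reduction described above can be sketched as follows (data and names are illustrative; the indexing follows the 0th/8th/16th/24th pattern in the text):

```python
# Sketch of the WAXFlow-2 intra-cycle aggregation: a 32-wide row is
# split into P=4 channel partitions of 8 elements each, and the i-th
# multiplier result of every partition is summed, yielding 8 partial
# sums per cycle (indices 0/8/16/24, then 1/9/17/25, and so on).

P, PART = 4, 8  # partitions per row, elements per partition

def waxflow2_cycle(acts, wts):
    products = [a * w for a, w in zip(acts, wts)]
    # add the i-th multiplier result of every partition together
    return [sum(products[p * PART + i] for p in range(P))
            for i in range(PART)]

acts = [1] * (P * PART)
wts = list(range(P * PART))
psums = waxflow2_cycle(acts, wts)  # 8 partial sums for the P register
```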
#### Next cycle
* The A register first performs a shift (within each channel).
* The results of the multiplications are added to produce eight new partial sums that are stored in different entries of the P register.
#### Next steps
* After 4 cycles, the P register contains 32 partial sums that can now be written into a row of the subarray.
* After 8 cycles, the channels in the A register have undergone a full shift, and we are ready to load new rows into the A and P registers.
#### Compared to WAXFlow-1
* Because partial sums are buffered in the P register, the subarray has idle cycles; some of the other data movement can be overlapped with slice computation.
* WAXFlow-2 is better in terms of both latency and energy.
### WAXFlow-3
*We saw that WAXFlow-2 introduced a few adders so that some intra-cycle aggregation can be performed, reducing the number of psum updates in the subarray. We now extend that opportunity so that psum accesses to the subarray can be further reduced.*

#### Placing data
* a row of weights from a single kernel is placed together in one kernel row partition.
#### Computation
* The multiplications performed in a cycle first undergo an intra-partition aggregation, followed by an inter-partition aggregation. Thus, a single cycle produces only 2 partial sums.
* It takes 16 cycles to fully populate the P register, after which it is written into the subarray.
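The two-level reduction can be sketched as below. The group sizes and the pairing of partitions in the inter-partition step are my assumptions for illustration; only the "2 psums per cycle, 16 cycles to fill the P register" behavior is from the text:

```python
# Sketch of the WAXFlow-3 two-level reduction: each partition's products
# are first summed (intra-partition), then partition sums are combined
# (inter-partition), so a cycle yields only 2 partial sums.

P, PART = 4, 8  # partitions per row, elements per partition

def waxflow3_cycle(acts, wts):
    products = [a * w for a, w in zip(acts, wts)]
    # intra-partition aggregation: one sum per partition
    part_sums = [sum(products[p * PART:(p + 1) * PART]) for p in range(P)]
    # inter-partition aggregation (pairing assumed): 2 psums per cycle
    return [part_sums[0] + part_sums[1], part_sums[2] + part_sums[3]]

# 16 cycles fill a 32-entry P register at 2 psums/cycle, after which
# the register contents are written into the subarray.
p_register = []
for _ in range(16):
    p_register += waxflow3_cycle([1] * 32, [1] * 32)
```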

## Methodology
* Baseline: Eyeriss
* area & energy values: verilog, 28nm FDSOI Technology
* Energy & area of SRAM subarray and the H-tree interconnects: CACTI
* Performance: simulator
* Workload: VGG-16, ResNet-34, MobileNet

## Result
### Performance

* The figure includes a breakdown for all layers of VGG-16
* In Eyeriss, data movement and computations in PEs cannot be overlapped; WAXFlow-3 spends a few consecutive cycles where the MACs read/write only the registers and do not read/write the subarray.
* Figure 8c shows that the data movement for partial-sum accumulation in WAX cannot be completely hidden and increases for later layers.
### Energy

* Scratchpad and register-file energy is dominant in Eyeriss.
* Local subarray access (SA) is the dominant contributor for WAX. Without the limited partial-sum updates enabled by WAXFlow-3, this component would have been far greater.


* For deeper layers, the number of activations reduces and the number of kernels increases; this causes an increase in remote subarray access because kernel weights fetched from the remote subarray see limited reuse and activation rows have to be fetched for each row of kernel weights.
* The impact of adding more banks (and hence more MACs) on WAX throughput and energy consumption:
* Throughput scales well until 32 banks and then starts to degrade because of network bottlenecks from replicating ifmaps across multiple subarrays, and because of the sequential nature and large size of the H-Tree.