# Towards Deep Learning using TensorFlow Lite on RISC-V
###### tags: `Accelerators`
###### paper origin: CARRV2019
###### papers: [link](https://projects.iq.harvard.edu/files/edge/files/carrv_workshop_submission_2019_camera_ready.pdf)
###### slides and video: `none`
# 1. INTRODUCTION
## Problem
* The net application-level speed-up is determined by the relative computational complexities of the components listed below as well as the **overheads associated with communication between the host and the accelerator**.
* pre-processing the inputs to render them consumable by a neural network
* running a neural network inference using these inputs
* post-processing the predictions generated by the network.
* Applications that **involve frequent data and/or control exchanges between the host and accelerator end up severely under-utilizing the accelerator** and **may not see a net benefit from offloading work from the host**.
## Solution
* Our solution hinges on **developing ISA extensions customized for machine learning kernels** and **designing a custom in-pipeline execution unit** for these specialized instructions.
* Overview of the software infrastructure: **"intrinsics" are implemented as C inline assembly functions** (a hedged sketch follows below).
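As a hedged illustration (a minimal sketch, not the paper's actual code), such an "intrinsic" can be a thin C wrapper around RVV inline assembly. The mnemonics below follow the ratified RVV 1.0 specification, whereas the paper targeted an early draft of the extension:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative intrinsic-style wrapper: element-wise int32 vector add.
 * Assumes the compiler emits no vector code of its own, so clobbering
 * v0-v2 behind its back is safe in this sketch. */
static inline void vec_add_i32(const int32_t *a, const int32_t *b,
                               int32_t *c, size_t n) {
    size_t vl;
    while (n > 0) {
        __asm__ volatile(
            "vsetvli %0, %1, e32, m1, ta, ma\n\t" /* vl = min(n, VLMAX) */
            "vle32.v v0, (%2)\n\t"                /* load a[0..vl)      */
            "vle32.v v1, (%3)\n\t"                /* load b[0..vl)      */
            "vadd.vv v2, v0, v1\n\t"              /* elementwise add    */
            "vse32.v v2, (%4)"                    /* store c[0..vl)     */
            : "=r"(vl)
            : "r"(n), "r"(a), "r"(b), "r"(c)
            : "memory");
        n -= vl; a += vl; b += vl; c += vl;
    }
}
```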

## Contributions
* They **cross-compiled the TensorFlow Lite source code for the RISC-V ISA and executed .tflite models on Spike**.
* With the infrastructure in place, they **generated a binary that can run on a RISC-V processor with micro-architectural support for the RISC-V V ISA extension**.
# 2. Implementation
* These instructions are supported using C inline assembly functions.

* A subset of the RISC-V vector ISA extension is implemented in their software ecosystem.

## Compiler support for ISA extensions
* The C inline assembly functions are **compiled** into assembly code using the **RISC-V GCC toolchain**.
* The assembly code is then converted into machine code by the **GNU assembler**.
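One common workaround when the assembler predates a draft extension (shown here as an assumption about how such flows are typically wired up, not as the paper's documented method) is to emit the raw 32-bit encoding directly:

```c
/* Hypothetical example: emitting an instruction the assembler does not
 * recognize by writing its raw encoding with the .word directive.
 * 0x0000000B lies in the RISC-V custom-0 opcode space and is only a
 * placeholder, not a real RVV encoding. */
#define EMIT_CUSTOM0() __asm__ volatile(".word 0x0000000B")
```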
## Instruction simulation support on Spike ISS
* Spike is a **RISC-V Instruction Set Simulator** that implements a functional model of a RISC-V processor.
* **Spike is a functional simulator that ignores internal delays such as I/O accesses or memory transactions. Therefore, the simulations are not cycle accurate**.

## RISC-V target for TensorFlow Lite
* **TensorFlow Lite is a lightweight deep learning framework for mobile and embedded devices**.
* The TensorFlow Lite source code has two implementations of machine learning kernels such as convolution and depthwise convolution: reference_ops and optimized_ops.
* The **reference_ops** implementation is portable, hardware-independent, and uses standard C/C++ libraries.
* **optimized_ops** is a hardware-specific, optimized implementation of the kernel operations using the gemmlowp and Eigen libraries and other processor-specific optimizations.
* For example, **in the case of ARM processors**, the **optimized_ops implementation leverages the gemmlowp and Eigen libraries and Neon** instructions to optimize kernel operations.
* To support a RISC-V target for TensorFlow Lite, they modified some functions in reference_ops to remove library dependencies not supported by Newlib.
* The **C inline assembly functions** were used to construct SIMD-aware optimized functions for the optimized_ops implementation for RISC-V vector processors (an illustrative building block follows below).
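As a hedged illustration of such a SIMD-aware building block (a minimal sketch under the ratified RVV 1.0 mnemonics, not code from the paper), a vectorized int32 dot product could serve as the inner loop of a convolution kernel:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper for optimized_ops-style kernels: int32 dot product
 * over n elements, reduced with vredsum. Assumes the compiler emits no
 * vector code of its own, so clobbering v0-v3 in the sketch is safe. */
static inline int32_t vec_dot_i32(const int32_t *x, const int32_t *w,
                                  size_t n) {
    int32_t acc = 0;
    size_t vl;
    while (n > 0) {
        int32_t partial;
        __asm__ volatile(
            "vsetvli %0, %2, e32, m1, ta, ma\n\t" /* vl = min(n, VLMAX) */
            "vle32.v v0, (%3)\n\t"                /* load inputs        */
            "vle32.v v1, (%4)\n\t"                /* load weights       */
            "vmul.vv v2, v0, v1\n\t"              /* element products   */
            "vmv.s.x v3, zero\n\t"                /* clear accumulator  */
            "vredsum.vs v3, v2, v3\n\t"           /* sum all products   */
            "vmv.x.s %1, v3"                      /* move sum to scalar */
            : "=r"(vl), "=r"(partial)
            : "r"(n), "r"(x), "r"(w)
            : "memory");
        acc += partial;
        n -= vl; x += vl; w += vl;
    }
    return acc;
}
```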

# 3. Results
* They used gem5 in full-system mode with the ARM A-class, 4-stage-pipeline High-Performance In-order (HPI) core configuration.
* The ARM HPI core was configured with a 16 KB L1 I-cache, a 16 KB L1 D-cache, and no L2 cache.
* For RISC-V, RV-base and RV-opt represent the RISC-V cross-compiled binaries of TensorFlow Lite using reference_ops and optimized_ops, respectively.
* The Rocket core is configured with a 16 KB L1 I-cache, a 16 KB L1 D-cache, and no L2 cache, as the current version of Rocket Chip does not support an L2 cache.
* Comparison of committed instructions, cycles, and IPC for ARM-base, **RV-base-v1 without loop optimization**, and **RV-base-v2 with loop optimization** for four variants of MobileNet. Here, MobileNet-v1 (0.25, 128) denotes the MobileNet-V1 model with a 128x128-pixel input and a 0.25 depth multiplier. The depth multiplier scales the number of channels in each layer; for example, a layer with 256 channels in the full model has 256 x 0.25 = 64 channels.

* List of deep learning models used in the evaluation. CONV = convolution layer, LSTM = Long Short-Term Memory.

* Number of committed instructions for RV-base-v2, ARM-base, RV-opt-v1 (optimized with 128-bit registers), ARM-opt, and RV-opt-v2 (optimized with 256-bit registers) for various deep learning models.

* In deep learning models where CONV layers dominate, RV-opt-v1 consistently commits fewer instructions than ARM-opt.
* In models where LSTM layers dominate, ARM-opt consistently commits fewer instructions than RV-opt-v1.
# 4. Conclusion
* In this paper, they present the software infrastructure they developed to support compilation and execution of machine learning models used in the TensorFlow Lite framework.
* On average, they achieved an **8X reduction in the number of committed instructions** using the RV-opt-v1 implementation in comparison to RV-base.
* They see an additional **~2X reduction** in the number of committed instructions using RV-opt **with a 256-bit register width (compared with 128 bits)**.
# 5. Ours
* Environment
* VPU: Vicuna
* RTL simulation: VCS (C++, SystemVerilog)
* Power simulation: Destiny
* Benchmark: RVV-conv
* Architecture

* Our cache:
* No tag is required; only two pointers (SPs) are needed to point into memory (one for reads, one for writes). A minimal C model follows after this list.
* No TLB.
* No alignment requirement.
* Higher hit rate than the original cache (the original miss rate is about 31%).
* Pros: saves SRAM area, faster access (no tag comparison is needed), higher scalar and vector data hit rates, fewer memory accesses, a better fit for our VRF, and good support for out-of-order loads and stores.
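A minimal C model of the tagless idea, assuming the two pointers simply delimit the contiguous read and write windows currently mirrored in SRAM (the names and sizes here are illustrative, not the actual RTL):

```c
#include <stdint.h>

#define WINDOW_BYTES (16 * 1024)   /* illustrative SRAM window size */

typedef struct {
    uint8_t   sram[WINDOW_BYTES];
    uintptr_t rd_base;   /* SP #1: base of the region mirrored for reads  */
    uintptr_t wr_base;   /* SP #2: base of the region mirrored for writes */
} tagless_cache_t;

/* A "hit" is a single range check against a pointer: no per-line tag
 * array is stored and no tag comparison is performed. */
static inline int rd_hit(const tagless_cache_t *c, uintptr_t addr) {
    return addr >= c->rd_base && addr < c->rd_base + WINDOW_BYTES;
}

static inline int wr_hit(const tagless_cache_t *c, uintptr_t addr) {
    return addr >= c->wr_base && addr < c->wr_base + WINDOW_BYTES;
}
```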
* Our Vector register file
* Supports the latest RVV `vsetvli`.
* Synthesizable.
* One load/store instruction can move multiple data elements (in our case, `transport_times = (VL x SEW) / BANDWIDTH`; see the worked example after this list).
* The interface design references AXI4.
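A worked instance of that formula, with illustrative values: for VL = 32 elements, SEW = 32 bits, and a 128-bit bus, one vector load/store completes in (32 x 32) / 128 = 8 transfers:

```c
#include <stdio.h>

int main(void) {
    unsigned vl = 32;          /* vector length in elements (example) */
    unsigned sew = 32;         /* element width in bits (example)     */
    unsigned bandwidth = 128;  /* bus width in bits per transfer      */
    unsigned transport_times = (vl * sew) / bandwidth;
    printf("transport_times = %u\n", transport_times);  /* prints 8 */
    return 0;
}
```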

* Our VPU
* Simulated with VCS (C++).
* Two pipeline stages.
* Our API(future work)
* Control the vector register file, using instructions such as `vrgather` and `vlxuei?` to implement convolution and acceleration methods such as weight-stationary dataflow (a hedged sketch follows below).
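A hedged sketch of the kind of `vrgather`-based rearrangement such an API could emit (the function name and index pattern are assumptions, not the planned implementation):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative im2col-like step: dst[i] = src[idx[i]] for one vector's
 * worth of elements, using vrgather.vv to reorder inputs into the layout
 * a convolution window expects. */
static inline void gather_window(const int32_t *src, const uint32_t *idx,
                                 int32_t *dst, size_t n) {
    size_t vl;
    __asm__ volatile(
        "vsetvli %0, %1, e32, m1, ta, ma\n\t"
        "vle32.v v0, (%2)\n\t"           /* source elements  */
        "vle32.v v1, (%3)\n\t"           /* gather indices   */
        "vrgather.vv v2, v0, v1\n\t"     /* permute by index */
        "vse32.v v2, (%4)"               /* store result     */
        : "=r"(vl)
        : "r"(n), "r"(src), "r"(idx), "r"(dst)
        : "memory");
}
```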
* Power simulation
* Our python simulator
* Provides a functional simulation of our design.
* Our dataflow (old version)


* Our expected results

* Conclusion:
* We aim to design a C++ API that helps programmers generate dataflows better suited to convolution. We modified the original L1 cache to work without tags, separated the regions used by scalar instructions from those used by vector instructions, and made several fine-grained adjustments so that the cache accesses memory less often, lowering power consumption. Our approach also reduces how often the vector register file accesses the cache; by cutting cache accesses and simplifying the cache to lower its power draw, we save a large amount of access energy in both the cache and the memory.
* Pros:
* Reduces the number of memory accesses
* Reduces the number of cache accesses
* Reduces cache power consumption
* Preserves the general-purpose nature of RISC-V, achieving our goals without adding unnecessary extra instructions.
* Cons:
* The API generates a large number of vector-register-file access instructions such as `vrgather`, which increases overall computation time (cycle count) and adds a small amount of extra power consumption.
* Currently only convolution (conv) is supported.