# ARC Processors for AI

PPA = Performance, Power, Area

### DL SoC Design Challenges

- Specialized Processing
- Memory Performance
- Real-Time Connectivity

---

Heterogeneous with Scalability

Hardware-Software Solution for High Accuracy Vision Processing

### Heterogeneous Processing

- Scalar, Vector, DNN Accelerator

### Scalable Processing Units

- One, two or four vision units
- Scalable CNNs

---

### EM

- low power (3 µW/MHz)
- 3-stage pipeline
- small

---

### Convolutional Neural Networks (CNN)

The key compute kernel is convolution:

- 1D for voice data
- 2D for images

### Partitioning

1. Layer-based partitioning
   - High throughput
2. Frame-based partitioning
   - Maximum throughput
3. Spatial partitioning
   - reuse
4. Channel partitioning
   - reuse features across slices

---

## ARC EMxD for ML

### EMxD for always-on AIoT applications

- Always-on IoT
- Example always-on functions
  - Always-on voice command
  - Visual object detection
  - Health & fitness monitoring
- Low power consumption is the key requirement

---

- 400 MHz
- Model inference time: 35 ms
- Average power consumption: 2.5 mW
- Peak power consumption: 43.6 mW

MetaWare Development Toolkit

---

### Compile & link

```
ccac -arcv2em main.c -o main.out
```

Optimizations

```
ccac -arcv2em -O main.c -o main.out
```

### Compile for ARC extension instructions

-Xswap -Xmpy -Xmpy16 -Xdiv_rem -Xbs -Xsa -Xcd

```
ccac -arcv2em -Xsa main.c -o main.out
```

Extension instructions give faster code execution and better code density.

- `-Xtimer0/1`: Specify the ARC EM timers.
- `-Xtc`: Specify the 64-bit Real Time Counter.
- `-Xstack_check`: Tell the compiler that hardware stack checking is enabled.

### DSP support

- `-Xdsp[1/2]`: Enable DSP instructions.
- `-Xdsp_complex`: Enable complex arithmetic DSP instructions.
- `-Xdsp_ctrl[=up | convergent, noguard | guard, preshift | postshift]`: Fine-tune the compiler's assumptions about the rounding, guard bit, and fractional product shift behavior.
- `-Hdsplib`: Link in the DSP library.
- `-Hfxapi`: Use the Fixed Point API support library.
- `-Hitut`: Link to the ITU-T library.
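The rounding, guard-bit, and fractional-product shift terms in the DSP control option above come from fixed-point arithmetic. The following is a minimal portable C sketch of a Q15 dot product, meant only to illustrate those three ideas. It does not use the MetaWare Fixed Point API or any ARC intrinsic, and the names `q15_t`, `sat_q15`, and `dot_q15` are hypothetical. The 64-bit accumulator stands in for hardware guard bits, and the rounding shown is round-half-up, which is an assumption about what "up" means rather than a statement about the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Q15 format: a signed 16-bit integer v represents the fraction v / 32768. */
typedef int16_t q15_t;

/* Saturate a wide intermediate value to the Q15 range. */
static q15_t sat_q15(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (q15_t)v;
}

/* Dot product of two Q15 vectors (hypothetical helper, not a library call).
 * - Each Q15 x Q15 product is a Q30 value.
 * - The 64-bit accumulator plays the role of guard bits: the running sum
 *   cannot overflow for any realistic vector length.
 * - The shift back to Q15 happens once, after accumulation ("postshift"),
 *   with round-half-up applied by adding half an LSB first.
 * Note: >> on a negative value is assumed to be an arithmetic shift,
 * which holds on mainstream compilers. */
static q15_t dot_q15(const q15_t *a, const q15_t *b, size_t n)
{
    assert(a != NULL && b != NULL);

    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += (int32_t)a[i] * (int32_t)b[i];   /* Q30 product */

    acc += (int64_t)1 << 14;                    /* half of a Q15 LSB in Q30 */
    return sat_q15(acc >> 15);                  /* Q30 -> Q15, saturated */
}
```

For example, with both vectors equal to `{16384, 16384}` (that is, 0.5 and 0.5), `dot_q15` returns 16384, which is 0.5 in Q15. A single product of -1.0 by -1.0 (`-32768 * -32768`) would be +1.0, which Q15 cannot represent, so the result saturates to 32767.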
---

### Source-level debugging using `-g`

```
ccac -g -arcv2em main.c -o main.out
mdb -arcv2em main.out
```

### Host link

The runtime includes hostlink by default.

```
ccac -Hhostlib= main.c
```

---

### Assembler access from C/C++

- Intrinsic functions: `_nop()`, `_sr()`, `_lr()`...
- Inline assembly: `_ASM("sub %r0, %r0, %r1")`
- SVR4 enhanced macros: `_ASM int NAME()`

---

inc → .h
src → .c / .cc

- `make`: build
- `make flash`: generate the flash image

---

吉吉's command summary

### UART init

```c
// UART initialization API; call it before using any other UART function.
extern HX_DRV_ERROR_E hx_drv_uart_initial(HX_DRV_UART_BAUDRATE_E baud_rate);
```

The options for `HX_DRV_UART_BAUDRATE_E baud_rate` are listed below:

```c
UART_BR_9600 = 0,   /**< UART baud rate 9600bps */
UART_BR_14400 = 1,  /**< UART baud rate 14400bps */
UART_BR_19200 = 2,  /**< UART baud rate 19200bps */
UART_BR_38400 = 3,  /**< UART baud rate 38400bps */
UART_BR_57600 = 4,  /**< UART baud rate 57600bps */
UART_BR_115200 = 5, /**< UART baud rate 115200bps */
UART_BR_230400 = 6, /**< UART baud rate 230400bps */
UART_BR_460800 = 7, /**< UART baud rate 460800bps */
UART_BR_921600 = 8, /**< UART baud rate 921600bps */
```

Example:

```c
hx_drv_uart_initial(UART_BR_115200); // initialize the UART at 115200 baud
```

### UART send data

```c
hx_drv_uart_print("URAT_GET_STRING_START\n");
hx_drv_uart_print("String cnt: %d\n\n", uart_rx_cnt); // %f is not supported
hx_drv_uart_print("Echo string: %s\n", uart_rx_str);
```

### UART get data

```c
hx_drv_uart_getchar(&data_buf);
```

- The UART must be **initialized** before use.
- Data can be sent and received only after initialization.
- Recommended baud rate: 115200 (Tera Term VT > Setup > Serial port > Speed 115200)

---

### Accelerometer init

```c
hx_drv_accelerometer_initial();
```

### Get accelerometer FIFO count

```c
// Check how many samples are waiting in the accelerometer FIFO.
// Each count represents one set of x-axis, y-axis, z-axis data.
available_count = hx_drv_accelerometer_available_count();
```

### Get one packet from the accelerometer FIFO

```c
float x, y, z;
hx_drv_accelerometer_receive(&x, &y, &z);
```

---

### ARC EM9D

EM9D is a RISC core with DSP ISA extensions.

- Vector MUL/MAC operations
  Increase the number of MAC operations per cycle.
- Zero-overhead loops
  Eliminate loop management instructions.
- XY memory and Address Generation Units (AGU)
  Access weights and feature map data implicitly, using instruction-level parallelism.
  AGU registers allow implicit load, unpack, sign extension, store, and address pointer update, all in a single cycle.

Suited to low-power IoT applications that use CNNs or RNNs.

---

TensorFlow Lite for Microcontrollers (TFLM) workflow

1. Train a model
2. Run inference

Benefits of using TFLM

- easier graph mapping
- use familiar TF tooling
- access to optimized MLI kernels

---

CNN

- Convolution
- Activation Function
- Pooling
- Fully Connected

---

T06
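The four CNN building blocks listed above (convolution, activation function, pooling, fully connected) can be chained into a toy forward pass. The sketch below is plain floating-point C for illustration only. It is not TFLM or embARC MLI code, every function name is hypothetical, and the kernel and weights are made-up values chosen so the result is easy to check by hand. It uses a 1D convolution, matching the note that 1D convolution suits voice data.

```c
#include <assert.h>
#include <stddef.h>

/* 1D "valid" convolution as CNN frameworks compute it (no kernel flip).
 * Writes n - k + 1 outputs into out and returns that count. */
static size_t conv1d(const float *in, size_t n,
                     const float *w, size_t k, float *out)
{
    assert(k > 0 && n >= k);
    size_t m = n - k + 1;
    for (size_t i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < k; ++j)
            acc += in[i + j] * w[j];
        out[i] = acc;
    }
    return m;
}

/* Activation function: ReLU, applied in place. */
static void relu(float *x, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        if (x[i] < 0.0f)
            x[i] = 0.0f;
}

/* Max pooling with window 2 and stride 2; a trailing odd element is dropped. */
static size_t maxpool2(const float *in, size_t n, float *out)
{
    size_t m = n / 2;
    for (size_t i = 0; i < m; ++i) {
        float a = in[2 * i];
        float b = in[2 * i + 1];
        out[i] = (a > b) ? a : b;
    }
    return m;
}

/* Fully connected layer with a single output neuron. */
static float dense1(const float *in, const float *w, size_t n, float bias)
{
    float acc = bias;
    for (size_t i = 0; i < n; ++i)
        acc += in[i] * w[i];
    return acc;
}

/* Toy forward pass over a 6-sample input. All weights are made up. */
static float tiny_cnn_forward(const float in[6])
{
    static const float kernel[3] = {1.0f, 0.0f, -1.0f};
    static const float fc_w[2] = {0.5f, 2.0f};
    float conv_out[4];
    float pool_out[2];

    size_t n = conv1d(in, 6, kernel, 3, conv_out);   /* 6 samples -> 4 */
    relu(conv_out, n);
    n = maxpool2(conv_out, n, pool_out);             /* 4 values -> 2 */
    return dense1(pool_out, fc_w, n, 1.0f);
}
```

For the input `{1, 3, 2, 0, 5, 1}`, the convolution yields `{-1, 3, -3, -1}`, ReLU turns that into `{0, 3, 0, 0}`, pooling reduces it to `{3, 0}`, and the fully connected layer returns 2.5. On an EM9D target the inner multiply-accumulate loops are the part that vector MAC instructions and zero-overhead loops would speed up.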