# ARC Processors for AI
PPA = Performance, Power, Area
### DL SoC Design Challenges
- Specialized Processing
- Memory Performance
- Real-Time Connectivity
---
Heterogeneous, scalable hardware-software solution for high-accuracy vision processing
### Heterogeneous Processing
- Scalar, Vector, DNN Accelerator
### Scalable Processing Units
- One, two or four vision units
- Scalable CNNs
---
### EM
- low power (3uW/MHz)
- 3-stage pipeline
- small
---
### Convolutional Neural Networks (CNN)
key compute kernel is convolution
- 1D for voice data
- 2D for image
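
To make the 1D case concrete, here is a minimal sketch of a "valid" 1D convolution in plain C. It is written as a sliding dot product (the kernel is not flipped), which is what CNN layers usually mean by convolution. The name `conv1d_valid` and its parameters are illustrative and are not part of any ARC library.

```c
#include <stddef.h>

/* Hypothetical helper: "valid" 1D convolution as used in CNN layers
 * (sliding dot product, kernel not flipped).
 * Writes in_len - k_len + 1 outputs and returns that count,
 * or 0 if the kernel is empty or longer than the input. */
size_t conv1d_valid(const int *in, size_t in_len,
                    const int *kernel, size_t k_len,
                    int *out)
{
    if (k_len == 0 || k_len > in_len) {
        return 0;
    }
    size_t out_len = in_len - k_len + 1;
    for (size_t i = 0; i < out_len; i++) {
        int acc = 0;
        for (size_t k = 0; k < k_len; k++) {
            acc += in[i + k] * kernel[k];   /* one multiply-accumulate */
        }
        out[i] = acc;
    }
    return out_len;
}
```

The 2D image case follows the same pattern with a second pair of loops over rows and columns.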
### partitioning
1. Layer-based partitioning - High throughput
2. Frame-based partitioning - Maximum throughput
3. Spatial partition - reuse
4. Channel partitioning - reuse features across slices
---
## ARC EMxD for ML
### EMxD for always-on AIoT applications
- Always-on IoT
- Example always-on functions
  - Always-on voice command
  - Visual object detection
  - Health & fitness monitoring
- Low power consumption is a key requirement
---
400 MHz
Model inference time: 35 ms
Avg. power consumption: 2.5 mW
Peak power consumption: 43.6 mW
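
As a rough back-of-the-envelope check, the figures above can be turned into a cycle count and an energy bound. The sketch below assumes the core runs at the full 400 MHz for the whole 35 ms inference and that the peak figure really is the maximum draw; the function names are made up for illustration.

```c
#include <stdint.h>

/* Figures taken from the list above. */
#define CLOCK_MHZ            400u  /* core clock */
#define INFERENCE_MS         35u   /* model inference time */
#define PEAK_POWER_TENTH_MW  436u  /* 43.6 mW, stored in 0.1 mW units */

/* Clock cycles spent in one inference: MHz * ms * 1000. */
uint32_t inference_cycles(void)
{
    return CLOCK_MHZ * INFERENCE_MS * 1000u;
}

/* Upper bound on energy per inference in microjoules (mW * ms = uJ),
 * assuming the core never draws more than the peak figure. */
uint32_t inference_energy_bound_uj(void)
{
    return (PEAK_POWER_TENTH_MW * INFERENCE_MS) / 10u;
}
```

Under those assumptions one inference takes 14,000,000 cycles and uses at most about 1.5 mJ.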
MetaWare Development Toolkit
---
### compile & link
```
ccac -arcv2em main.c -o main.out
```
Optimizations
```
ccac -arcv2em -O main.c -o main.out
```
### Compile for ARC extension instructions
-Xswap
-Xmpy
-Xmpy16
-Xdiv_rem
-Xbs
-Xsa
-Xcd
```
ccac -arcv2em -Xsa main.c -o main.out
```
faster code execution, better density.
-Xtimer0/1
Specify the ARC EM timers
-Xtc
Specify the 64bit Real Time Counter
-Xstack_check
Tells compiler that hardware stack checking is enabled.
### DSP support
-Xdsp[1/2]
Enable DSP instructions.
-Xdsp_complex
Enable complex arithmetic DSP instructions.
-Xdsp_ctrl[=up | convergent, noguard | guard, preshift | postshift]
Fine-tune the compiler's assumptions about rounding, guard bits, and fractional product shift behavior.
-Hdsplib
Link in the DSP lib.
-Hfxapl
Use the Fixed Point API support lib.
-Hitut
Link to the ITU-T lib.
---
### Source Level Debugging
using `-g`
```
ccac -g -arcv2em main.c -o main.out
mdb -arcv2em main.out
```
### Host link
Runtime includes hostlink by default.
```
ccac -Hhostlib= main.c
```
---
### Assembler Access from C/C++
- Intrinsic functions
_nop(), _sr(), _lr()...
- Inline assembly
_ASM("sub %r0, %r0, %r1")
- Use SVR4 enhanced macros
_ASM int NAME()
---
inc → .h
src → .c / .cc
- make
Compile
- make flash
Generate the flash image
---
吉吉's command summary
### UART init
```c=
extern HX_DRV_ERROR_E hx_drv_uart_initial(HX_DRV_UART_BAUDRATE_E baud_rate);
// UART initialization API; call it before using the UART
```
The options for (HX_DRV_UART_BAUDRATE_E baud_rate) are listed below
```c=
UART_BR_9600 = 0, /**< UART baud rate 9600bps */
UART_BR_14400 = 1, /**< UART baud rate 14400bps */
UART_BR_19200 = 2, /**< UART baud rate 19200bps */
UART_BR_38400 = 3, /**< UART baud rate 38400bps */
UART_BR_57600 = 4, /**< UART baud rate 57600bps */
UART_BR_115200 = 5, /**< UART baud rate 115200bps */
UART_BR_230400 = 6, /**< UART baud rate 230400bps */
UART_BR_460800 = 7, /**< UART baud rate 460800bps */
UART_BR_921600 = 8, /**< UART baud rate 921600bps */
```
example:
```c=
hx_drv_uart_initial(UART_BR_115200);
// initialize the UART at 115200 baud
```
### UART send data
```c=
hx_drv_uart_print("URAT_GET_STRING_START\n");
hx_drv_uart_print("String cnt: %d\n\n", uart_rx_cnt); // %f is not supported
hx_drv_uart_print("Echo string: %s\n", uart_rx_str);
```
### UART get data
```c=
hx_drv_uart_getchar(&data_buf);
```
- The UART must be **initialized** before use
- Data can be sent and received only after initialization
- Recommended baud rate: 115200
(Tera Term VT > Setup > Serial port > Speed 115200)
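
The calls above can be combined into a simple receive-and-echo loop. The sketch below keeps the line-buffering logic in plain C so it can be tried on a host; `uart_line_t`, `uart_line_reset`, `uart_line_push` and `UART_RX_MAX` are hypothetical names, not part of the SDK. The target-side loop is shown only as a comment, and it assumes `hx_drv_uart_getchar` fills a `uint8_t`; whether it blocks, and what it returns when no byte is waiting, should be checked against the SDK.

```c
#include <stddef.h>
#include <stdint.h>

#define UART_RX_MAX 64  /* illustrative buffer size, not from the SDK */

/* Hypothetical helper: collects received bytes into a C string. */
typedef struct {
    char   str[UART_RX_MAX];  /* plays the role of uart_rx_str */
    size_t cnt;               /* plays the role of uart_rx_cnt */
} uart_line_t;

void uart_line_reset(uart_line_t *line)
{
    line->cnt = 0;
    line->str[0] = '\0';
}

/* Feed one received byte. Returns 1 when a line is ready
 * (CR or LF seen after at least one character, or buffer full), else 0. */
int uart_line_push(uart_line_t *line, uint8_t ch)
{
    if (ch == '\n' || ch == '\r') {
        return line->cnt > 0;          /* ignore empty lines */
    }
    if (line->cnt < UART_RX_MAX - 1) {
        line->str[line->cnt++] = (char)ch;
        line->str[line->cnt] = '\0';   /* keep the string terminated */
    }
    return line->cnt == UART_RX_MAX - 1;
}

/* On the target, the loop might look like this (assumption: the SDK
 * header that declares the hx_drv_uart_* functions is included):
 *
 *   hx_drv_uart_initial(UART_BR_115200);
 *   uart_line_t line;
 *   uart_line_reset(&line);
 *   for (;;) {
 *       uint8_t data_buf = 0;
 *       hx_drv_uart_getchar(&data_buf);
 *       if (uart_line_push(&line, data_buf)) {
 *           hx_drv_uart_print("String cnt: %d\n\n", (int)line.cnt);
 *           hx_drv_uart_print("Echo string: %s\n", line.str);
 *           uart_line_reset(&line);
 *       }
 *   }
 */
```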
---
### Accelerometer init
```c=
hx_drv_accelerometer_initial();
```
### Accelerometer FIFO count get
```c=
// Check how many samples are waiting in the accelerometer FIFO.
// Each count represents one set of x-axis, y-axis, z-axis data.
available_count = hx_drv_accelerometer_available_count();
```
### Get 1 packet from the accelerometer FIFO
```c=
float x, y, z;
hx_drv_accelerometer_receive(&x, &y, &z);
```
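
The three accelerometer calls are typically used together: initialize once, ask how many samples are waiting, then read them one by one. The sketch below shows one possible way to average the samples drained from the FIFO. `accel_mean_t` and its functions are hypothetical helpers in plain C; the target-side loop appears only as a comment and assumes the available count fits in an `int`.

```c
#include <stddef.h>

/* Hypothetical helper: accumulates x/y/z samples and reports their mean. */
typedef struct {
    float  sum_x, sum_y, sum_z;
    size_t count;
} accel_mean_t;

void accel_mean_reset(accel_mean_t *m)
{
    m->sum_x = m->sum_y = m->sum_z = 0.0f;
    m->count = 0;
}

void accel_mean_add(accel_mean_t *m, float x, float y, float z)
{
    m->sum_x += x;
    m->sum_y += y;
    m->sum_z += z;
    m->count++;
}

/* Returns 1 and writes the per-axis mean, or returns 0 and leaves
 * the outputs untouched if no samples were added. */
int accel_mean_get(const accel_mean_t *m, float *x, float *y, float *z)
{
    if (m->count == 0) {
        return 0;
    }
    *x = m->sum_x / (float)m->count;
    *y = m->sum_y / (float)m->count;
    *z = m->sum_z / (float)m->count;
    return 1;
}

/* On the target, draining the FIFO might look like this:
 *
 *   hx_drv_accelerometer_initial();
 *   accel_mean_t m;
 *   accel_mean_reset(&m);
 *   int available_count = hx_drv_accelerometer_available_count();
 *   for (int i = 0; i < available_count; i++) {
 *       float x, y, z;
 *       hx_drv_accelerometer_receive(&x, &y, &z);
 *       accel_mean_add(&m, x, y, z);
 *   }
 */
```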
---
### ARC EM9D
EM9D is a RISC core with DSP ISA extensions.
- Vector MUL/MAC operations
Increase number of MAC operations per cycle.
- Zero overhead loops
Eliminate loop management instructions.
- XY memory and Address Generation Units (AGU)
Access weights and feature-map data implicitly, using instruction-level parallelism.
AGU registers allow implicit load, unpack, sign extension, store, and address pointer update, all in a single cycle.
Suited to low-power IoT applications that use CNNs or RNNs.
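
The kind of loop these features target is a fixed-point multiply-accumulate over weights and feature-map data. Below is a minimal portable sketch of a Q15 dot product; `dot_q15` is an illustrative name, and the code uses no MetaWare intrinsics. Whether a given compiler maps it onto vector MACs, zero-overhead loops, or the AGUs depends on the toolchain and flags and is not claimed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of two Q15 vectors (16-bit fixed point, 15 fractional bits).
 * Each product is Q30; products are summed in a 64-bit accumulator,
 * scaled back to Q15 and saturated to the int16_t range. */
int16_t dot_q15(const int16_t *weights, const int16_t *features, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)weights[i] * (int32_t)features[i];  /* one MAC */
    }
    acc /= 32768;  /* Q30 -> Q15, truncating toward zero */
    if (acc > INT16_MAX) {
        acc = INT16_MAX;
    }
    if (acc < INT16_MIN) {
        acc = INT16_MIN;
    }
    return (int16_t)acc;
}
```

For example, with 0.5 represented as 16384, two terms of 0.5 × 0.5 sum to 0.5, so the function returns 16384.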
---
TensorFlow Lite for Microcontrollers (TFLM)
workflow
1. Train a model
2. Run inference
benefits of using TFLM
- easier graph mapping
- use familiar TF tooling
- access to optimized MLI kernels
---
CNN
- Convolution
- Activation Function
- Pooling
- Fully Connected
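
Two of these building blocks are simple enough to sketch in a few lines of C. The functions below are illustrative only (floating point, row-major layout, even dimensions assumed for pooling) and are not taken from any particular library.

```c
#include <stddef.h>

/* ReLU activation, applied in place: negative values become 0. */
void relu(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (data[i] < 0.0f) {
            data[i] = 0.0f;
        }
    }
}

/* 2x2 max pooling with stride 2 on a row-major h x w feature map.
 * h and w are assumed even; out must hold (h/2) * (w/2) values. */
void maxpool2x2(const float *in, size_t h, size_t w, float *out)
{
    for (size_t r = 0; r < h / 2; r++) {
        for (size_t c = 0; c < w / 2; c++) {
            const float *p = &in[(2 * r) * w + 2 * c];  /* top-left of window */
            float m = p[0];
            if (p[1] > m)     m = p[1];      /* top-right */
            if (p[w] > m)     m = p[w];      /* bottom-left */
            if (p[w + 1] > m) m = p[w + 1];  /* bottom-right */
            out[r * (w / 2) + c] = m;
        }
    }
}
```

A fully connected layer is a dot product per output neuron, and the convolution layer is the sliding dot product described earlier.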
---
T06