# Accelerator Architecture
###### tags: `Accelerators`
## Central computation architecture

* Multiple filters are sent into the PE array to enable parallel computation.
* Benefits large-kernel CNN computation; the array must be reconstructed when computing small-kernel CNNs.
## Sparse computation architecture

* The Computing Unit (CU) Engine Array is made of 16 convolution units with 3x3 kernels, which benefits small-kernel convolution operations and simplifies the data flow.
* When computing a kernel size larger than 3x3, a kernel decomposition technique is needed.
### Kernel decomposition
1. If the filter's kernel size is not a multiple of 3, zero-pad the weights.
2. The extended filter is decomposed into several 3x3-sized filters.
3. The output results of the decomposed filters are summed together.
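The three steps above can be sketched in NumPy. This is a functional model only, not the hardware data flow: the kernel is zero-padded to a multiple of 3, split into 3x3 tiles (one per convolution unit), and the partial outputs are summed. The function names and the naive reference convolution are illustrative assumptions.

```python
import numpy as np

def conv2d_valid(inp, ker):
    """Naive 2-D valid cross-correlation (reference implementation)."""
    kh, kw = ker.shape
    oh, ow = inp.shape[0] - kh + 1, inp.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(inp[y:y + kh, x:x + kw] * ker)
    return out

def conv2d_decomposed(inp, ker, tile=3):
    """Compute a KxK convolution as a sum of 3x3 convolutions,
    mirroring the kernel-decomposition steps above."""
    k = ker.shape[0]
    k_pad = -(-k // tile) * tile                 # round K up to a multiple of 3
    w = np.zeros((k_pad, k_pad))
    w[:k, :k] = ker                              # step 1: zero-pad the weights
    pad = k_pad - k
    inp_p = np.pad(inp, ((0, pad), (0, pad)))    # keep shifted views in range
    oh, ow = inp.shape[0] - k + 1, inp.shape[1] - k + 1
    out = np.zeros((oh, ow))
    for ti in range(0, k_pad, tile):             # step 2: split into 3x3 tiles
        for tj in range(0, k_pad, tile):
            shifted = inp_p[ti:ti + oh + tile - 1, tj:tj + ow + tile - 1]
            out += conv2d_valid(shifted, w[ti:ti + tile, tj:tj + tile])
    return out                                   # step 3: sum partial results
```

A 5x5 kernel, for example, is padded to 6x6 and decomposed into four 3x3 tiles, each mapping to one convolution unit of the CU Engine Array.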


## DianNao

* Reduces the number of memory accesses.
* NFU-1: 16x16 = 256 multipliers
* NFU-2: 16 adder trees (each an 8-4-2-1 reduction)
* NFU-3: 16 activation units
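The three NFU stages can be modeled functionally as follows. This is a minimal sketch: the 16x16 shapes follow the bullets above, the pairwise halving loop mimics the 8-4-2-1 adder tree, and the sigmoid activation is an assumption (DianNao's activation units are configurable).

```python
import numpy as np

def nfu_cycle(inputs, weights):
    """One pass through DianNao's three NFU pipeline stages.
    inputs: (16,) activations; weights: (16, 16) synapses per neuron."""
    # NFU-1: 16x16 = 256 parallel multiplications
    products = weights * inputs                  # broadcast over 16 neurons
    # NFU-2: 16 adder trees, each an 8-4-2-1 pairwise reduction
    partial = products
    while partial.shape[1] > 1:
        partial = partial[:, 0::2] + partial[:, 1::2]
    sums = partial[:, 0]
    # NFU-3: 16 activation units applied in parallel (sigmoid assumed)
    return 1.0 / (1.0 + np.exp(-sums))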
## ShiDianNao



### Comparison with the TPU
* Signals are wired directly into each PE, so the PE array cannot scale to a large size.
* The direction of data transfer can differ on every pass.

| Name | Target |
| ---------- | ------ |
| DaDianNao | Datacenter scenario |
| ShiDianNao | CNN applications |
| PuDianNao | General machine-learning accelerator |
## Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture

### NVDLA Style

### Non-Uniform Work Partitioning
* PEs closer to the data producers perform more work to maximize physical data locality, while PEs farther away do less work to reduce tail-latency effects.
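One way to sketch this partitioning idea: give each PE a share of the output rows that shrinks with its hop distance from the data producer. The inverse-distance weighting is a hypothetical cost model chosen for illustration, not Simba's actual tiling heuristic.

```python
def partition_rows(total_rows, distances):
    """Split total_rows across PEs, favoring PEs closer to the producer.
    distances: hop distance of each PE from the data producer."""
    weights = [1.0 / (1 + d) for d in distances]   # closer PE -> larger share
    scale = total_rows / sum(weights)
    shares = [int(w * scale) for w in weights]
    # hand any leftover rows (from truncation) to the closest PEs first
    for i in sorted(range(len(shares)), key=lambda i: distances[i]):
        if sum(shares) == total_rows:
            break
        shares[i] += 1
    return shares
```

For three PEs at hop distances 0, 1, and 3, ten rows split as [6, 3, 1]: the nearest PE does the most work and the farthest PE the least, keeping most traffic local.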
### Communication-Aware Data Placement
* In a large-scale MCM system, where on-chip buffers are spatially distributed among the chiplets, communication latency becomes highly sensitive to the physical location of data.
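A minimal sketch of the placement idea, assuming a 2-D chiplet grid and Manhattan hop distance as the latency proxy: each tensor is placed in the buffer that minimizes total distance to the PEs consuming it. The greedy strategy and coordinate model are illustrative assumptions, not the paper's exact algorithm.

```python
def place_data(consumers, buffers):
    """Greedy communication-aware placement.
    consumers: {tensor: [(row, col) of each consuming PE]}
    buffers:   [(row, col) of each distributed on-chip buffer]"""
    placement = {}
    for tensor, pes in consumers.items():
        # pick the buffer minimizing total Manhattan hops to all consumers
        placement[tensor] = min(buffers, key=lambda b: sum(
            abs(b[0] - p[0]) + abs(b[1] - p[1]) for p in pes))
    return placement
```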
### Cross-Layer Pipelining
* Consecutive network layers are pipelined across chiplets, so a layer's outputs are consumed by the next layer as they are produced, improving utilization.
## Heterogeneous Dataflow Accelerators for Multi-DNN Workloads

* Benefits:
    * Selective scheduling: each layer runs on its preferred sub-accelerator.
    * Layer parallelism: multiple layers can run concurrently on different sub-accelerators.
    * Low hardware cost for dataflow flexibility: no reconfiguration is needed.
* Challenges:
    * Reduced parallelism for each layer
    * Contention for shared memory and NoC bandwidth
    * Scheduling under memory and dependence constraints to minimize dark silicon
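Selective scheduling reduces, at its simplest, to an argmin over per-layer latency estimates. The sketch below assumes a precomputed cost table (layer x sub-accelerator) and ignores the memory, bandwidth, and dependence constraints listed above; the dataflow names are hypothetical.

```python
def selective_schedule(layer_costs):
    """Map each layer to its preferred sub-accelerator.
    layer_costs: {layer: {sub_accelerator: estimated latency}}"""
    return {layer: min(costs, key=costs.get)
            for layer, costs in layer_costs.items()}

# Hypothetical costs: "ws" = weight-stationary, "os" = output-stationary
costs = {"conv1": {"ws": 5, "os": 9},
         "fc1":   {"ws": 8, "os": 3}}
```

Here `conv1` would run on the weight-stationary sub-accelerator and `fc1` on the output-stationary one, matching each layer to the dataflow it prefers.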

## My work
1. Run multiple models on a single accelerator
2. Architecture?
3. Need to consider QoS
# Reference
> https://www.intechopen.com/chapters/58659
> https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9407116
> https://ieeexplore.ieee.org/document/7284058
> https://dl.acm.org/doi/10.1145/3352460.3358302
> https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8114708