# VIA: A Smart Scratchpad for Vector Units with Application to Sparse Matrix Computations
[Paper Source](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9407226)
2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
* Exploits parallelism to improve performance
* While instruction-level and thread-level parallelism are extensively studied, there are still many unexplored opportunities to achieve significant performance and energy improvements from Data-Level Parallelism (DLP)
* DLP can be exposed to the hardware through vector computations, where a single instruction operates over multiple data streams (SIMD)
* This yields better performance, higher energy efficiency, and greater resource utilization
* Ultimately, the effectiveness of a vector architecture depends on the quality of the vectorized code
* Sparse matrix computations are difficult to vectorize, yet they are key kernels in High Performance Computing (HPC), AI, and big data workloads. Two important applications are:
* Sparse Matrix Vector Multiplication (SpMV):
    * A core component of the High Performance Conjugate Gradient (HPCG) benchmark, an alternative to LINPACK for ranking supercomputers
* It is also fundamental in AI applications such as Support Vector Machine computations via gradient descent
* Sparse Matrix Matrix Multiplication (SpMM)
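To make the vectorization difficulty concrete, here is a minimal SpMV kernel over the standard CSR (Compressed Sparse Row) format. This sketch is illustrative and not from the paper: the inner loop reads `x` indirectly through `col_idx` (a gather), and row lengths vary, which is exactly what frustrates automatic vectorization.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A stored in CSR format.

    The inner loop accesses x through col_idx (an indirect gather)
    and iterates a data-dependent number of times per row -- the two
    properties that make SpMV hard to vectorize efficiently.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i live in values[row_ptr[i]:row_ptr[i+1]]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # gather through col_idx
        y[i] = acc
    return y

# CSR encoding of [[2, 0, 1],
#                  [0, 3, 0],
#                  [4, 0, 5]]
values  = [2.0, 1.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```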
# Vector Architecture
[Source](https://www.nec.com/en/global/solutions/hpc/articles/tech01.html)
Vector architectures feature a large vector register file. The Vector Engine has 64 vector registers, and each register can accommodate 256 elements of 8-byte data (8 B × 256 = 2 KB per register), for 128 KB of vector register capacity in total (2 KB × 64).
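The capacity figures can be checked with a short calculation (the element size, register length, and register count are from the NEC article; the script itself is just illustrative arithmetic):

```python
ELEM_BYTES = 8       # 8-byte (64-bit) elements
ELEMS_PER_REG = 256  # elements per vector register
NUM_REGS = 64        # vector registers per core

reg_bytes = ELEM_BYTES * ELEMS_PER_REG       # bytes in one vector register
total_kb = reg_bytes * NUM_REGS / 1024       # whole register file, in KB

print(reg_bytes, total_kb)  # 2048 128.0
```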

A vector core consists of:
1. Vector Processing Unit (VPU): provides powerful computing capability and high-bandwidth memory access; its theoretical memory bandwidth per core exceeds 400 GB/s for load and for store each
2. Scalar Processing Unit (SPU): provides the basic functionality of a processor (fetch, decode, branch, add, exception handling, etc.) and controls the status of the complete core, including the VPU
3. Address generation and translation / data forwarding crossbar: the address generation and translation block and the request crossbar assemble memory load/store packets and forward them to the right port of the memory network, while the reply crossbar forwards reply packets from the memory network back to the 32 VPPs. These blocks are designed to keep the VPU operating continuously: in vector processing, the pre-load feature is essential to hide memory load latency and avoid starving the vector pipelines of data. When the address generation and translation block receives vector load instructions from the SPU in advance, it can perform address translation for multiple vector elements with separate memory addresses all at once, and then make and issue up to 17 memory packets simultaneously.
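As a toy model of the address-generation step (the 8-byte element size and the packet-batching idea mirror the description above, but the functions and packet-grouping policy here are hypothetical, not NEC's design): for an indexed vector load, the unit turns a base address plus per-element indices into one memory address per element, and those addresses can then be batched into packets issued to the memory network together.

```python
def gather_addresses(base, indices, elem_bytes=8):
    """One memory address per vector element: base + index * element size."""
    return [base + i * elem_bytes for i in indices]

def batch_into_packets(addresses, max_packets=17):
    """Group addresses into at most max_packets batches that could be
    issued to the memory network simultaneously (toy policy: even split)."""
    per_packet = -(-len(addresses) // max_packets)  # ceiling division
    return [addresses[i:i + per_packet]
            for i in range(0, len(addresses), per_packet)]

# 8 scattered elements of an 8-byte array starting at 0x1000
addrs = gather_addresses(0x1000, [0, 5, 3, 7, 2, 9, 1, 4])
packets = batch_into_packets(addrs, max_packets=4)
print(len(packets))  # 4
```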