Classification:

1. RowClone: end-to-end design for bulk copy (and initialization) on the DRAM row buffer
2~5. ASIC accelerators for automata / neural networks
6~7. Side-channel attacks on ASIC PIM (mainly on simulators)
8~9. Two famous ReRAM accelerators, digital + analog
10. Multi-party computation / encrypted computation (arithmetic secret sharing, × and +)
11. Dynamic data resolution (existing smart memory already has compute power)
12. A 14K-lines-of-code work: end-to-end FPGA processing-in-"storage"

Takeaways:

1. Wherever there is storage, you can compute in (or near) place (at different levels: storage ~ memory ~ cache).
2. Which scenarios need PIM? The memory wall (use the roofline model to prove it); not only NNs (e.g., resolution reduction).
3. The side-channel work is a pile of crap squeezed out of simulators.
4. Directions: (1) custom hardware (FPGA/ASIC) running old algorithms (AES/NN/general computing); (2) old hardware, new uses (bitline computing); (3) mining the PIS capability of commercial hardware (smart memory).

# 1. RowClone

https://dl.acm.org/doi/pdf/10.1145/2540708.2540725

## Summary

The paper proposes RowClone, a new and simple mechanism to perform bulk copy and initialization completely within DRAM. In general, a bulk copy flows "src row -> row buffer -> dst row". In detail, if the source and destination rows are within the same subarray, FPM (Fast Parallel Mode) is used; otherwise, for an intra-/inter-bank copy, PSM (Pipelined Serial Mode) is used (both modes are sketched in code after this section). Bulk zeroing and other initialization are built on top of bulk copy. For a bulk copy, RowClone reaches about 12x lower latency and 74x lower DRAM energy. RowClone is also easy to implement: only two new ISA instructions need to be added.

## Strengths

1. It improves both bandwidth/performance and energy in single-core and multi-core systems.
2. It provides granularity control at the subarray and bank levels.
3. Bulk initialization reuses bulk copy; a dedicated Zero Zone serves zeroing operations.
4. It is inexpensive at the hardware level and easy to expose at the ISA level.

## Weaknesses

1. The evaluation is based on simulators instead of real-world hardware, which may cause deviation.
2. Only FPM reaches the best performance, under the assumption that the copy always stays within one subarray; PSM performs badly in latency.

## Ideas

1. Knowing that FPM reaches a 12x latency reduction but PSM only 1.9x, is it possible to improve PSM for intra-/inter-bank copies? I think it is difficult, because there is clearly a trade-off between "bulk copy" and "flexible copy". With additional, more complex hardware, it seems possible to satisfy both bulk and small copies in DRAM.
2. Is it possible to extend the "bulk-wise" idea from simple bulk initialization to bulk computation? In <Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-volatile Memories>, a similar bulk-wise idea is used: inter-bank and inter-subarray operations on bulk operands.
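As a reminder of the two copy modes, here is a minimal sketch of my own (not the paper's implementation) modeling FPM vs. PSM latency; the constants `SAME_SUBARRAY_LATENCY_NS` and `PER_CHUNK_LATENCY_NS` are illustrative placeholders, not measured values.

```python
# Toy model of RowClone's two copy modes; latency numbers are made up.
SAME_SUBARRAY_LATENCY_NS = 90   # FPM: two back-to-back row activations
PER_CHUNK_LATENCY_NS = 8        # PSM: one read + one write per 64B chunk
ROW_SIZE_BYTES = 8192
CHUNK_BYTES = 64

def rowclone_copy_latency(src_subarray: int, dst_subarray: int) -> int:
    """Estimate the latency of copying one DRAM row, in nanoseconds."""
    if src_subarray == dst_subarray:
        # FPM: activate src (row latched into the row buffer), then
        # activate dst while the row buffer still holds the data.
        return SAME_SUBARRAY_LATENCY_NS
    # PSM: pipeline cache-line-sized transfers over the internal bus,
    # so latency scales with the row size instead of staying constant.
    chunks = ROW_SIZE_BYTES // CHUNK_BYTES
    return chunks * PER_CHUNK_LATENCY_NS

print(rowclone_copy_latency(0, 0))  # FPM: fast, one-shot
print(rowclone_copy_latency(0, 3))  # PSM: much slower, chunk by chunk
```

The gap between the two return values is exactly the FPM-vs-PSM trade-off discussed in Idea 1.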
# 5. Neural Cache

https://ieeexplore.ieee.org/iel7/8401306/8416802/08416842.pdf

## Summary

The authors propose a novel accelerator design called "Neural Cache" that leverages bit-serial processing to improve the performance of deep-neural-network computation. The main idea is to store activations in a compact bit-serial format and perform bit-serial computations. Neural Cache can fully execute convolutional, fully connected, and pooling layers in cache, and it proposes a bit-serial implementation with a transposed data layout to address the challenge of complex calculations that require interaction between bitlines (see the sketch after this section). The authors show that this approach achieves high performance and energy efficiency for deep neural networks (inference latency improved by 18.3x over a state-of-the-art multi-core CPU and 7.7x over a server-class GPU; throughput by 12.4x over the CPU and 2.2x over the GPU; power reduced by 50% over the CPU and 53% over the GPU), and that it is robust to the choice of precision.

## Strengths

1. Extends SRAM computation to complex computing primitives, including addition, multiplication, and reduction, via a bit-serial implementation with a transposed data layout.
2. Identifies and explores a new application for SRAM-based computation: deep neural networks.
3. High efficiency, energy saving, and low latency.

## Weaknesses

1. Limited assumption of 8-bit precision and quantized inputs. The paper assumes 8-bit precision and quantized inputs. While the authors cite other research to support the claim that 8-bit precision provides sufficient accuracy for DNNs, other studies, such as "RecNMP", show that some commercial models are deliberately left uncompressed to maintain high accuracy and commercial benefit. The 8-bit assumption may therefore limit Neural Cache's accuracy and commercial applicability.
2. Extra overhead. It needs either programmer awareness or a new hardware TMU (transpose memory unit) to support the transposed data layout.

## Ideas

1. At what performance cost could we support 16/32-bit DNNs in cache?
2. Can we apply Neural Cache to real-time systems such as autonomous driving?
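To make the bit-serial/transposed idea concrete, here is a minimal sketch of my own (not the paper's circuit): each "bitline" holds one element, and addition proceeds one bit-plane at a time, LSB first, with a per-element carry.

```python
# Bit-serial addition over a transposed layout, in plain Python.
def transpose(values, bits=8):
    # planes[b][i] = bit b of values[i] -- the transposed layout.
    return [[(v >> b) & 1 for v in values] for b in range(bits)]

def bit_serial_add(a_planes, b_planes, n):
    carry = [0] * n
    out_planes = []
    for a_bits, b_bits in zip(a_planes, b_planes):  # one "cycle" per bit-plane
        s = [a ^ b ^ c for a, b, c in zip(a_bits, b_bits, carry)]
        carry = [(a & b) | (c & (a ^ b))
                 for a, b, c in zip(a_bits, b_bits, carry)]
        out_planes.append(s)
    out_planes.append(carry)  # final carry becomes the top bit-plane
    return out_planes

def untranspose(planes, n):
    return [sum(planes[b][i] << b for b in range(len(planes)))
            for i in range(n)]

xs, ys = [3, 100, 255], [5, 27, 1]
result = untranspose(bit_serial_add(transpose(xs), transpose(ys), 3), 3)
assert result == [8, 127, 256]
```

Note how the cycle count grows linearly with bit width: that is exactly why the fixed 8-bit precision matters, and what a 16/32-bit variant (Idea 1) would pay for.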
# 6 & 7. Side-Channel Attacks

https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9600458

## Summary

This paper investigates the vulnerability of resistive RAM (RRAM) in-memory computing (IMC) architectures to side-channel attacks (SCAs) for reverse engineering (RE) of intellectual property (IP) implemented within the memory. The authors demonstrate SCARE on two recent IMC architectures (DCIM and MAGIC) as test cases and show that the adversary can use power/timing templates and analysis to identify the structure of the implemented function. For each case, the authors show two attack models: for DCIM, leveraging the power drawn by the OR and AND arrays plus the precharge power of the AND array; for MAGIC, leveraging the OR/AND array power signature and operation time plus the pre-compute RRAM write operation times. The paper also overcomes process variation (PV) via statistical analysis of power/delay to filter out the correct function. The authors propose countermeasures such as redundant inputs and expansion of literals to protect against SCARE. The study provides insights into the susceptibility of IMC architectures to SCAs and offers potential mitigations.

## Strengths

1. Novelty. SCARE is the first work on RE of IMC-based IP.
2. Practicality and generalization. The method is non-invasive, and the two real-world test cases show it is practical to apply. Any function that can be implemented on DCIM or MAGIC can be targeted by SCARE, including an existing SHA-3 implementation.
3. Rigor. Countermeasures (redundant inputs and expansion of literals) are also provided.

## Weaknesses

1. The overhead-security trade-off makes the countermeasures not that practical. Although the RE effort against expansion of literals grows exponentially with the number of inputs, the "masked model" costs 36% power overhead yet raises the brute-force RE effort by only 3.04x, which is still worth doing for the adversary.
2. Limited in-RRAM computation primitives. The paper only discusses the "AND" and "OR" primitives in RRAM; however, some newer in-RRAM-computing architectures support more complex primitives (e.g., <REAL: Logic and Arithmetic Operations Embedded in RRAM for General-Purpose Computing>). In that case, would the power/time analysis still be able to detect the more complex primitives and their minterms?

## Idea

This paper uses power/time analysis for SCA on a PIM architecture; are other physical variables also useful for SCA on PIM? I found <Physical Side-Channel Attacks and Covert Communication on FPGAs: A Survey> for FPGAs, where temperature and electromagnetic emission are also taken into consideration, and some DES/AES FPGA designs are vulnerable.

https://arxiv.org/pdf/2209.02792.pdf

## Summary

This paper proposes a side-channel attack methodology on in-memory computing (IMC) architectures that extracts model architectural information from power-trace measurements without any prior knowledge of the neural network. The authors develop a simulation framework that emulates the dynamic power traces of the IMC macros and perform side-channel attacks to extract information such as the stored layer type, layer sequence, output channel/feature size, and convolution kernel size. They demonstrate that measurable electrical characteristics can still pose a security vulnerability even for mixed-signal IMC architectures. Finally, they discuss potential countermeasures for building IMC systems that resist these model-extraction attacks.

## Strengths

1. The first model-extraction attack on analog IMC-based DNN accelerators.
2. The authors provide a simulation framework that emulates time-dependent data, e.g., dynamic power traces, which can be extended in future work.
3. The paper discusses three classes of possible countermeasures.

## Weaknesses

1. Limited applicability and empirical results. The work focuses on IMC architectures in a TSMC 28 nm process and experiments only on simulators, so it may not apply to other IMC architectures or process technologies, and it is unclear how effective the attacks are in practice. I remember the professor mentioned that relying on a simulator can be a weakness; the authors argue it is impractical to run experiments on large datasets. However, some of the physical phenomena here happen within a few nanoseconds, and the real world adds noise and stray signals. I think real-world experiments, even on smaller datasets, could yield practical new insights (recall how the paper using the Facebook dataset found a new locality feature in the data).
2. Strong assumption. To neglect power blurring caused by inherent pipeline optimization, the paper assumes "the attacker has full control of the input and output data ports, and can halt the next input to the system until the previous inference completes to gain more accurate power measurements", which is possibly not reachable in practice.
3. The countermeasures are not tested in detail.

## Ideas

1. Several different physical channels can be used for FPGA SCA; why can't we use them on ASIC/PIM? (I couldn't find any paper using the thermal or electromagnetic-emission channel) --- because of the limitation of the SIMULATOR!
2. The process of telling an operation or data value apart from mixed signals in these two papers looks like a recognition or matching problem; can we use AI to figure it out? (<Deep learning side-channel attack against hardware implementations of AES>) A toy template-matching sketch follows this list.
# 8 & 9. PRIME & ISAAC

https://dl.acm.org/doi/pdf/10.1145/3007787.3001140

## Summary

This paper proposes a processing-in-memory (PIM) architecture called PRIME that utilizes resistive random access memory (ReRAM) to accelerate neural network (NN) applications. PRIME enables a portion of the ReRAM memory arrays to be configured as accelerators or as normal memory on demand, providing an efficient answer to the "memory wall" for future computer systems. The authors provide microarchitecture and circuit designs that enable the morphable functions with low area overhead, and propose an input and synapse composing scheme to overcome precision challenges (sketched in code after this section). They also develop a software/hardware interface that lets developers configure the full-function subarrays to implement various NNs. PRIME achieves significant improvements in performance and energy efficiency for various NN applications, including MLPs and CNNs, and distinguishes itself from prior NN-acceleration work by benefiting from both the PIM architecture and the efficiency of ReRAM-based NN computation.

## Strengths

1. Innovative architecture. PRIME is a novel PIM architecture that fully utilizes ReRAM to accelerate NNs by configuring a portion of the memory arrays as accelerators or normal memory on demand (compared to ISAAC, PRIME is an "in"-memory architecture rather than a "near"-memory one). Each PRIME bank is divided into three areas: memory (Mem) subarrays, full-function (FF) subarrays, and buffer subarrays. The Mem subarrays only store data, while the FF subarrays can both compute and store. The buffer subarrays act as a data cache for the FF subarrays and serve as regular memory when not caching. The Mem and buffer subarrays connect directly to the FF subarrays via private ports, which keeps buffer accesses from consuming Mem-subarray bandwidth.
2. Unbelievable performance: "PRIME improves the performance by ~2360x and the energy consumption by ~895x" compared to DaDianNao!

## Weaknesses

1. Precision. The paper acknowledges that precision is one of the main challenges when using ReRAM crossbar arrays for NN computation. Although the authors propose an input and synapse composing scheme to overcome this challenge, further research may be required to optimize precision while maintaining high performance and energy efficiency (e.g., ISAAC achieves full 16-bit precision).
2. Possibly limited generalizability. The paper presents experimental results on standard MLPs and CNNs; can we extend this architecture to handle unusual NNs with trivial effort?
3. Lack of discussion of fault detection/tolerance/correction. ReRAM is known to suffer a high occurrence rate of memory faults.

## Ideas

1. Some papers show that once convolution has been well optimized, the FC (fully connected) layer becomes the main bottleneck (similar to the relation between SpMV and sparse transposition).
2. Fault protection for ReRAM? <On-Line Fault Protection for ReRAM-Based Neural Networks>
3. Can we apply ISAAC's 16-bit mechanism to PRIME? Is fixed 16-bit better than PRIME's flexible bit precision?
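To illustrate the composing idea from the summary, here is a minimal sketch of my reading of it (not PRIME's circuit): a higher-precision dot product assembled from low-precision weight cells and low-precision input cycles. The 4-bit cell and 3-bit input widths are illustrative assumptions.

```python
# Composing an 8-bit-weight, 6-bit-input dot product from small slices.
CELL_BITS, IN_BITS = 4, 3   # assumed per-cell and per-cycle precision

def split(value, chunk_bits, chunks):
    """Split an unsigned value into low-to-high chunks of chunk_bits each."""
    mask = (1 << chunk_bits) - 1
    return [(value >> (i * chunk_bits)) & mask for i in range(chunks)]

def crossbar_dot(inputs, weights):
    # What one analog crossbar column does: sum of (input voltage x conductance).
    return sum(i * w for i, w in zip(inputs, weights))

def composed_dot(xs, ws, in_chunks=2, w_chunks=2):
    total = 0
    for wi in range(w_chunks):              # one set of cells per weight slice
        w_slice = [split(w, CELL_BITS, w_chunks)[wi] for w in ws]
        for xi in range(in_chunks):         # one input cycle per input slice
            x_slice = [split(x, IN_BITS, in_chunks)[xi] for x in xs]
            partial = crossbar_dot(x_slice, w_slice)
            # Shift each partial result back to its arithmetic weight.
            total += partial << (wi * CELL_BITS + xi * IN_BITS)
    return total

xs, ws = [5, 63, 17], [200, 9, 255]
assert composed_dot(xs, ws) == sum(x * w for x, w in zip(xs, ws))
```

The shift-and-accumulate step is the whole trick: precision comes from composition, at the cost of more cells and more cycles, which is the trade-off Weakness 1 points at.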
https://dl.acm.org/doi/pdf/10.1145/3007787.3001139

## Summary

This paper proposes an in-situ processing approach to accelerate machine-learning algorithms, using memristor crossbar arrays to store the weights and perform analog dot-product operations. The authors design a pipelined architecture, define new data-encoding techniques, and identify the best on-chip balance of storage/compute, ADCs, and eDRAM. The proposed ISAAC architecture yields significant improvements in throughput, energy, and computational density compared to the state-of-the-art DaDianNao architecture. However, a balanced inter-layer pipeline, efficient handling of signed arithmetic, and bit-encoding schemes are required to manage the high overheads of ADCs, DACs, and eDRAM. On average, for a 16-chip configuration, ISAAC yields 14.8x higher throughput than DaDianNao. The paper highlights the potential of analog computing for neural-network acceleration, but also identifies several challenges that must be addressed to realize a full-fledged CNN architecture.

## Strengths

1. Significant improvements in multiple aspects: throughput, energy, and computational density compared to the state-of-the-art DaDianNao architecture.
2. Utilizes **in-situ analog arithmetic in crossbars**, providing higher computational density and energy efficiency, which is essential for neural-network acceleration.
3. Addresses many practical challenges: a **pipelined** design; **data encoding** to reduce DAC conversion overhead; how to support **16-bit precision**; and so on.

## Weaknesses

1. Because of the replication requirement ("replicate the weights in early layers several times to construct a balanced pipeline"; "in fact, in some benchmarks, the first layer has to be replicated more than 50K times to keep the last layer busy in every cycle"), ISAAC operates at low utilization in the last layers.
2. Analog circuits, and ReRAM as well, may have a higher fault rate than digital circuits and other types of RAM; however, the paper does not detail any fault-tolerance/correction mechanism.
3. Power-efficiency bottleneck: too much energy is spent on the ADCs.

## Idea

Can we exploit the precision-energy trade-off to greatly improve ADC power efficiency? <AEPE: An Area and Power Efficient RRAM Crossbar-based Accelerator for Deep CNNs> (it also improves the on-chip network/eDRAM and some other aspects).

# 10. SecNDP

https://ieeexplore.ieee.org/document/9773244

## Summary

The paper introduces SecNDP, a lightweight encryption and verification scheme that supports computation over ciphertext and verifies the correctness of linear operations. The architecture leverages secure multi-party computation (MPC) and counter-mode encryption to reduce decryption latency and to support computation over encrypted data stored in memory or storage (a toy sketch of the sharing idea follows). SecNDP significantly reduces memory-bandwidth usage while providing security guarantees, and it can be implemented without changing the near-data processing (NDP) protocols or their inherent hardware design. The paper demonstrates SecNDP's performance benefits, accuracy, and energy savings on two real-world data-intensive use cases, showing that it can match the speedup delivered by unprotected NDP. The authors show that SecNDP enables a Trusted Execution Environment (TEE) in the presence of untrusted memory to leverage the performance and energy benefits of NDP securely.
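Here is a minimal sketch of the general sharing idea as I understand it (not SecNDP's exact scheme): counter-mode-style additive secret sharing, where the untrusted side computes a linear operation over ciphertext and the host cancels the keystream. The toy SHA-256 "keystream" stands in for real AES counter mode.

```python
# Additive secret sharing over a toy counter-mode keystream, mod 2^32.
import hashlib

MOD = 2 ** 32

def keystream(key: bytes, counter: int) -> int:
    h = hashlib.sha256(key + counter.to_bytes(8, "little")).digest()
    return int.from_bytes(h[:4], "little")  # one 32-bit keystream word

key = b"secret-key"
data = [10, 20, 30, 40]                     # plaintext, known only to the host
cipher = [(d + keystream(key, i)) % MOD     # ciphertext share, lives in memory
          for i, d in enumerate(data)]

weights = [1, 2, 3, 4]

# Untrusted NDP unit computes the linear op directly over ciphertext:
ndp_result = sum(w * c for w, c in zip(weights, cipher)) % MOD

# Host computes the same linear op over the keystream (no data access needed)
# and subtracts it, recovering the plaintext result:
host_share = sum(w * keystream(key, i) for i, w in enumerate(weights)) % MOD
plaintext_result = (ndp_result - host_share) % MOD

assert plaintext_result == sum(w * d for w, d in zip(weights, data))
```

The cancellation only works because the operation is linear, which is exactly why Idea 1 below asks what happens beyond addition and multiplication.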
## Strengths

1. High performance and energy efficiency. SecNDP significantly reduces memory-bandwidth usage while providing security guarantees, and can match the speedup delivered by unprotected NDP.
2. Demonstrated performance benefits. The authors show SecNDP's performance and accuracy on two real-world data-intensive use cases.
3. No modification of NDP protocols. The SecNDP architecture can be implemented without changing the NDP protocols or their inherent hardware design.
4. No additional security assumptions. SecNDP does not require any security assumptions on NDP to hold, not even a non-collusion assumption among NDP PUs.

## Weaknesses

1. Processor changes required. SecNDP requires some small changes in the processor, which could limit its applicability in certain scenarios.
2. Limitations of arithmetic secret sharing. If the original data is in a floating-point format, it must be quantized into fixed-point numbers or integers, which may not be suitable for all workloads.

## Ideas

1. Computation beyond addition and multiplication. It is unclear whether SecNDP still works if the computation goes beyond addition and multiplication, or whether non-add/multiply operations must first be converted into add/multiply form.
2. Security-performance trade-off. The paper may hide some security-performance trade-offs, such as the use of counter-mode encryption, which is not the safest mode. Nonlinear hashing methods, such as MD5, may be more secure than linear ones like the one used in SecNDP.

# 11. Dynamic multi-resolution data storage

https://dl.acm.org/doi/pdf/10.1145/3352460.3358282

## Summary

Varifocal Storage is a dynamic multi-resolution storage system that improves the performance, quality, flexibility, and cost of computer systems for diverse application demands. It tackles the challenges of approximate computing by dynamically adjusting dataset resolution within the storage device. Varifocal Storage introduces the Autofocus and iFilter mechanisms to provide quality control and flexibility (a toy sketch follows the strengths list), and supports both approximate and exact computing without incurring the costs of conventional storage systems. It leverages the existing processing-in-storage capability of SSDs and requires minimal additional software or hardware modification. In experiments, Varifocal Storage demonstrates a significant reduction in overhead and latency and an overall speedup of 1.52x over conventional approximate-computing architectures. Its flexibility and scalability make it suitable for handling different data types and structures. The paper provides a novel approach to approximate computing that optimizes the utilization of intelligent memory while maintaining performance and quality control.

## Strengths

1. The paper recognizes that transmitting raw (high-resolution) data is a bottleneck in approximate computing, and addresses this issue by dynamically adjusting dataset resolution within the storage device.
2. Varifocal Storage fully utilizes the existing processing-in-storage capability of SSDs, minimizing the need for additional software or hardware modifications.
3. Varifocal Storage is highly scalable and supports the definition of new operators, allowing flexibility in handling user-defined situations.
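Here is a minimal sketch of my own (an assumption, not Varifocal Storage's code) of Autofocus-style in-storage resolution reduction with a prefix-sample quality check; the byte-decimation "resolution", the error metric, and the 8-page sample size are illustrative choices taken from the paper's description.

```python
# In-storage downsampling gated by a quality check on the first few pages.
PAGE = 4096          # bytes per page
SAMPLE_PAGES = 8     # quality is estimated on the first 8 pages only

def downsample(block: bytes, keep_every: int) -> bytes:
    """Toy resolution reduction: keep every k-th byte."""
    return block[::keep_every]

def sample_error(block: bytes, keep_every: int) -> float:
    """Mean absolute error of reconstructing the sampled prefix."""
    prefix = block[: SAMPLE_PAGES * PAGE]
    low = downsample(prefix, keep_every)
    recon = bytes(low[min(i // keep_every, len(low) - 1)]
                  for i in range(len(prefix)))
    return sum(abs(a - b) for a, b in zip(prefix, recon)) / len(prefix)

def autofocus(block: bytes, max_error: float) -> bytes:
    """Pick the coarsest resolution whose sampled error stays within budget."""
    for keep_every in (8, 4, 2):
        if sample_error(block, keep_every) <= max_error:
            return downsample(block, keep_every)
    return block  # fall back to exact data

blob = bytes(range(256)) * 64                 # 16 KB of smooth toy data
print(len(autofocus(blob, max_error=16.0)))   # coarsened to 2 KB
```

Note how a block whose first pages are all zeros would trivially pass this check while the rest of the data might not: that is the sparse-data corner case raised in the weakness below.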
## Weaknesses

1. The paper believes that "a small subset of input data is representative of the rest of the input data in approximate-computing applications that tolerate inaccuracies" and consequently uses the first 8 pages for quality control. However, there may be corner cases where this assumption does not hold. For example, in sparse datasets, the first pages may be all zeros, which satisfies any quality-control mechanism yet does not accurately represent the rest of the dataset. Therefore, while quality control based on the first eight pages of input is generally effective, further research may be necessary to determine its effectiveness on more complex datasets.

## Idea

The ten PIM articles we have read in this class mostly propose entirely new hardware architectures (e.g., ASICs) or new mechanisms (e.g., cache compute) to accelerate tasks. In contrast, this paper only utilizes the computing power of existing intelligent memory, which is inspiring and probably also easy to adopt widely in practice.

# 12. INSIDER

https://www.usenix.org/system/files/atc19-ruan_0.pdf

## Summary

The paper introduces INSIDER, a framework for designing in-storage computing systems for high-performance drives. INSIDER provides a flexible, programmable approach in which computation is integrated directly into the storage device: custom compute modules can be designed and plugged into the drive, with efficient data access and programming in higher-level languages. INSIDER is designed to support a variety of applications, including data analytics and machine learning, and offers advantages such as lower power consumption, reduced data movement, and higher throughput compared to traditional computing systems. However, the framework requires hardware and software modifications to support in-storage computing, which may limit its adoption by some storage vendors. Despite this limitation, INSIDER provides a promising approach to improving the performance and efficiency of storage-side computing.

## Strengths

1. High performance. INSIDER successfully crosses the "data movement wall" and fully utilizes the high drive performance.
2. End-to-end design, including software support. INSIDER provides simple but effective abstractions for programmers and offers the necessary system support for a shared execution environment.
3. Open source.

## Weaknesses

1. Lack of detailed comparison with other FPGA-based in-storage-processing systems; the comparison is mainly against ARM cores.
2. Lack of concurrency control (locking) for some race conditions.

## Idea

This paper presents a substantial and comprehensive piece of work comprising approximately **14K** lines of code (written in C++, HDL, and scripting languages), which indicates how difficult it is to build a new system. The advantages claimed in the opening parts of the paper seem to be the inherent advantages of FPGAs over ASICs. For me, I'm more interested in the design wisdom (e.g., pipelining/overlapping, FIFOs, and the round-robin scheduler; a toy round-robin sketch follows).
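As an illustration of one piece of that design wisdom, here is a minimal sketch of my own (not INSIDER's HDL): a round-robin scheduler draining per-application FIFOs, so that one slow in-storage task cannot starve the others.

```python
# Round-robin scheduling over per-application FIFO queues.
from collections import deque

class RoundRobinScheduler:
    def __init__(self):
        self.queues: dict[str, deque] = {}

    def submit(self, app: str, task):
        """Enqueue a task into the FIFO of its owning application."""
        self.queues.setdefault(app, deque()).append(task)

    def run(self):
        """Serve one task per app per round until all FIFOs drain."""
        while any(self.queues.values()):
            for app, q in list(self.queues.items()):
                if q:
                    task = q.popleft()
                    print(f"[{app}] {task}")  # stand-in for issuing to hardware

sched = RoundRobinScheduler()
for i in range(3):
    sched.submit("grep", f"chunk-{i}")
    sched.submit("knn", f"batch-{i}")
sched.run()  # interleaves grep and knn fairly, one task per round
```

The same fairness idea generalizes to hardware: replace the Python queues with on-chip FIFOs and the `print` with a dispatch to a compute slot.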