# Cognitive SSD: A Deep Learning Engine for In-Storage Data Retrieval
###### tags: `Openssd` `Paper` `NMC`
## Traditional arch. vs. Cognitive SSD

- When a data retrieval request arrives from the Internet or a central server, the CPU has to load massive amounts of candidate data from disk into DRAM [14] and match the query's features against those of the loaded unstructured data to find the relevant targets
- The current I/O software stack significantly burdens the data retrieval system, even though it simply fetches data from the storage devices on retrieval requests
- Massive data movement incurs energy and latency overhead in the conventional memory hierarchy
- In the in-storage approach, retrieval requests are sent directly to the storage devices, and the target data analysis and indexing are performed entirely where the unstructured data resides
## Contributions
- Propose Cognitive SSD, which enables within-SSD deep learning and graph search by integrating a specialized deep learning and graph search accelerator (DLG-x).
- The DLG-x accesses data directly from NAND flash without crossing multiple memory hierarchies, shortening the data movement path and reducing power consumption.
- Use Cognitive SSD to build a serverless data retrieval system that completely abandons the conventional data query mechanism of orthodox compute-centric systems.
- Reduce the hardware and power overhead of large-scale storage nodes in data centers.
## Unstructured Data Retrieval System

## Cognitive SSD System


## The Cognitive SSD Software: DLG Library
### Configuration Library
- The configuration library provides a DLG-x compiler compatible with popular deep learning frameworks (e.g., Caffe)
- It allows the administrator to train a new deep learning model and generate the corresponding DLG-x instructions offline
- The administrator can update the deep learning model running on the Cognitive SSD by updating the DLG-x instructions
- The updated instructions are sent to the instruction area allocated in NAND flash and stay there until a model change command (DLG_config) is issued
- The DLG-x compiler also reorganizes the data layout of the DLG algorithm to fully utilize the internal flash bandwidth
- The physical addresses of the weights and graph structure information are recorded in the DLG-x instructions, so the DLG-x obtains the physical address of required data directly at runtime
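The compile-time layout step can be sketched as follows. This is a minimal illustration, not the paper's actual encoding: the instruction fields, page size, and `compile_layers` helper are all hypothetical stand-ins for the idea that physical flash addresses are baked into the instruction stream offline, so no address translation is needed at runtime.

```python
from dataclasses import dataclass

# Hypothetical flash geometry: pages of 16 KiB, addressed by page number.
PAGE_SIZE = 16 * 1024

@dataclass
class DlgInstruction:
    """One DLG-x instruction: the physical flash location of its operands
    is recorded at compile time, so no FTL lookup is needed at runtime."""
    opcode: str          # e.g. "CONV" (illustrative opcode names)
    weight_ppa: int      # physical page address of the layer weights
    weight_pages: int    # number of flash pages the weights occupy

def compile_layers(layer_sizes, base_ppa=0):
    """Lay out each layer's weights contiguously in flash and record the
    resulting physical addresses directly in the instruction stream."""
    instrs, ppa = [], base_ppa
    for size in layer_sizes:
        pages = -(-size // PAGE_SIZE)            # ceiling division
        instrs.append(DlgInstruction("CONV", ppa, pages))
        ppa += pages
    return instrs

# Two layers with 20 KiB and 64 KiB of weights.
program = compile_layers([20 * 1024, 64 * 1024])
```

At runtime the accelerator reads `weight_ppa` straight out of each instruction and issues the flash read itself, which is what lets it bypass the FTL and the internal DRAM.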
### User Library
#### Data plane
- Provides SSD_read and SSD_write APIs for users to control data transmission between the host server and the Cognitive SSD.
- These two commands operate directly on physical addresses, bypassing the flash translation layer
#### Task plane
- Three APIs in the task plane of the user library: DLG_hashing, DLG_index, and DLG_analysis
- They are implemented as the C0h, C1h, and C2h commands of the NVMe I/O protocol
- All of them carry two basic parameters in NVMe protocol DWords: the data address, indicating the data's location in the Cognitive SSD, and the data size in bytes
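Packing such a command can be sketched like this. The opcode values C0h-C2h come from the paper; the specific DWord positions chosen for the address and size parameters are an assumption for illustration, since the paper does not spell out the exact layout.

```python
import struct

# Paper-given opcodes for the three task-plane APIs.
OPCODES = {"DLG_hashing": 0xC0, "DLG_index": 0xC1, "DLG_analysis": 0xC2}

def build_task_command(api: str, data_addr: int, size_bytes: int) -> bytes:
    """Pack a 64-byte NVMe submission-queue entry for a DLG task command.
    DWord positions for the two parameters are hypothetical."""
    cdw0 = OPCODES[api]                 # opcode in bits 7:0 of CDW0
    return struct.pack(
        "<16I",
        cdw0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        data_addr & 0xFFFFFFFF,         # CDW10: data address (low 32 bits)
        data_addr >> 32,                # CDW11: data address (high 32 bits)
        size_bytes,                     # CDW12: data size in bytes
        0, 0, 0,
    )
```

An NVMe submission-queue entry is sixteen 32-bit DWords (64 bytes), which is why the two parameters fit naturally into the command itself without a separate data buffer.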
### Cognitive SSD runtime
- Manages incoming extended I/O commands arriving via the PCIe interface
- Converts the API-related commands into machine instructions for the DLG-x accelerator
- Includes a request scheduler and the basic firmware
- The request scheduler contains three modules: the DLG task scheduler, the I/O scheduler, and the DLG configurator
- The DLG task scheduler responds to user requests as supported in the task plane and initiates the corresponding task session in the Cognitive SSD
- The I/O scheduler dispatches I/O requests to either the basic firmware or the DLG-x
- The DLG configurator receives DLG_config commands from the host and updates the compiler-generated instructions and parameters of the specified deep learning model for the Cognitive SSD
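The three-way routing above can be sketched as a small dispatcher. The opcode ranges and the `0xC3` value for DLG_config are assumptions for illustration; the paper only fixes C0h-C2h for the task-plane APIs.

```python
from collections import deque

# Hypothetical opcode assignment: C0h-C2h are the DLG task commands from
# the paper; 0xC3 for DLG_config is assumed here for illustration.
DLG_TASKS = {0xC0, 0xC1, 0xC2}
DLG_CONFIG = 0xC3

class RequestScheduler:
    """Routes each incoming command to the DLG task scheduler, the DLG
    configurator, or the basic firmware, mirroring the three modules of
    the Cognitive SSD runtime's request scheduler."""
    def __init__(self):
        self.task_queue = deque()   # DLG task scheduler: pending sessions
        self.io_log = []            # stands in for the basic firmware path
        self.model_updates = []     # DLG configurator: instruction updates

    def dispatch(self, opcode, payload):
        if opcode in DLG_TASKS:
            self.task_queue.append((opcode, payload))   # queue a DLG session
            return "dlg-x"
        if opcode == DLG_CONFIG:
            self.model_updates.append(payload)          # swap model instrs
            return "configurator"
        self.io_log.append((opcode, payload))           # ordinary read/write
        return "firmware"
```

The point of the split is that ordinary NVMe reads and writes never touch the accelerator path, while DLG commands never pass through the block-I/O firmware.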
## Hardware Architecture: Cognitive SSD
- It is composed of an embedded processor running the Cognitive SSD runtime, a DLG-x accelerator, and NAND flash controllers connected to flash chips.
- Although SSDs often have a compact DRAM to cache data or metadata, its capacity can hardly satisfy the demands of deep learning workloads
- The DLG-x therefore reads and writes its working data directly from NAND flash, bypassing the internal DRAM.
### The Procedure of data retrieval in Cognitive SSD
1. Assume the hardware instructions and parameters of the Hash-AlexNet model have been generated and written to the corresponding region beforehand, using the DLG-x compiler and the DLG_config command
2. The host DLG library captures a retrieval request
3. It packages the user input data and writes it from the designated host memory space to the Cognitive SSD through SSD_write
4. A DLG_hashing command carrying the address of the input data is sent to the Cognitive SSD for hash code generation
5. The request scheduler of the Cognitive SSD runtime parses it and notifies the DLG-x accelerator to start a hashing feature-extraction session
6. The DLG-x automatically fetches the input query data from the command-specified data address and then loads the deep learning parameters from NAND flash
7. Meanwhile, the other command, DLG_index, is sent and queued by the task scheduler
8. After the hash code is produced, DLG_index is dispatched to invoke the graph search function in the DLG-x, which uses the hash result to search the data graphs for relevant data entries
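The host-side view of this procedure can be sketched against a mock device. The class and method names below are stand-ins invented for illustration; the hashing step substitutes a trivial hash for Hash-AlexNet, and the index step substitutes a nearest-neighbor scan by Hamming distance for the paper's graph search.

```python
class MockCognitiveSSD:
    """Toy model of the device side, for walking through the call sequence."""
    def __init__(self):
        self.flash = {}

    def ssd_write(self, addr, data):
        self.flash[addr] = data                 # physical-address write, no FTL

    def dlg_hashing(self, addr):
        # Stand-in for Hash-AlexNet feature extraction on the DLG-x:
        # reduce the stored input to a short binary hash code.
        return hash(self.flash[addr]) & 0xFFFF

    def dlg_index(self, code, database):
        # Stand-in for the graph search: return the database entry whose
        # hash code is closest to the query code in Hamming distance.
        return min(database, key=lambda c: bin(code ^ c).count("1"))

def retrieve(ssd, query, database, scratch_addr=0x1000):
    ssd.ssd_write(scratch_addr, query)          # 1. ship query to the SSD
    code = ssd.dlg_hashing(scratch_addr)        # 2. DLG_hashing -> hash code
    return ssd.dlg_index(code, database)        # 3. DLG_index -> nearest entry
```

Note what the host never does here: it never reads candidate data back. Only the query and the final result cross the PCIe link, which is the serverless-retrieval point of the design.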
## DLG-x Accelerator
- The DLG-x accelerator is designed to obtain the majority of its working-set data directly from NAND flash

### Data Flow
- When a request arrives at the DLG-x accelerator, the input data and the first kernel of the first convolution layer are transferred in parallel to InOut buffer-0 and weight buffer-0
- The DLG-x accelerator then computes the output feature maps and stores them into InOut buffer-1
- While the first kernel is being processed, the second kernel is transferred from NAND flash to weight buffer-1, so computation overlaps with data transfer
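The alternation between the two weight buffers is classic ping-pong (double) buffering, which can be sketched as below. In hardware the prefetch and the compute proceed in parallel; this sequential sketch only models the ordering, and `fetch`/`compute` are hypothetical callbacks, not the accelerator's real interface.

```python
def double_buffered_conv(kernels, fetch, compute):
    """Process kernels with two alternating weight buffers: while buffer i%2
    is being computed on, the next kernel is prefetched into the other
    buffer. fetch(k) models a NAND-flash read; compute(buf) models one
    kernel's worth of convolution."""
    buffers = [None, None]
    buffers[0] = fetch(kernels[0])               # prime buffer-0
    results = []
    for i in range(len(kernels)):
        if i + 1 < len(kernels):
            buffers[(i + 1) % 2] = fetch(kernels[i + 1])  # prefetch next
        results.append(compute(buffers[i % 2]))           # compute current
    return results
```

With this schedule the flash channels and the compute units are both kept busy, so the NAND read latency of each kernel is hidden behind the computation of the previous one.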