# Cognitive SSD: A Deep Learning Engine for In-Storage Data Retrieval

###### tags: `Openssd` `Paper` `NMC`

## Traditional arch. vs. Cognitive SSD

![](https://i.imgur.com/7vmHD9w.png)

- When a data retrieval request arrives from the internet or the central server, the CPU has to reload massive amounts of candidate data from disk into the temporary DRAM [14] and match the features of the query against those of the loaded unstructured data to find the relevant targets.
- The current I/O software stack significantly burdens the data retrieval system even when it simply fetches data from the storage devices on retrieval requests.
- Massive data movement incurs energy and latency overhead in the conventional memory hierarchy.
- In the Cognitive SSD design, retrieval requests are sent directly to the storage devices, and the target data analysis and indexing are performed entirely where the unstructured data resides.

## Contributions

- Propose Cognitive SSD, which enables within-SSD deep learning and graph search by integrating a specialized deep learning and graph search accelerator (DLG-x).
  - The DLG-x directly accesses data from NAND flash without crossing multiple memory hierarchies, shortening the data movement path and reducing power consumption.
- Use Cognitive SSD to build a serverless data retrieval system that completely abandons the conventional data query mechanism of orthodox compute-centric systems.
- Reduces the hardware and power overhead of large-scale storage nodes in data centers.
## Unstructured Data Retrieval System

![](https://i.imgur.com/drN5fFe.png)

## Cognitive SSD System

![](https://i.imgur.com/p7jpmmm.png)
![](https://i.imgur.com/c672ZQ2.png)

## The Cognitive SSD Software: DLG Library

### Configuration Library

- The configuration library provides a DLG-x compiler compatible with popular deep learning frameworks (e.g., Caffe).
  - It allows the administrator to train a new deep learning model and generate the corresponding DLG-x instructions offline.
- The administrator can update the learning model running on the Cognitive SSD by updating the DLG-x instructions.
  - The updated instructions are sent to the instruction area allocated in the NAND flash and stay there until a model change command (DLG_config) is issued.
- The DLG-x compiler also reorganizes the data layout of the DLG algorithm to fully utilize the internal flash bandwidth.
- The physical addresses of the weights and the graph structure information are recorded in the DLG-x instructions, so the DLG-x obtains the physical address of the required data directly at runtime.

### User Library

#### Data plane

- Provides SSD_read and SSD_write APIs for users to control data transmission between the host server and the Cognitive SSD.
- These two commands operate directly on physical addresses, bypassing the flash translation layer.

#### Task plane

- Three APIs in the task plane of the user library: DLG_hashing, DLG_index, and DLG_analysis.
  - Mapped to the C0h, C1h, and C2h commands of the NVMe I/O protocol.
- All of them carry two basic parameters in NVMe protocol DWords: the data address indicating the data location in the Cognitive SSD, and the data size in bytes.

### Cognitive SSD Runtime

- Manages the incoming extended I/O commands arriving via the PCIe interface.
- Converts the API-related commands into machine instructions for the DLG-x accelerator.
- Includes a request scheduler and the basic firmware.
- The request scheduler contains three modules: the DLG task scheduler, the I/O scheduler, and the DLG configurator.
  - The DLG task scheduler responds to user requests as supported in the task plane and initiates the corresponding task session in the Cognitive SSD.
  - The I/O scheduler dispatches I/O requests to the basic firmware or the DLG-x.
  - The DLG configurator receives DLG_config commands from the host and updates the compiler-generated instructions and the parameters of the specified deep learning model for the Cognitive SSD.

## Hardware Architecture: Cognitive SSD

- Composed of an embedded processor running the Cognitive SSD runtime, a DLG-x accelerator, and NAND flash controllers connected to flash chips.
- Though SSDs often have compact DRAM to cache data or metadata, the internal DRAM capacity can hardly satisfy the demands of deep learning.
  - The DLG-x therefore reads and writes the related working data directly from NAND flash, bypassing the internal DRAM.
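The task-plane commands above can be pictured as ordinary vendor-specific NVMe commands. The paper specifies only the opcodes (C0h–C2h) and that each command carries a data address and a size in bytes via NVMe DWords; the exact DWord layout in this sketch is an assumption for illustration, and `dlg_make_cmd` is a hypothetical helper, not part of the DLG library.

```c
#include <stdint.h>

/* Vendor opcodes named in the paper for the three task-plane APIs. */
enum dlg_opcode {
    DLG_HASHING  = 0xC0,  /* hashing feature extraction on the DLG-x */
    DLG_INDEX    = 0xC1,  /* graph search using the produced hash code */
    DLG_ANALYSIS = 0xC2,  /* full in-storage data analysis task */
};

/* Hypothetical command layout: which DWords hold the address and size
 * is our assumption; the paper only says both travel in NVMe DWords. */
struct dlg_cmd {
    uint8_t  opcode;  /* one of the vendor opcodes above */
    uint32_t cdw10;   /* data address in Cognitive SSD, low 32 bits (assumed) */
    uint32_t cdw11;   /* data address, high 32 bits (assumed) */
    uint32_t cdw12;   /* data size in bytes (assumed) */
};

/* Build a task-plane command from a data location inside the SSD. */
static struct dlg_cmd dlg_make_cmd(uint8_t opcode, uint64_t addr, uint32_t size)
{
    struct dlg_cmd c;
    c.opcode = opcode;
    c.cdw10  = (uint32_t)(addr & 0xFFFFFFFFu);  /* split the 64-bit address */
    c.cdw11  = (uint32_t)(addr >> 32);          /* across two DWords */
    c.cdw12  = size;
    return c;
}
```

A host-side caller would first place the query data with SSD_write, then issue something like `dlg_make_cmd(DLG_HASHING, data_addr, data_size)` to start hash code generation at that location.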
### The Procedure of Data Retrieval in Cognitive SSD

- Assume that the hardware instructions and parameters of the Hash-AlexNet model have already been generated and written to the corresponding region via the DLG-x compiler and the DLG_config command.
- The host DLG library captures a retrieval request.
- It packages and writes the user input data from the designated host memory space to the Cognitive SSD through SSD_write.
- A DLG_hashing command carrying the address of the input data is sent to the Cognitive SSD for hash code generation.
- The request scheduler of the Cognitive SSD runtime parses it and notifies the DLG-x accelerator to start a hashing feature extraction session.
- The DLG-x automatically fetches the input query data from the command-specified data address and then loads the deep learning parameters from NAND flash.
- Meanwhile, the other command, DLG_index, is sent and queued by the task scheduler.
- After the hash code is produced, DLG_index is dispatched to invoke the graph search function in the DLG-x, using the hash result to search the data graphs for relevant data entries.

## DLG-x Accelerator

- The DLG-x accelerator is designed to obtain the majority of its working-set data directly from NAND flash.

![](https://i.imgur.com/DVermz4.png)

### Data Flow

- When a request arrives at the DLG-x accelerator:
  - The input data and the first kernel of the first convolution layer are transferred in parallel to InOut buffer-0 and weight buffer-0.
  - The DLG-x accelerator computes the output feature maps and stores them in InOut buffer-1.
  - While the first kernel is being processed, the second kernel is transferred from the NAND flash to weight buffer-1.
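The data flow above is a classic double-buffering (ping-pong) scheme: while the accelerator computes with one weight buffer, the flash controller fills the other, so NAND transfers overlap with computation. A minimal sketch, assuming the buffers simply alternate per kernel; the function and field names are illustrative, not from the paper.

```c
/* Which of the two weight buffers each role uses while kernel k runs.
 * This alternation scheme is inferred from the description above:
 * kernel 0 computes from weight buffer-0 while buffer-1 is filled,
 * then the roles swap for kernel 1, and so on. */
struct dlgx_step {
    int compute_weight_buf;   /* buffer the compute array reads weights from */
    int prefetch_weight_buf;  /* buffer the flash controller is filling */
};

static struct dlgx_step dlgx_schedule_kernel(int kernel)
{
    struct dlgx_step s;
    s.compute_weight_buf  = kernel % 2;        /* 0, 1, 0, 1, ... */
    s.prefetch_weight_buf = (kernel + 1) % 2;  /* always the opposite buffer */
    return s;
}
```

The InOut buffers alternate in the same ping-pong fashion across layers: a layer reads its input feature map from one InOut buffer and writes its output into the other.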