# Key-Value Stroages **Development Envirocment:** * OS : Ubuntu 18.04.1 * CPU: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz * Menory: 16G * Programming Language(version): C++ # Program development and usage instructions ## Program development I developed this assignment in C++, and it is divided into three parts. The first and third parts are related to accessing and reading archives, and the second part is mainly for judging PUT, GET, and SCAN. In terms of optimizing memory access, I use LRU as my main algorithm, and optimize the rest of the algorithm based on it. I will explain it later. ## Instructions for use ```bash= #compile g++ -o kvs key_value.cpp #run ./kvs input file path ``` Let me explain here because the teaching assistant does not specify the file name but only the sub-file name, so I use the character in front of . as the name. For example: * 1.input ===> 1.ouput * input.input ===> t.input # performance analysis > Because the size of the test capital is limited, I mainly use my own optimization methods to make reports First of all, I use unorder_map as a container for storing key-values. Here, because unorder_map is out of order, I must add an additional list to know the order between keys before knowing which key to pop. * **LRU algorithm is mainly the same as the figure below** ![](https://i.imgur.com/dVhThuV.jpg) ![](https://i.imgur.com/Ed9YHgq.jpg) My main structure is that when I add a new set of key-value or read a set of key-value, I will push the used element to the front of the whole list so that the element behind must be the least frequently used Used, so when the buffer is full, you can directly pop the last element, and then save it back to the disk.tmp file. But one disadvantage of LRU is that when you pop the value, you have to save the value back to the disk, so when the number of elements increases, the performance will slow down a lot. So I added a dirty bit to the list in order to reduce the operation on the disk. When I PUT, the dirty bit is uniformly true because it is a direct new value that may be different from the value of the disk, so it must be popped Save it back to the disk, but if I GET or SCAN, it may happen that the key is not in the buffer, so I have to save the value back from the disk. At this time, when the key is popped again, I don’t need to save it back to the disk because of memory It is the same as the implementation of the disk. The main purpose here is to judge whether the data in the memory and the disk are inconsistent. If they are inconsistent, write back is required. | key | dirty bit | -------- | -------- | | key1 | true | | key2 | false | | key3 | true | The structure is probably like this Next, my idea is to reduce the operation time of the disk, because I don’t know the position of the key in the disk. If I want to save it back to the disk, I have to slowly retrieve the entire disk file, which will be very slow, so I decided to use a map to save it. Take the corresponding position, the structure is as follows. | key | position | -------- | -------- | | key1 | position1 | | key2 | position2 | | key3 | position3 | ### like this ! ![](https://d2vlcm61l7u1fs.cloudfront.net/media%2Fc48%2Fc4866f29-c66e-4128-b3e7-10ef811fae4b%2FphpeyVCFA.png) In this way, you can directly find the location of the key and save a lot of time. Finally, because there is another situation where the key is not in the buffer or in the disk position map, in this case, it is really necessary to find out whether there is a corresponding key one by one, replace it if there is one, and add it if not. Lastly, my final optimization is trying to improve this. So I divide the disk into nine parts and use the leftmost number to classify, so that when I want to retrieve the disk value step by step, I can spend less time to complete it. The structure is like this | disk1 | disk2 | disk3 | ------ | ----- |----- | | 1232433|2432521|3131521 | 1234526|2435325|3123421 | 1324234|2341413|3923542 By analogy like this, then the disk position will change to record the value of the disk to which the key belongs. Finally, for other optimizations, I tried to use mmap to do file search, but the speed is not faster than using fwrite to write binary files, so all my tmp files are implemented as binary files to get the maximum benefit. The method of temporary storage is to pop all the elements in the list after all are finished to ensure that all the values ​​will be in the disk, and then store the corresponding disk position, so that in the future, you can load data according to the value in the disk position The location value can speed up the load. * **Help from OS** The help brought by the OS this time is limited. I think the OS will only help to speed up when searching the disk.tmp file, and because it involves more io this time, I have done more schedule optimization. Waiting for a question, I think this homework should be able to use multi-thread to write, and use multi-thread to do more PUT when retrieving data, which should be a good optimization.