# ETL processing

**Development Environment:**

* OS: Ubuntu 18.04.1
* CPU: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
* Memory: 16 GB
* Programming language: C++

# Program execution time

* Input file size: 1.1 GB
* Execution time: 24.5 sec
* Measurement basis: wall time

![](https://i.imgur.com/HnT4zVL.png)

# Program development and usage instructions

## Program development

**The program is divided into three parts:**

1. Load the file and pre-process the data
2. Transform the data
3. Output

First, `fgets` reads input.csv line by line, `strtok()` splits each line on the `|` delimiter, and the parsed integers are stored row by row in a two-dimensional vector.

```cpp
// Read input.csv line by line; each row holds 20 pipe-delimited integers.
while (fgets(buf, 1000, in)) {
    const char *d = "|";
    char *p = strtok(buf, d);
    vector<int> tmp;
    for (int i = 0; i < 20; i++) {
        tmp.push_back(atoi(p));
        p = strtok(NULL, d);
    }
    thread_out.num.push_back(tmp);
}
```

Next is where multi-threading comes in handy. I first calculate how many rows each thread needs to process (one row = one record); each thread then converts its own range of rows and saves the result in its own slot of a vector, ready for output. Because every thread reads a disjoint range of the data and no two threads ever modify the same memory, no lock is required.

```cpp
// Convert rows [begin, end) to JSON text and store it in res[turns].
void thread_out::out_file(int turns, int begin, int end) {
    stringstream tmp;
    for (int i = begin; i < end; i++) {
        tmp << "\t{\n";
        for (int j = 0; j < 20; j++) {
            // The last column gets no trailing comma.
            j != 19 ? tmp << "\t" << "\t" << "\"col_" << j + 1 << "\":" << num[i][j] << ", \n"
                    : tmp << "\t" << "\t" << "\"col_" << j + 1 << "\":" << num[i][j] << "\n";
        }
        // The last record gets no trailing comma either.
        if (i != num.size() - 1)
            tmp << "\t},\n";
        else
            tmp << "\t}\n";
    }
    res[turns] = tmp.str();
}
```

Finally, the `res` vector is written out in order.
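The dispatch step itself is not shown above, so here is a minimal sketch of how the partitioning and thread launch might look, consistent with the description. The function `dispatch` and its variable names are my own illustration, not the original program; I assume a `thread_out` class with the `num` and `res` members used in the snippets above.

```cpp
// Hypothetical sketch (not the original code): split the rows evenly,
// run out_file on each range in its own thread, then join and print in order.
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

void dispatch(thread_out &t, int n_threads) {
    const int total = static_cast<int>(t.num.size());
    const int per_thread = (total + n_threads - 1) / n_threads;  // ceiling division
    t.res.resize(n_threads);

    std::vector<std::thread> workers;
    for (int k = 0; k < n_threads; k++) {
        int begin = k * per_thread;
        int end = std::min(total, begin + per_thread);
        workers.emplace_back(&thread_out::out_file, &t, k, begin, end);
    }
    for (auto &w : workers) w.join();

    // Single-threaded output stage: res[0], res[1], ... are already in row order.
    for (const auto &s : t.res) std::fputs(s.c_str(), stdout);
}
```

Because each range is disjoint and each thread writes only to its own `res[k]`, the conversion stage needs no lock, exactly as described above.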
## Instructions for use

* Test data generation (a sketch of a possible generator follows below)

```
g++ -o csv csv_gen.cpp -O2
./csv 5000000
```

The command-line argument controls how much test data is generated; 5000000 records come out to roughly 1 GB. The compiler optimization flag is only there to make generation run faster.

* Multi-threaded conversion

```
g++ -std=c++11 -pthread -o thread thread.cpp
./thread 10
```

Likewise, this program takes the number of threads to spawn from its command-line argument.
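csv_gen.cpp itself is not included in this write-up. Assuming it simply writes the requested number of rows of 20 pipe-delimited integers, a minimal version might look like this (my reconstruction, not the original generator):

```cpp
// Hypothetical sketch of csv_gen.cpp: emit N rows of 20 pipe-delimited
// random integers to input.csv. N comes from the command line.
#include <cstdio>
#include <cstdlib>

int main(int argc, char *argv[]) {
    if (argc < 2) return 1;
    long rows = atol(argv[1]);        // e.g. 5000000 for roughly 1 GB
    FILE *out = fopen("input.csv", "w");
    if (!out) return 1;
    for (long i = 0; i < rows; i++) {
        for (int j = 0; j < 20; j++)
            fprintf(out, j ? "|%d" : "%d", rand());
        fputc('\n', out);
    }
    fclose(out);
    return 0;
}
```

At up to 10 digits per value, each row is around 220 bytes, which is consistent with 5 million rows producing about 1.1 GB.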
# Performance Analysis

First, the execution time of the program:

![](https://i.imgur.com/KNh17HI.png)

Timing here is based on wall time, because `clock()` accumulates the time the CPU spends on every core: a program that keeps two cores busy for 1 second each is counted as 2 seconds, which would bias the performance analysis.
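For reference, here is a minimal sketch of such a measurement, assuming `std::chrono::steady_clock` (the original timing code is not shown); the `clock()` call alongside illustrates the CPU-time figure that would be misleading here:

```cpp
// Minimal sketch: wall time vs. CPU time (not the original timing code).
#include <chrono>
#include <cstdio>
#include <ctime>

int main() {
    std::clock_t c0 = std::clock();
    auto t0 = std::chrono::steady_clock::now();

    // ... run the multi-threaded conversion here ...

    auto t1 = std::chrono::steady_clock::now();
    std::clock_t c1 = std::clock();

    double wall = std::chrono::duration<double>(t1 - t0).count();
    double cpu = double(c1 - c0) / CLOCKS_PER_SEC;
    // With N busy threads, cpu can approach N * wall.
    std::printf("wall: %.2f s, cpu: %.2f s\n", wall, cpu);
    return 0;
}
```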
Next, let's look at the hardware usage at each stage of execution.

![](https://i.imgur.com/3RKCShS.png)

While reading the file, only one core is fully loaded, because this stage is single-threaded. As for memory, I read the whole file into memory before operating on it, so within the allowed range this stage takes as much memory as it can use.

Then comes the data-processing stage:

![](https://i.imgur.com/ZHAgmuQ.png)

Because the converted data accumulates in buffers, memory usage looks high, and since this stage is multi-threaded, every core is fully loaded and CPU usage soars above 700%, which highlights how the OS divides the work and drives up CPU usage at this stage.

![](https://i.imgur.com/FNJcPn9.png)

The last stage returns to a single thread, so only one core is full. This stage mainly writes the output, so memory usage stays at or above the previous stage's level.

![](https://i.imgur.com/m61ZmM5.png)

In addition, htop shows plenty of other processes running alongside. With a modest file size this causes no problem, but if the file is too large and too many threads are opened, memory is exhausted and the program ends up being killed; more on this later.

* Comparing performance across different numbers of threads

> All runs take 5 million records (1.1 GB) as input

1. threads = 1
   ![](https://i.imgur.com/NTSTFBd.png)
2. threads = 10
   ![](https://i.imgur.com/Mt9IJ1U.png)
3. threads = 100
   ![](https://i.imgur.com/DVxV21u.png)
4. threads = 1000
   ![](https://i.imgur.com/mONOs52.png)

Since the comparison is about thread count, I start with the context-switch and CPU-migration counters. From threads = 1 up to threads = 1000 the number of context switches climbs steadily, confirming that the more tasks are waiting to run, the more switching happens between them. Looking at execution time, going from 1 thread to 10 improves performance significantly: under a multi-core architecture the machine spreads the workload across cores effectively. Once the thread count far exceeds the core count, the ability to share work should hit a limit, because there are no spare cores left and the OS can only keep context-switching between threads. Yet comparing threads = 100 with threads = 1000, the slowdown is surprisingly small. I see two possible reasons: either the file being converted is too small, so every thread finishes in about the same time, or the program has no shared variables and therefore no waiting or locking between threads, so even many context switches barely hurt the total time. I would also add that CPU utilization is about the same at 10, 100, and 1000 threads; I suspect this is because there is no communication or locking between threads, so a larger thread count uses roughly the same resources, given that the total workload does not change.

To find out whether a thread count far above the core count really affects performance, I tried the following two approaches:

1. Use a larger file as input
2. Add a shared variable

* First, the larger-file experiment: I scaled the input up to 3 times the original size (about 3 GB).

![](https://i.imgur.com/8Tx3Kj4.png)

Judging from the execution times, the runs barely change, and the gap between thread counts in the conversion stage is still only fractions of a second, so the assumption that the file was simply too small is wrong. I originally wanted to test with an even larger file, but with a 4 GB input the program is killed by the system, the same situation as in the earlier parallelization assignment. This time, however, I found I could use dmesg to inspect the kernel messages. After running it:

![](https://i.imgur.com/J0zi9rk.png)

it turns out to be an out-of-memory problem. Let me fill in a detail I did not cover in the previous homework: this killing mechanism is itself one of the OS's important functions. When memory is exhausted by programs, browsers, or other software, the OOM killer is activated and calls select_bad_process() to pick a "bad" process to kill. The scoring is done by oom_badness(), which selects the job currently occupying the most resources. So when memory truly runs out, the victim is not necessarily the program you are running: browsers and everything else get scored too, which is why Chrome with too many open tabs can also get killed.

Given the above, the result should not be related to file size, so let's move on to adding a shared variable. I added a shared variable at the point where the threads operate, which means access to it must be managed with a lock. The code is as follows.

```cpp
// flag and output_buffer are shared across threads (declared elsewhere);
// flag should be std::atomic<int> for the spin-wait below to be well-defined.
void thread_out::out_file(int turns, int begin, int end) {
    stringstream tmp;
    for (int i = begin; i < end; i++) {
        tmp << "\t{\n";
        for (int j = 0; j < 20; j++) {
            tmp << "\t" << "\t" << temp(j, num[i][j]);  // temp() formats one column
        }
        if (i != num.size() - 1)
            tmp << "\t},\n";
        else
            tmp << "\t}\n";
    }
    // Spin until it is this thread's turn, then append to the shared buffer.
    while (turns != flag)
        ;
    output_buffer << tmp.str();
    flag++;
}
```

Here `output_buffer` is a shared variable that collects all the results before they are written out, so a spin lock is needed to enforce the write order.

![](https://i.imgur.com/ZdavsWG.png)

Afterwards the execution time really does increase significantly, even though the only code added is the while loop. The reason should be obvious: originally the threads never interfered with one another and ran fully in parallel, but with the blocking while loop each thread must wait for the previous one to finish before it can proceed. This waiting time grows with the number of threads, rising roughly linearly, so performance gets steadily worse.

Looking at the other hardware counters:

* threads = 10
  ![](https://i.imgur.com/lLDXXNf.png)
* threads = 100
  ![](https://i.imgur.com/MEOCxdZ.png)
* threads = 1000
  ![](https://i.imgur.com/76F6tJV.png)

Comparing context switches before and after, it is easy to see that the shared variable produces more of them. Because of the lock, if a thread whose turn comes later happens to receive its time slice from the OS first, it simply blocks on the lock, so extra time is wasted waiting. CPU utilization is also much higher than in the version without the shared variable; I suspect the while loop keeps the CPU busy doing nothing, wasting resources.

![](https://i.imgur.com/zO1dMbj.png)

After recording the task-clock with perf record, most of the time is indeed spent in the lock formed by the while loop, confirming the guess.

# **Summary**

From my observations, in a multi-threaded run the OS automatically assigns idle cores to the threads that need to work, but once the thread count greatly exceeds the core count, the OS has to rely on the scheduler to juggle all of them, which drives up the number of context switches.

![](https://i.imgur.com/mcjs8gQ.png)

perf record likewise shows the context switches coming from the scheduler, so more threads is not always better: performance peaks when the thread count is roughly equal to the core count.

Also pay attention to shared variables: they must be managed with a lock, so it is best to avoid them entirely and achieve truly parallel processing. If a shared variable is unavoidable, I think a mutex beats a spin lock, because a spinning thread keeps burning CPU while a mutex lets a waiting thread sleep and consume far fewer resources, at the cost of being more troublesome to set up (a sketch of the mutex version follows at the end of this section).

Finally, a note on CPU migrations: here the OS moves a waiting thread onto another core's run queue. This is another optimization service the OS provides, so that when there are too many threads they do not all pile up and jam inside the same core.
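To make the mutex suggestion concrete, here is a minimal sketch of the ordered-output step rewritten with a mutex and condition variable instead of the spin-wait. This is my illustration, not code from the measured program:

```cpp
// Hypothetical mutex/condition_variable replacement for the spin-wait.
// A waiting thread sleeps inside wait() instead of burning CPU in a busy loop.
#include <condition_variable>
#include <mutex>
#include <sstream>
#include <string>

std::mutex m;
std::condition_variable cv;
int flag = 0;                        // index of the thread whose turn it is
std::stringstream output_buffer;

void append_in_order(int turns, const std::string &chunk) {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return turns == flag; });  // sleep until it is our turn
    output_buffer << chunk;
    flag++;
    cv.notify_all();                 // wake the next thread in line
}
```

This matches the analysis above: the waiting time is spent asleep rather than on the CPU, so the task-clock cost of the lock disappears, although the serialization itself, each thread waiting for its predecessor, is unchanged.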