# Extension

## Top 10 frequency word

I had been creating an algorithm, but it seems incorrect and not as fast as the optimized insertion sort ......

## .tar Word Counting

[Github](https://github.com/ilkclord/Extension-of-wordcount)

### .tar

* **Structure**
Each block of a .tar file is 512 bytes (512 * 8 bits). If the file size isn't a multiple of 512, the remaining space is filled with `^@`, aka the **NULL character** with ASCII code 0.
* **Header**
Each file has a header storing its metadata:
```clike
struct posix_header
{                        /* byte offset */
  char name[100];        /*   0 */
  char mode[8];          /* 100 */
  char uid[8];           /* 108 */
  char gid[8];           /* 116 */
  char size[12];         /* 124 */
  char mtime[12];        /* 136 */
  char chksum[8];        /* 148 */
  char typeflag;         /* 156 */
  char linkname[100];    /* 157 */
  char magic[6];         /* 257 */
  char version[2];       /* 263 */
  char uname[32];        /* 265 */
  char gname[32];        /* 297 */
  char devmajor[8];      /* 329 */
  char devminor[8];      /* 337 */
  char prefix[155];      /* 345 */
                         /* 500 */
};
```
* **Content**
```clike
char [512]
```
* **Fetch the file information**
We define a data structure corresponding to each block, so the archive can be mmap-ed directly into the program:
```clike
typedef struct {
    char mem[512];
} tar;
```
From the header, it is easy to get each file's name and size; with the size, we can quickly find the next header by calculating its block index.
* **Sample code**
```clike
int create_src(char *pwd)
{
    char cmd[50];
    strcpy(cmd, "tar -C src/ -xf ");
    strcat(cmd, pwd);
    int fd = open("src/", O_RDONLY);
    if (fd == -1) {
        system("mkdir src/");
    }
    if (system(cmd) == -1)
        return 0;
    return 1;
}

int s2int(char *s, int size)
{
    int r = 0;
    int l = size;
    for (int i = l - 2; i >= 0; i--) {
        int num = s[i] - '0';
        if (num > 0 && num < 8)  /* valid octal digits are 0-7 */
            r = r + num * pow(8, (l - 2 - i));
    }
    return r;
}

int next_header(tar *t, int *sz)
{
    int next = 1;
    char size[12];
    strncpy(size, &(t->mem[124]), 12);
    int contents = s2int(size, 12);
    *sz = contents;
    int add = 0;
    if (contents % 512 != 0)
        add = 1;
    if (contents)
        return contents / 512 + add + next;
    else
        return 0;
}
```
Since the size field in .tar is octal, the value of the nth digit is d * 8^n^. After converting, we can calculate the next header's position. `create_src` unpacks the .tar into a directory, which can prevent a tar bomb (found on Wikipedia). Here are some methods I have come up with to do word counting over multiple files.
### Version 1

If we get a .tar file, we first mmap it and use the algorithm above to get the list of file names. Then we call the word-count function for each file; the word-count function here is a function version of word-count.exe.

* **start_count**
```clike
int count_start(queue *q)
{
    job *j;
    int work = threadn;
    queue_r(q, &j);
    if (!j) {
        perror("wrong job");
        exit(EXIT_FAILURE);
    }
    if (!word_count(work, j->data->pwd, 1)) {
        perror("first word_count error");
        exit(EXIT_FAILURE);
    }
    job_destroy(j);
    while (atomic_load(&q->entry) > 0) {
        queue_r(q, &j);
        if (j) {
            if (!word_count(work, j->data->pwd, 0)) {
                perror("duration word_count error");
                exit(EXIT_FAILURE);
            }
            job_destroy(j);
        }
    }
    return 1;
}
```
* **word-count function**
```clike
int word_count(int n, char *pwd, int first)
{
    n_threads = n;
    file_name = pwd;
    printf("%d %d %s\n", first, n_threads, pwd);
    if (mr_init(first) == -1) {
        perror("wrong init\n");
        exit(EXIT_FAILURE);
    }
    run_threads();
    wait_threads();
    if (mr_reduce())
        return 0;
    // if (mr_print()) exit(EXIT_FAILURE);
    file_content = NULL;
    munmap(NULL, file_size);
    printf("done word_count\n");
    return 1;
}
```
* **Program Flowchart**
```graphviz
digraph G {
    "Get the file name list" -> "init for once" -> "Do the word-count" -> "print out result" -> "free the memory"
}
```
There is only one initialization per run, which saves the time spent on malloc.
* **Counter initialization**
When merging, for the same word appearing in different caches, we keep only the node that exists in the main cache; for the others we add their counters to the main node and set their own counters to 0.
* **Execution**
```shell
gcc tar-word-count.c -o tar-word-count -lpthread -lm
./tar-word-count .tar thread_n
```
Result
```shell
>> ./tar-word-count a.tar 4
Words: 14028, word counts: 209040, full buckets: 1243 (ideal 8192)
Operating a.tar with 3 files inside
Done in 52.479 msec
```
The execution time of the program is slightly longer than the sum of word-counting each file separately.
* **Bug**
There exists a bug, but I can't find out where it goes wrong.
* **Multiple files with a big difference in file size**
When testing with two files whose sizes differ greatly, A being 443100 bytes and B being 64 bytes, the program causes an invalid access.
```clike
==8147== Thread 3:
==8147== Invalid read of size 8
==8147==    at 0x109EC3: ht_find (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x10A599: wc_add_word (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x10B060: add_sep (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x10B0D2: buff_proceed (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x10B49E: mr_map (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x4869608: start_thread (pthread_create.c:477)
==8147==    by 0x4AF4292: clone (clone.S:95)
==8147==  Address 0x4be2370 is 107,632 bytes inside an unallocated block of size 4,186,336 in arena "client"
==8147==
==8147== Invalid read of size 8
==8147==    at 0x109ECE: ht_find (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x10A599: wc_add_word (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x10B060: add_sep (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x10B0D2: buff_proceed (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x10B49E: mr_map (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x4869608: start_thread (pthread_create.c:477)
==8147==    by 0x4AF4292: clone (clone.S:95)
==8147==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==8147==
==8147==
==8147== Process terminating with default action of signal 11 (SIGSEGV)
==8147==  Access not within mapped region at address 0x0
==8147==    at 0x109ECE: ht_find (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x10A599: wc_add_word (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x10B060: add_sep (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x10B0D2: buff_proceed (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x10B49E: mr_map (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147==    by 0x4869608: start_thread (pthread_create.c:477)
==8147==    by 0x4AF4292: clone (clone.S:95)
```

### Version 2

Plug in a thread pool and let it handle the map / reduce operations:

1. Get the list of files
2. Divide the files according to their sizes
3. Following the main pool's order, run the mr_map or reduce function

### Version 3

The MapReduce model, but the fast key has to be implemented first:

1. Get the list of files
2. Operate on multiple files by calling the word-counting function for each in parallel

### Conclude

For multiple-file input, I think MapReduce still works. When calling the word-count function we can specify the number of threads, so for smaller files we can use fewer threads. But I think Version 2, which plugs in a thread pool, will be the best way, and I will implement it as soon as possible.

## Reference

[GNU Basic Tar Format](https://www.gnu.org/software/tar/manual/html_node/Standard.html)