# Extension
## Top 10 frequent words
I had been working on an algorithm for this, but it turned out to be incorrect and not as fast as an optimized insertion sort ...
## .tar Word Counting
[Github](https://github.com/ilkclord/Extension-of-wordcount)
### .tar
* **Structure**
Each block of a .tar archive is 512 bytes; if a file's size is not a multiple of 512, the rest of its last block is filled with `^@`, i.e. the **NUL character** (ASCII code 0).
* **header**
Each file in the archive is preceded by a header block that stores its metadata:
```clike
struct posix_header
{ /* byte offset */
char name[100]; /* 0 */
char mode[8]; /* 100 */
char uid[8]; /* 108 */
char gid[8]; /* 116 */
char size[12]; /* 124 */
char mtime[12]; /* 136 */
char chksum[8]; /* 148 */
char typeflag; /* 156 */
char linkname[100]; /* 157 */
char magic[6]; /* 257 */
char version[2]; /* 263 */
char uname[32]; /* 265 */
char gname[32]; /* 297 */
char devmajor[8]; /* 329 */
char devminor[8]; /* 337 */
char prefix[155]; /* 345 */
/* 500 */
};
```
* **content**
The file's contents follow the header in 512-byte blocks:
```clike
char [512]
```
* **Fetch the file information**
We define a data structure that corresponds to one block, so an mmap'd archive maps cleanly onto an array of them:
```clike
typedef struct tar {
    char mem[512];
} tar;
```
From the header it is easy to read each file's name and size, and given the size we can quickly locate the next header by computing its block index.
* **Sample code**
```clike
#include <fcntl.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Unpack the .tar at pwd into src/, creating src/ if it doesn't exist. */
int create_src(char *pwd)
{
    char cmd[256];
    snprintf(cmd, sizeof(cmd), "tar -C src/ -xf %s", pwd);
    int fd = open("src/", O_RDONLY);
    if (fd == -1)
        system("mkdir src/");
    else
        close(fd);
    if (system(cmd) == -1)
        return 0;
    return 1;
}

/* Convert the NUL-terminated octal size field to an integer. */
int s2int(char *s, int size)
{
    int r = 0;
    int l = size;
    for (int i = l - 2; i >= 0; i--) {
        int num = s[i] - '0';
        if (num > 0 && num < 8) /* skip NUL/space padding and zero digits */
            r = r + num * pow(8, l - 2 - i);
    }
    return r;
}

/* Return the distance (in 512-byte blocks) from this header to the next
 * one, and report the content size through sz. A return value of 0 marks
 * the zero-filled blocks that terminate the archive. */
int next_header(tar *t, int *sz)
{
    int next = 1;
    char size[12];
    strncpy(size, &(t->mem[124]), 12);
    int contents = s2int(size, 12);
    *sz = contents;
    int add = 0;
    if (contents % 512 != 0)
        add = 1; /* contents are padded up to a full block */
    if (contents)
        return contents / 512 + add + next;
    else
        return 0;
}
```
Since the size field in a .tar header is stored in octal, the nth digit contributes d * 8^n^; after converting it we can calculate the position of the next header. `create_src` unpacks the .tar into a directory first, which guards against a tar bomb (see Wikipedia).
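As a minimal sketch of that walk (assuming the `tar` struct and `next_header` above, an archive containing only regular files, and with error handling omitted), the headers can be visited directly on the mmap'd archive:
```clike
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Walk an mmap'd archive header by header and print name/size.
 * next_header() returning 0 marks the terminating zero blocks
 * (note: directory entries also have size 0 and would stop the walk). */
void list_files(const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    tar *blocks = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    int i = 0, sz = 0, step;
    while ((step = next_header(&blocks[i], &sz)) > 0) {
        printf("%s : %d bytes\n", blocks[i].mem, sz); /* name is at offset 0 */
        i += step;
    }
    munmap(blocks, st.st_size);
    close(fd);
}
```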
Here are some methods I have considered for word-counting across multiple files.
### Version 1
If we get a .tar file, we first mmap it and use the algorithm above to build the list of file names. Then we call the word-count function on each file; the word-count function here is a function version of word-count.exe.
* **count_start**
```clike
int count_start(queue *q)
{
    job *j;
    int work = threadn;
    /* The first job also performs the one-time initialization. */
    queue_r(q, &j);
    if (!j) {
        perror("wrong job");
        exit(EXIT_FAILURE);
    }
    if (!word_count(work, j->data->pwd, 1)) {
        perror("first word_count error");
        exit(EXIT_FAILURE);
    }
    job_destroy(j);
    /* Remaining jobs reuse the structures allocated by the first call. */
    while (atomic_load(&q->entry) > 0) {
        queue_r(q, &j);
        if (j) {
            if (!word_count(work, j->data->pwd, 0)) {
                perror("duration word_count error");
                exit(EXIT_FAILURE);
            }
            job_destroy(j);
        }
    }
    return 1;
}
```
* **word-count function**
```clike
int word_count(int n, char *pwd, int first)
{
    n_threads = n;
    file_name = pwd;
    printf("%d %d %s\n", first, n_threads, pwd);
    if (mr_init(first) == -1) {
        perror("wrong init");
        exit(EXIT_FAILURE);
    }
    run_threads();
    wait_threads();
    if (mr_reduce())
        return 0;
    // if (mr_print()) exit(EXIT_FAILURE);
    munmap(file_content, file_size); /* unmap the file before dropping the pointer */
    file_content = NULL;
    printf("done word_count\n");
    return 1;
}
```
* **Program Flowchart**
```graphviz
digraph G {
    "Get the file name list" -> "init once" -> "Do the word-count" -> "print out result" -> "free the memory"
}
```
The full initialization is done only once, for the first file; later files reuse the allocated structures, which saves the malloc time.
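A minimal sketch of that idea (the real `mr_init(first)` lives in the repository; the flat `buckets` table here is a hypothetical stand-in for its hash table):
```clike
#include <stdlib.h>
#include <string.h>

#define N_BUCKETS 8192

static int *buckets; /* hypothetical stand-in for the real hash table */

/* Allocate only on the first call; later calls just clear per-file state. */
int mr_init_sketch(int first)
{
    if (first) {
        buckets = malloc(N_BUCKETS * sizeof(int));
        if (!buckets)
            return -1;
    }
    memset(buckets, 0, N_BUCKETS * sizeof(int));
    return 0;
}
```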
* **counter initialize**
When merging, if the same word appears in several thread caches, we keep only the node that exists in the main cache; for each duplicate we add its counter to the main node and reset the duplicate's counter to 0.
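A minimal sketch of that merge step, with a hypothetical simplified node type (the real structures are in the repository):
```clike
typedef struct wc_node {
    const char *word;
    int count;
} wc_node;

/* Fold a duplicate's count into the main-cache node; the duplicate is
 * kept with a zeroed counter instead of being freed, so its cache can
 * be reused for the next file. */
void merge_node(wc_node *main_node, wc_node *dup)
{
    main_node->count += dup->count;
    dup->count = 0;
}
```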
* **execution**
```clike
gcc tar-word-count.c -o tar-word-count -lpthread -lm
./tar-word-count file.tar thread_n
```
* **Result**
```clike
>> ./tar-word-count a.tar 4
Words: 14028, word counts: 209040, full buckets: 1243 (ideal 8192)
Operating a.tar with 3 files inside
Done in 52.479 msec
```
The program's execution time is only slightly longer than the sum of word-counting each file individually.
* **Bug**
There is a bug, but I have not found out where it goes wrong.
* **Multiple files with a large difference in size**
While testing two files whose sizes differ greatly (A is 443100 bytes and B is 64 bytes), the program causes an invalid access.
```clike
==8147== Thread 3:
==8147== Invalid read of size 8
==8147== at 0x109EC3: ht_find (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x10A599: wc_add_word (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x10B060: add_sep (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x10B0D2: buff_proceed (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x10B49E: mr_map (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x4869608: start_thread (pthread_create.c:477)
==8147== by 0x4AF4292: clone (clone.S:95)
==8147== Address 0x4be2370 is 107,632 bytes inside an unallocated block of size 4,186,336 in arena "client"
==8147==
==8147== Invalid read of size 8
==8147== at 0x109ECE: ht_find (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x10A599: wc_add_word (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x10B060: add_sep (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x10B0D2: buff_proceed (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x10B49E: mr_map (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x4869608: start_thread (pthread_create.c:477)
==8147== by 0x4AF4292: clone (clone.S:95)
==8147== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==8147==
==8147==
==8147== Process terminating with default action of signal 11 (SIGSEGV)
==8147== Access not within mapped region at address 0x0
==8147== at 0x109ECE: ht_find (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x10A599: wc_add_word (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x10B060: add_sep (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x10B0D2: buff_proceed (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x10B49E: mr_map (in /home/ilkclord/linux2020/quiz8/project/Extension-of-wordcount/pof)
==8147== by 0x4869608: start_thread (pthread_create.c:477)
==8147== by 0x4AF4292: clone (clone.S:95)
```
### Version 2
Plug in a thread pool and let it handle the map/reduce operations (a sketch of the dispatch follows the list):
1. Get the list of files.
2. Divide the files according to their sizes.
3. Following the main pool's order, run the mr_map or reduce function.
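A minimal sketch of that dispatch, assuming per-file jobs already sorted by size; the `file_job` type and `pool_run` interface are hypothetical stand-ins, and the real workers would call `mr_map`/`mr_reduce` instead of printing:
```clike
#include <pthread.h>
#include <stdio.h>

typedef struct {
    const char *name;
    long size;
} file_job;

static file_job *jobs; /* jobs sorted by descending size */
static int n_jobs;
static int next_job;   /* index of the next job to hand out */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void) arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int i = (next_job < n_jobs) ? next_job++ : -1;
        pthread_mutex_unlock(&lock);
        if (i < 0)
            break;
        /* the real worker would run mr_map()/mr_reduce() on jobs[i] */
        printf("counting %s (%ld bytes)\n", jobs[i].name, jobs[i].size);
    }
    return NULL;
}

int pool_run(file_job *list, int n, int n_threads)
{
    jobs = list, n_jobs = n, next_job = 0;
    pthread_t tid[n_threads];
    for (int t = 0; t < n_threads; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (int t = 0; t < n_threads; t++)
        pthread_join(tid[t], NULL);
    return 1;
}
```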
### Version 3
The full MapReduce model, but the fast key has to be implemented first:
1. Get the list of files.
2. Operate on multiple files by calling the word-counting function on each of them in parallel.
### Conclusion
For multiple-file input, I think MapReduce still works. When calling the word-count function we can specify the thread number, so for smaller files we can use fewer threads (a possible heuristic is sketched below). But I think Version 2, which plugs in a thread pool, will be the best way; I will implement it as soon as possible.
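A hypothetical heuristic for that per-file thread count (the 64 KiB-per-thread figure is an assumption, not measured):
```clike
/* Scale the thread count with file size, clamped to [1, max_threads]. */
int threads_for(long size, int max_threads)
{
    const long per_thread = 64 * 1024; /* assumed bytes of work per thread */
    int n = (int) (size / per_thread) + 1;
    return (n > max_threads) ? max_threads : n;
}
```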
## Reference
[GNU Basic Tar Format](https://www.gnu.org/software/tar/manual/html_node/Standard.html)