contributed by <Sean1127
>,<baimao8437
>,<Lukechin
>,<xdennisx
>,<paul5566
>,<chenweiii
>
延續上學期 "MapReduce with POSIX Thread" 的成果,強化效能和應用案例
Mathias Brossard 在 5, Oct, 2016 完成的 "A simple C thread pool implementation" 為最原始的版本,包含threadpool.c
,threadpool.h
的實做程式碼與heavy.c
,shutdown.c
,thrdtest.c
測試程式碼
threadpool.[ch]
threadpool_create
: create threadpool_t objectthreadpool_add
: add a new task in the queue of a threadpoolthreadpool_destroy
: stop and destroy a threadpool_t objectthreadpool_free
: free memorythreadpool_thread
: worker thread, infinite loop that grabs a task from threadpool queue and executes itheavy.c
64 thread pool,每個 pool 的 queue size 8192,每個 pool thread count 4
此版本需要修正的地方有:
test/
4 個測試程式都有 memory leakmapreduce 是一種軟體框架(software framework),常用於巨量資料的平行計算
此版本在原有的 threadpool 加上 mapreduce 的功能架構
threadpool.[ch]
threadpool_map
: call threadpool_add
, blocks until all is donethreadpool_map_thread
: map task added to threadpool, fixed partition methodthreadpool_reduce
: call threadpool_add
, blocks until all is donethreadpool_reduce_thread
: reduce task added to threadpoolmapreduce.c
: 新的測試程式碼
此版本需要修正的地方有:
threadpool_map_thread
資料分割的方式是直接把 DATASIZE 除以 THREAD_COUNT 去分配 task,此作法寫死且不能展示 mapreduce 的真正的功效mapreduce.c
的程式架構分層太多,詳細可以先偷看這裡經過老師整理之後,交給同學當作業的程式碼做修正
而此版本是同學開始研究的第 1 個 commit
之後的實驗會依循去年同學的改進流程
<Sean1127
>
當 QUEUE \(< 7\) ,有時會出現錯誤的結果,但都可以順利執行結束,以下節錄自終端機執行輸出:
Pool started with 8 threads and queue size of 5
reduce result = 63395205
Pool started with 8 threads and queue size of 5
reduce result = 130575149
Pool started with 8 threads and queue size of 5
Pool started with 8 threads and queue size of 5
reduce result = 21171191
可以看到執行的結果明顯錯誤,而且有時根本無法顯示結果(shutdown)
原因來自 threadpool_add
中對於 queue full 的處理,如果滿了將會直接回傳 -1,讓程式中止
當 QUEUE \(\ge 8\) 的時候,程式可正常結束且結果也正確
根據不同 QUEUE size,map 與 reduce 的執行時間(執行 10 次取平均,每次都先清空 cache):
map
的時間遠大於 reduce
(e.g. QUEUE = 265, map: 0.004424, reduce: 0.000018)兩次改進的內容請參考這裡
詳細請看這裡
threadpool_create
詳細請看這裡
threadpool_reduce
這個版本是從 "handle full queue exception" 開始分支的 branch,所以改良的地方跟計時比較難比較
詳細請看這裡
詳細請看這裡
tests/mapreduce.c
的運行結果不如預期,在 #define DATASIZE (2000000)
的情況下,執行時間如下…可見不但沒有精進反而還有變慢的跡象。…"(引用自上一組同學),雖然說跟原始版本沒有進步,但他們所指的原始版本是沒有 lock free 的最新版本,而不是連質數演算法都沒有改的最原始版本此版本改寫的程式碼非常少,時間進步也非常少
atomic operation 用途並不只這邊,但卻是 lock free programming 的一項利器,改善的幅度不大可能是因為 data size 還太小(2,000,000),又或是其他 function 的 overhead 太大
此版本的修改都在測試程式碼,將 critical section 的 mutex 改用 atomic 完成
雖然圖表顯示執行時間增加(4.25 %),但理論上是沒有影響,增加的幅度 < 5 % 亦可忽略
<baimao8437
>,<xdennisx
>
is_simple
的 for loop 裡面第二個 if 判斷式
if (x < 2) return;
if ((x & ~3) && !(x & 1)) return;
for (int i = 3; i * i <= x; i += 2)
if (x % i == 0) return;
原執行時間:
[map] Total time: 0.561102
[is_simple] Total time: 2.219543
[reduce] Total time: 0.009626
改進方法:先把 2 跟 3 的倍數去掉之後,每個質數的組成都是 6k+1
或是 6k-1
,因此把每次 loop 的迭代的間隔變為 +2
、+4
if (x == 2 || x == 3){
data[n] = x;
return;
}
if(!(x % 2) || !(x % 3) || (x < 2)) return;
long i = 5;
long w = 2;
while (i * i <= x){
if (x % i == 0) return;
i += w;
w = 6 - w;
}
時間有些微下降!!
[map] Total time: 0.405432
[is_simple] Total time: 1.620451
[reduce] Total time: 0.008208
再改進:一次判斷兩個 +2
、+4
if(x<=1)
return;
else if (x <=3)
{
data[n] = x;
return;
}
else if (!(x%2)||!(x%3))
return;
long i = 5;
while (i * i <= x)
{
if (!(x % i)||!(x%(i+2)))
return;
i=i+6;
}
結果效率提升41.1%
[map] Total time: 0.347067
[is_simple] Total time: 1.307372
[reduce] Total time: 0.010900
暫時不做
contributed by <chenweiii
>
Performance counter stats for './mapreduce' (10 runs):
4,2207 cache-misses # 9.791 % of all cache refs ( +- 14.32% )
43,1070 cache-references ( +- 1.87% )
195,7357,9219 cycles ( +- 0.98% )
171,4108,3879 instructions # 0.88 insn per cycle ( +- 0.00% )
1.131522587 seconds time elapsed ( +- 2.51% )
static void threadpool_map_thread(void *arg)
{
int id = *(int *) arg;
threadpool_map_t *map = (threadpool_map_t *) ((int *) arg - id);
int start = id;
int end = map->size;
int delta = map->thread_count;
for (; start < end; start += delta)
map->routine(start, map->arg);
sem_post(&map->done_indicator);
}
執行時間分析圖待補…。Chen WeiThu, May 11, 2017 8:49 PM
Performance counter stats for './mapreduce' (10 runs):
3,2631 cache-misses # 3.648 % of all cache refs ( +- 11.34% )
89,4584 cache-references ( +- 1.62% )
226,2217,9341 cycles ( +- 2.06% )
216,9198,9598 instructions # 0.96 insn per cycle ( +- 0.00% )
1.287776785 seconds time elapsed ( +- 2.52% )
為了補齊時間的分析,跑了 DATASIZE 從 20000 到 100000
雖然 cache misses 有所改善,但執行時間卻有增加的趨勢
調整了一下 thread_count = 16 ,不知道會不會有比較好的表現
在這個版本,資料的切割仍是以 thread_count 處理 Chen WeiMon, May 15, 2017 3:08 AM
仍不知道為什麼同樣 cache-misses 都有下降,卻會有不同的表現? Chen WeiMon, May 15, 2017 4:14 PM
Performance counter stats for './mapreduce' (10 runs):
4,4955 cache-misses # 2.953 % of all cache refs ( +- 13.58% )
152,2247 cache-references ( +- 2.32% )
273,8929,5990 cycles ( +- 0.45% )
216,9467,8206 instructions # 0.79 insn per cycle ( +- 0.00% )
1.033093284 seconds time elapsed ( +- 1.23% )
contributed by <chenweiii
>, <Lukechin
>
commit 411f3c7
/* threadpool_map */
for (int i = 0; i < pool->thread_count; i++) {
threadpool_error_t _err = threadpool_add(pool, threadpool_map_thread,
&map.personal_pointers[i], flags);
}
/* threadpool_map_thread */
int end = map->size / map->thread_count;
int additional_items = map->size - end * map->thread_count;
int start = end * id;
/* threadpool_map */
for (int i = 0; i < task_num; i++) {
threadpool_error_t _err = threadpool_add(pool, threadpool_map_thread,
&map.personal_pointers[i], flags);
}
/* threadpool_map_thread */
int id = *(int *) arg;
threadpool_map_t *map = (threadpool_map_t *) ((int *) arg - id);
int task_num = map->task_num;
int end = map->size / task_num;
int additional_items = map->size - end * task_num;
int start = end * id;
commit c36f353
但目前實作上,是一對一的 mapping 關係,暫時想不到多對一的好處跟實作方法。 Chen WeiFri, May 12, 2017 3:37 PM
回應老師討論區問題,這邊指的 mapping 關係是指 map 與 reduce 之間,不是指 reduce task 與 worker thread 之間。 Chen WeiSun, May 14, 2017 11:33 PM
for (int i = 0; i < task_num; i++) {
sem_wait(&map.done_indicator);
for (int j = 0; j < task_num; j++) {
if (map.personal_pointers[j] == -1) {
/* add reduce task j */
threadpool_error_t _err = threadpool_add(pool, threadpool_reduce_thread,
&info.personal_pointers[j], flags);
map.personal_pointers[j] = -2;
break;
}
}
}
從 interleaving map 的版本整合
暫時不做
暫時不做
contributed by <Sean1127
>,<Lukechin
>,<chenweiii
>
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 78
Model name: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
Stepping: 3
CPU MHz: 399.932
CPU max MHz: 3000.0000
CPU min MHz: 400.0000
BogoMIPS: 4992.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
將原本 threadpool 替換為我們改良後的 threadpool,並比較其效能。
<paul5566
>
用 merge sort 測試 map reduce 效能
sysprog