Heng-Jie Wang
# Technical Report: Distributed System Architecture Validation and Performance Testing

###### tags: `Distributed Learning`

🔙 [**Back To Distributed Learning Notebook**](/1SiMumDORAyQYQ7S_Rp7Ag)

:::info
[Toc]
:::

## 1. Test Model Architectures and Dataset Preparation (4 types)

### 1-1. Datasets

#### 1. (Image-based) [MNIST](http://yann.lecun.com/exdb/mnist/)

![](https://i.imgur.com/J5gllD4.png)

There is no cloud link for this dataset; the data is downloaded with torchvision.

1. Description : handwritten digit recognition
2. Data type : grayscale images ( 28 x 28 ), multiclass ( 0 ~ 9 )
3. Size : train 60000 samples, test 10000 samples
4. Preprocessing : none

#### 2. (Image-based) [Malaria Cell Images Dataset](https://www.kaggle.com/iarunava/cell-images-for-detecting-malaria)

| Uninfected | Parasitized |
| -------- | -------- |
| ![](https://i.imgur.com/vtTgtmb.png) | ![](https://i.imgur.com/ov8xxUH.png) |

[Dataset cloud link](https://drive.google.com/drive/folders/1ilQrMdNOBzsdDai-rOriUVZv1NB7D0gL?usp=sharing)

1. Description : malaria detection
2. Data type : color images ( 3 x 100 x 100 ), binary classification
3. Size : train 24806 samples, test 2754 samples
4. Preprocessing : original images are roughly ( 150, 150 ), resized to ( 100, 100 )

#### 3. (Image-based) [Retinal OCT Images (optical coherence tomography)](https://www.kaggle.com/paultimothymooney/kermany2018?fbclid=IwAR2UH9efcorDTGqJ9ojYs8l47YKqDEX7TP5XsBC-KpTfFs4cVCW27F7elWU)

| NORMAL | CNV | DME | DRUSEN |
| -------- | -------- | -------- | -------- |
| ![](https://i.imgur.com/ntns4hN.jpg) | ![](https://i.imgur.com/EPKoSMm.jpg) | ![](https://i.imgur.com/JZV0ygs.jpg) | ![](https://i.imgur.com/1SiePOr.jpg) |

[Dataset cloud link](https://drive.google.com/drive/folders/1-CpqTwmqPGSBG1BcC9iK9IN9eaTaAanx?usp=sharing)

1. Description : retinal disease recognition
2. Data type : grayscale images ( 256 x 256 ), multiclass ( NORMAL, CNV, DME, DRUSEN )
3. Size : train 32000 samples, test 968 samples
4. Preprocessing : original images are roughly ( 512, 512 ), resized to ( 256, 256 ); the training data is balanced to 8000 samples per class
    * before ![](https://i.imgur.com/wjugmqj.png)
    * after ![](https://i.imgur.com/jW90TpW.png)

#### 4. (Non-Image-based) [ECG Heartbeat Categorization Dataset](https://www.kaggle.com/shayanfazeli/heartbeat)

![](https://i.imgur.com/a7bnSD4.png)

[Dataset cloud link](https://drive.google.com/drive/folders/1gUGHi2oSKF5sPsEINfugL81fXV7-2s5X?usp=sharing)

1. Description : ECG heartbeat classification
2. Data type : one-dimensional data ( 1 x 187 ), multiclass ( 'N': 0, 'S': 1, 'V': 2, 'F': 3, 'Q': 4 )
3. Size : train 87554 samples, test 21892 samples
4. Preprocessing : none

### 1-2. Models

#### 1. LeNet ( trained on MNIST )

1. [Model reference site](https://zhuanlan.zhihu.com/p/29716516)
2. [Original paper](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf)

   LeNet is one of the earliest convolutional network models, introduced by LeCun in 1989.
3. Model architecture

   Two convolutional layers + three fully connected layers

   Architecture :
   ```
   LeNet(
     (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
     (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
     (fc1): Linear(in_features=400, out_features=120, bias=True)
     (fc2): Linear(in_features=120, out_features=84, bias=True)
     (fc3): Linear(in_features=84, out_features=10, bias=True)
   )
   ```
   Parameter share ( cumulative ) :
   ```
   name: conv1.weight percentage: 0.24%
   name: conv1.bias percentage: 0.25%
   name: conv2.weight percentage: 4.14%
   name: conv2.bias percentage: 4.17%
   name: fc1.weight percentage: 81.96%
   name: fc1.bias percentage: 82.15%
   name: fc2.weight percentage: 98.49%
   name: fc2.bias percentage: 98.62%
   name: fc3.weight percentage: 99.98%
   name: fc3.bias percentage: 100.0%
   ```
   Total parameters : 61706
4. Model split
    1. Agent : the two convolutional layers
    2. Server : the three fully connected layers

#### 2. AlexNet ( trained on MC )

1. [Model reference site](https://www.itread01.com/content/1545569463.html?fbclid=IwAR3hHSJxIsb1IzfH9mbTedufl8qKwwWHCXMbADRIuUB2anItCAxALTKRPj8)
2. [Original paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)

   AlexNet won the 2012 ImageNet competition and is named after its first author, Alex. It demonstrated how effective CNNs are as larger, more complex models.
3.
   Model architecture

   Five convolutional layers ( features ) + three fully connected layers ( classifier )

   Architecture :
   ```
   AlexNet(
     (features): Sequential(
       (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
       (1): ReLU(inplace)
       (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
       (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
       (4): ReLU(inplace)
       (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
       (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (7): ReLU(inplace)
       (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (9): ReLU(inplace)
       (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (11): ReLU(inplace)
       (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
     )
     (classifier): Sequential(
       (0): Linear(in_features=9216, out_features=4096, bias=True)
       (1): ReLU(inplace)
       (2): Dropout(p=0.5)
       (3): Linear(in_features=4096, out_features=4096, bias=True)
       (4): ReLU(inplace)
       (5): Dropout(p=0.5)
       (6): Linear(in_features=4096, out_features=2, bias=True)
     )
   )
   ```
   Parameter share ( cumulative ) :
   ```
   name: features.0.weight percentage: 0.04%
   name: features.0.bias percentage: 0.04%
   name: features.3.weight percentage: 0.58%
   name: features.3.bias percentage: 0.58%
   name: features.6.weight percentage: 1.74%
   name: features.6.bias percentage: 1.74%
   name: features.8.weight percentage: 3.3%
   name: features.8.bias percentage: 3.3%
   name: features.10.weight percentage: 4.33%
   name: features.10.bias percentage: 4.33%
   name: classifier.0.weight percentage: 70.54%
   name: classifier.0.bias percentage: 70.55%
   name: classifier.3.weight percentage: 99.98%
   name: classifier.3.bias percentage: 99.99%
   name: classifier.6.weight percentage: 100.0%
   name: classifier.6.bias percentage: 100.0%
   ```
   Total parameters : 57012034
4. Model split
    1. Agent : the five convolutional layers ( features )
    2. Server : the three fully connected layers ( classifier )

#### 3. VggNet16 ( trained on OCT )

1. [Model reference site](https://blog.csdn.net/qq_16234613/article/details/79818370)
2.
   [Original paper](https://arxiv.org/pdf/1409.1556.pdf)

   VggNet was proposed by the Visual Geometry Group at the University of Oxford. Its distinguishing idea is to use small ( 3 x 3 ) convolution kernels together with a deeper network to improve model performance.
3. Model architecture

   13 convolutional layers ( features ) + one fully connected layer ( classifier )

   Architecture :
   ```
   VGG(
     (features): Sequential(
       (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (2): ReLU(inplace)
       (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (5): ReLU(inplace)
       (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
       (7): Dropout(p=0.25)
       (8): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (9): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (10): ReLU(inplace)
       (11): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (12): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (13): ReLU(inplace)
       (14): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
       (15): Dropout(p=0.25)
       (16): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (17): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (18): ReLU(inplace)
       (19): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (20): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (21): ReLU(inplace)
       (22): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (23): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (24): ReLU(inplace)
       (25): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
       (26): Dropout(p=0.25)
       (27): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (28): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (29): ReLU(inplace)
       (30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (31): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (32): ReLU(inplace)
       (33): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (34): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (35): ReLU(inplace)
       (36): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
       (37): Dropout(p=0.25)
       (38): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (39): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (40): ReLU(inplace)
       (41): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (42): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (43): ReLU(inplace)
       (44): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (45): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (46): ReLU(inplace)
       (47): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
       (48): Dropout(p=0.25)
       (49): AvgPool2d(kernel_size=1, stride=1, padding=0)
     )
     (classifier): Linear(in_features=32768, out_features=5, bias=True)
   )
   ```
   Parameter share ( cumulative ) :
   ```
   name: features.0.weight percentage: 0.0%
   name: features.0.bias percentage: 0.0%
   name: features.1.weight percentage: 0.0%
   name: features.1.bias percentage: 0.01%
   name: features.3.weight percentage: 0.25%
   name: features.3.bias percentage: 0.25%
   name: features.4.weight percentage: 0.25%
   name: features.4.bias percentage: 0.25%
   name: features.8.weight percentage: 0.75%
   name: features.8.bias percentage: 0.75%
   name: features.9.weight percentage: 0.75%
   name: features.9.bias percentage: 0.75%
   name: features.11.weight percentage: 1.74%
   name: features.11.bias percentage: 1.74%
   name: features.12.weight percentage: 1.74%
   name: features.12.bias percentage: 1.75%
   name: features.16.weight percentage: 3.73%
   name: features.16.bias percentage: 3.73%
   name: features.17.weight percentage: 3.73%
   name: features.17.bias percentage: 3.73%
   name: features.19.weight percentage: 7.69%
   name: features.19.bias percentage: 7.7%
   name: features.20.weight percentage: 7.7%
   name: features.20.bias percentage: 7.7%
   name: features.22.weight percentage: 11.66%
   name: features.22.bias percentage: 11.66%
   name: features.23.weight percentage: 11.66%
   name: features.23.bias percentage: 11.67%
   name: features.27.weight percentage: 19.59%
   name: features.27.bias percentage: 19.59%
   name: features.28.weight percentage: 19.6%
   name: features.28.bias percentage: 19.6%
   name: features.30.weight percentage: 35.45%
   name: features.30.bias percentage: 35.45%
   name: features.31.weight percentage: 35.46%
   name: features.31.bias percentage: 35.46%
   name: features.33.weight percentage: 51.31%
   name: features.33.bias percentage: 51.31%
   name: features.34.weight percentage: 51.32%
   name: features.34.bias percentage: 51.32%
   name: features.38.weight percentage: 67.17%
   name: features.38.bias percentage: 67.17%
   name: features.39.weight percentage: 67.18%
   name: features.39.bias percentage: 67.18%
   name: features.41.weight percentage: 83.03%
   name: features.41.bias percentage: 83.03%
   name: features.42.weight percentage: 83.04%
   name: features.42.bias percentage: 83.04%
   name: features.44.weight percentage: 98.89%
   name: features.44.bias percentage: 98.89%
   name: features.45.weight percentage: 98.9%
   name: features.45.bias percentage: 98.9%
   name: classifier.weight percentage: 100.0%
   name: classifier.bias percentage: 100.0%
   ```
   Total parameters : 14885829
4. Model split
    1. Agent : the 13 convolutional layers ( features )
    2. Server : the one fully connected layer ( classifier )

#### 4. MLP ( trained on ECG )

1. [Model reference site](https://www.itread01.com/content/1542450988.html?fbclid=IwAR0yMEu6YZfNCc9Tv8ZBqNiVyYW5hf0vn98a7ufcO7PsGBqrM3jBHSYUgjQ)
2. Original paper ( none )

   The basic building block of neural networks is the perceptron, and the multilayer perceptron is what MLP stands for. It is the prototypical neural network concept rather than one specific model architecture.
3.
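   Model architecture

   Four fully connected layers

   Architecture :
   ```
   MLP(
     (fc1): Linear(in_features=187, out_features=1000, bias=True)
     (fc2): Linear(in_features=1000, out_features=500, bias=True)
     (fc3): Linear(in_features=500, out_features=125, bias=True)
     (fc4): Linear(in_features=125, out_features=5, bias=True)
   )
   ```
   Parameter share ( cumulative ) :
   ```
   name: fc1.weight percentage: 24.88%
   name: fc1.bias percentage: 25.01%
   name: fc2.weight percentage: 91.52%
   name: fc2.bias percentage: 91.59%
   name: fc3.weight percentage: 99.9%
   name: fc3.bias percentage: 99.92%
   name: fc4.weight percentage: 100.0%
   name: fc4.bias percentage: 100.0%
   ```
   Total parameters : 751755
4. Model split
    1. Agent : the first two fully connected layers
    2. Server : the last two fully connected layers

#### 5. Parameter count comparison of the four models

* Total parameters, relative to LeNet :
    1. LeNet: 1.0x
    2. AlexNet: 923.93x
    3. VGGNet: 241.24x
    4. MLP: 12.18x
* Share of the whole model held by the agent model :
    1. LeNet: 4.17%
    2. AlexNet: 4.33%
    3. VGGNet: 98.9%
    4. MLP: 91.59%
* Agent model parameters, relative to the LeNet agent model :
    1. LeNet: 1.0x
    2. AlexNet: 959.38x
    3. VGGNet: 5721.45x
    4. MLP: 267.58x

#### 1. Centralized

![](https://i.imgur.com/OG80BPL.png)

#### 2. Distributed

![](https://i.imgur.com/JvUdkzj.png)

#### 3. Objects

##### 1. model

The architectures of the four models. Each model comes in three variants (centralized, distributed server side, distributed agent side), each an independent object; which one is used depends on what switch requests for the current run.

##### 2. switch

Training follows essentially the same flow for all four models; only the model and the dataset differ. switch takes the dataset name as input and initializes the matching dataSet and model.

##### 3. central_train

`$ python train/central_train/central_train.py data_name`

The program run for centralized training. Its argument is data_name, the name of the dataset to train on.

It contains one object, central_train, which it invokes itself. It initializes switch to obtain the dataSet and the centralized model, then asks the user whether to resume from an existing model ( y/n ). On y it asks for the saved model path and the starting epoch; on n or any other input it starts training a new model.

##### 4. socket

Handles data transfer in the distributed setup. For details see part 3 of [ Technical Report: Distributed Architecture System Construction Process ].

##### 5. server

Responsible for all server behavior.

1. `_conn_to_agents` : sets up the server
2. `_send_train_args_to_agents` : sends the training arguments
3. `agents_attr` : assigns each agent the port it uses for snapshots
4. `_send_agents_attrs_to_agents` : for snapshots, tells each agent the IP and port of the previous agent
5. `_send_id_lists_to_agents` : assigns each agent the list of data ids it may use ( every agent in our setup holds all of the data, so to simulate a real deployment the data is split proportionally and each agent is restricted to its own portion )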
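The id assignment performed by `_send_id_lists_to_agents` can be illustrated with a small, dependency-free sketch. This is not the project's code: `split_id_list` is a hypothetical helper, and it assumes equal shares and contiguous 1-based ids, whereas the report only states that the data is split proportionally.

```python
def split_id_list(data_nums, agent_nums):
    """Partition ids 1..data_nums into agent_nums contiguous, near-equal shares.

    Hypothetical helper: equal proportions are an assumption of this sketch.
    """
    base, extra = divmod(data_nums, agent_nums)
    id_lists = []
    start = 1  # ids are numbered from 1
    for k in range(agent_nums):
        # the first `extra` agents receive one additional id
        size = base + (1 if k < extra else 0)
        id_lists.append(list(range(start, start + size)))
        start += size
    return id_lists


id_lists = split_id_list(10, 3)
print(id_lists)  # [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]
```

Each agent then samples batches only from its own list, so no id is served by two agents within one epoch.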
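6. `_whether_waiting_for_agent` : waits for an agent's snapshot to finish; the first agent of the first round is not waited for
7. `_whether_is_training_done` : tells the agent the current epoch so that it can decide whether training is finished ( the total number of epochs was already sent with train_args )
8. `_iter_through_agent_database` : trains / tests on all of one agent's data
9. `_iter_one_epoch` : trains / tests for one whole epoch by calling `_iter_through_agent_database` n times, where n is the number of agents
10. `record_model` : saves the model; called every 5 epochs so that an unforeseen interruption, such as a power outage, does not lose the experiment
11. All other record keeping, such as timing, performance ( accuracy, loss ) and the confusion matrix, is also handled by the server

##### 6. agent

Responsible for all agent behavior.

1. `_conn_to_server` : connects to the server
2. `_recv_train_args_from_server` : receives the training arguments
3. `_recv_agents_attrs_from_server` : for snapshots, obtains the IP and port of the previous agent
4. `_recv_id_list_from_server` : obtains the agent's own data id list
5. `_send_model_to_next_agent` : snapshot; sends the model parameters and the optimizer parameters. Sending the model parameters alone is not enough, because the optimizer state also changes during training and cannot be omitted
6. `_whether_to_recv_model_from_prev_agent` : receives the model parameters and optimizer parameters; the first agent of the first round skips this step
7. `_whether_is_training_done` : receives the current epoch from the server and compares it with the total number of epochs to decide whether this is the last round. In the last round the agent snapshots to the next agent and then exits; the last agent simply exits
8. The agent side also saves the model every 5 epochs to guard against unexpected interruptions

##### 7. distributed_server_train

`$ python train/distributed_train/distributed_server_train.py data_name agent_nums`

The program run on the server side of distributed training. Its arguments are data_name and agent_nums, the name of the dataset to train on and the number of agents.

It initializes switch to obtain the server side of the distributed model, then asks the user whether to resume from an existing model ( y/n ). On y it asks for the saved model path and the starting epoch; on n or any other input it starts training a new model.

##### 8. distributed_agent_train

`$ python train/distributed_train/distributed_agent_train.py agent_num server_IP`

The program run on the agent side of distributed training. Its arguments are agent_num and server_IP, the index of this agent and the IP of the server.

It initializes switch to obtain the dataSet and the agent side of the distributed model.

### 1-4. Program flow

#### 1. Centralized

![](https://i.imgur.com/9JnuboK.png)

#### 2. Distributed

![](https://i.imgur.com/fmF6Wi8.png)

## 2. Connecting the Training Data to the Development Framework with PyMongo

### 2-1. Architecture diagram

![](https://i.imgur.com/vpvbSnU.png)

1. mongoDB_processor: the basic functions for interacting with `mongoDB`, including `reading data with a query`, `dropping the whole dataBase`, `inserting new data into a collection`, `reading one batch_size of data from a collection`, and `dropping a whole collection`. Each operation exists in a version with and without `gridFS`. `gridFS` is the MongoDB mechanism, available through `pymongo`, for storing files larger than 16 MB ( images, audio, video and so on ).
2.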
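   file_processor: reads the `data` and `label` under `data/{data_name}/`. It implements `_read_images_director()`, which reads image paths from a folder, and functions that read and write `.txt` files, such as `write_nums_to_file()`, `read_nums_from_file()`, `write_list_to_file()` and `read_list_from_file()`.
3. data_processor: the bridge between the database and the local folders. It implements the functions shared by every dataSet, such as `_make_sure_data_and_labels_in_database()`, `_upload_data_and_labels_to_database`, `_get_data_and_labels_from_database()` and `get_data_nums_from_database()`. `_make_sure_data_and_labels_in_database()` is covered in more detail under `Program operation flow`.
4. dataSet: the functions each dataSet has to implement on its own. One is `_get_data_and_labels_from_local()`, because each dataSet is stored locally in a different form. Another is `get_data_and_labels()`, the function actually called during model training; it performs the preprocessing for training, and since preprocessing differs per dataSet it lives in this object.

### 2-2. Program operation flow

1. Place the data in local folders: before running the program, put the four dataSets into the project in the layout shown below. `data_nums/` can be ignored because it is generated automatically at run time, and the data under `MNIST/` does not need to be prepared either because it is downloaded automatically with the function pytorch provides.

   ![](https://i.imgur.com/xzy7HYR.png)
2. Check that the amount of data in the database matches the local amount: when a dataSet is initialized, `data_processor` calls `_make_sure_data_and_labels_in_database()`. This function reads `data/data_nums/{data_name}_{data_type}.txt`, a file holding a single number, the total amount of data uploaded to the database, and compares it with the count actually present in the database. If the database count is 0 or the two numbers differ ( which happens when the program was terminated halfway through an upload ), `drop_coll_from_database()` removes the incomplete data from the database, and then `_get_data_and_labels_from_local()` and `_upload_data_and_labels_to_database()` are called to upload the data again. Otherwise the data in the database is complete and the check passes.

   ```python=
   def _make_sure_data_and_labels_in_database(self):
       # number of documents currently stored in the database
       db_data_nums = self.get_data_nums_from_database()
       # expected number, read from data_nums/{data_name}_{data_type}.txt
       local_data_nums = self.read_nums_from_file(
           os.path.join(self.data_nums_dir_path,
                        '%s_%s.txt' % (self.data_name, self.data_type)))

       if db_data_nums != local_data_nums or db_data_nums == 0:
           # a non-empty but mismatching collection is an incomplete upload
           if db_data_nums != 0:
               self.drop_coll_from_database()
           data_or_data_path, labels = self._get_data_and_labels_from_local()
           self._upload_data_and_labels_to_database(data_or_data_path, labels)
   ```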
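The decision taken inside `_make_sure_data_and_labels_in_database()` can be isolated as a pure function and checked without a database. The following is a minimal sketch under that assumption; `plan_sync` and the action names are hypothetical and do not exist in the project.

```python
def plan_sync(db_count, expected_count):
    """Return the actions implied by the two counts, in execution order.

    Hypothetical helper mirroring the branch structure of
    _make_sure_data_and_labels_in_database(); action names are illustrative.
    """
    if db_count == expected_count and db_count != 0:
        return []  # database is complete, nothing to do
    actions = []
    if db_count != 0:
        # leftover documents from an interrupted upload must be removed first
        actions.append("drop_collection")
    actions.append("read_local")
    actions.append("upload")
    return actions


print(plan_sync(60000, 60000))  # []
print(plan_sync(0, 60000))      # ['read_local', 'upload']
print(plan_sync(1234, 60000))   # ['drop_collection', 'read_local', 'upload']
```

Keeping the decision separate from the I/O makes the three cases (complete, empty, partially uploaded) easy to unit test.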
3. Upload the data to the database: this takes two steps, first `_get_data_and_labels_from_local()` and then `_upload_data_and_labels_to_database()`. `_get_data_and_labels_from_local()` reads [`data` or `image paths`] and `label` from the local folder. If the dataSet consists of images, `gridFS_coll_insert()` in `mongoDB_processor` opens each `image path` as `byte`s. Finally `_upload_data_and_labels_to_database()` uploads the paired `data` and `label` to the database. The upload also records an `id` for every `data`, numbered from 1. This matters a great deal: during training, one `batchsize` of `data` and `label` is fetched from the database with a `query` over a range of `id`s. The local storage form of each dataSet is listed below ( the form determines how each dataSet implements `_get_data_and_labels_from_local()` ):
    1. `MNIST`: downloaded online with the `datasets.MNIST` provided by `pytorch`; its `data` and `label` are already paired.
    2. `MC`: image files are read from folders; the `label` is the name of the folder that contains the image.
    3. `OCT`: image files are read from folders; the `label` is contained in the image file name.
    4. `ECG`: `.csv` files are read from a folder; the `data` is the first 187 columns and the `label` is the last column.
4. Read data from the database before training: each dataSet calls `get_data_and_labels()` in `{data_name}_dataSet`. Whatever the dataSet, this function always goes on to call `_get_data_and_labels_from_database()` in `data_processor`. That function maintains an `id_ptr` and, based on `id_list` ( declared when `data_processor` is initialized; it records all data `id`s used in training, usually [1 ~ data_nums] ), decides which ids' `data` and `label` to fetch from the database ( batch_size of them, except for the last batch of the dataSet ). Its code is:

   ```python=
   def _get_data_and_labels_from_database(self, batch_size):
       usage_data_nums = self.get_usage_data_nums()
       old_id_ptr = self.data_id_ptr

       # Maintain id_ptr. The last batch takes only the remaining ids
       # and is not padded up to a full batch_size.
       if old_id_ptr + batch_size >= usage_data_nums:
           new_id_ptr = 0
           id_list = self.usage_data_ids[old_id_ptr:]
       else:
           new_id_ptr = old_id_ptr + batch_size
           id_list = self.usage_data_ids[old_id_ptr: new_id_ptr]
       self.data_id_ptr = new_id_ptr

       # Choose the read function depending on whether the dataSet uses gridFS
       if self.use_gridFS:
           data, labels = self.gridFS_coll_read_batch(self.coll_name, id_list)
       else:
           data, labels = self.coll_read_batch(self.coll_name, id_list)
       return data, labels
   ```

### 2-3. Overview of the functions implemented in mongoDB_processor

----------------------

#### Preface

1.
   This object is wrapped inside each dataSet, so calling one of its functions only affects the dataSet it belongs to. For example:
   ```
   MC_DataSet(MC_TRAIN_ARGS).gridFS_coll_read_batch(...)
   ```
   only reads one `batch_size` of `train data` and `train label` from `MC`.
2. Database operations come in two kinds, with and without `gridFS`. `gridFS` is the library `pymongo` provides for storing files larger than 16 MB. In this project `MC` and `OCT` use `gridFS`, while `MNIST` and `ECG` do not. Functions below whose names start with `gridFS` use `gridFS`. The two kinds must not be mixed: data uploaded through `gridFS` can only be read back through `gridFS`.
3. A `collection` in `mongodb` corresponds to the concept of a `table`. In this implementation each dataSet is stored in its own `database`, and each `database` has two `collection`s, `train_data_labels` and `test_data_labels`, which hold the `train data` and the `test data` respectively.

----------------------

1. coll_insert (coll_name, data, labels)

   Inserts data and labels into the `collection`, and additionally tags every data item with an `id`.
   ```python=
   def coll_insert(self, coll_name, data, labels):
       coll = self.db[coll_name]  # connect to the collection
       data_labels_dicts = []
       for i in range(len(data)):
           data_label_dict = {
               'ID': i + 1,  # important: the query in coll_read_batch() looks up by ID
               'data': data[i],
               'label': labels[i],
           }
           data_labels_dicts.append(data_label_dict)
       coll.insert_many(data_labels_dicts)  # upload all documents in one call
   ```
2. coll_find_all (coll_name)

   Returns a `cursor` over all data in the `collection`. A `cursor` is a `mongodb` data structure; for its usage see [cursor usage](https://docs.mongodb.com/manual/reference/method/js-cursor/).
3. coll_find_query(coll_name, query)

   Returns a `cursor` over the data in the `collection` that matches the query.
4. coll_delete_all (coll_name)

   Deletes `all` `data` and `label` in the `collection`.
5. coll_read_all_labels(coll_name)

   Returns `all` `label`s in the `collection`.
6. coll_read_all (coll_name)

   Returns `all` `data` and `label` in the `collection`.
7.
   coll_read_batch (coll_name, id_list)

   Returns one `batch_size` of `data` and `label` from the `collection`. The `id_list` argument holds the `id`s of those batch_size data items.
   ```python=
   import numpy as np

   def coll_read_batch(self, coll_name, id_list):
       batch_data = []
       batch_labels = []
       find_query = {'ID': {"$in": id_list}}  # matches every document whose ID is in id_list
       batch_data_labels = self.coll_find_query(coll_name=coll_name, query=find_query)
       for data_label in batch_data_labels:
           batch_data.append(data_label['data'])
           batch_labels.append(data_label['label'])
       self.__logger.debug('Done !')
       return np.array(batch_data), np.array(batch_labels)
   ```
8. drop_database()

   Drops the whole `database`, including its two `collection`s `train_data_labels` and `test_data_labels`.
9. gridFS_coll_insert (coll_name, data_file_paths, labels)

   Uploads data to the database through gridFS. The argument is data_file_paths rather than data because gridFS accepts either bytes ( or a str with an explicit encoding ) or a file-like object; this implementation opens each image as a file and uploads it.
   ```python=
   import re
   from gridfs import GridFS

   # Initialize GridFS; `collection` names the table that stores data and labels
   fs = GridFS(db, collection="data_label")
   # iterate over all data paths
   for i in range(len(data_file_paths)):
       # one data path and its corresponding label
       data_file_path = data_file_paths[i]
       label = int(labels[i])
       # metadata; the query in gridFS_coll_read_batch() looks up by ID
       dic = {
           "label": label,
           "file_name": re.split(r"[/\\]", data_file_path)[-1],
           "ID": i + 1  # numbered from 1, consistent with coll_insert()
       }
       # read the image file in binary mode and upload it to the database
       with open(data_file_path, 'rb') as f:
           fs.put(f, **dic)
   ```
10. gridFS_coll_download_all (coll_name, download_dir_path, category_name)

    Downloads everything in the `collection` to the path `{download_dir_path}/{category_name}`.
11. gridFS_coll_read_all_labels(coll_name)

    Returns `all` `label`s in the `collection` through `gridFS`.
12.
    gridFS_coll_read_batch (coll_name, id_list)

    Returns one `batch_size` of `data` and `label` from the `collection` through `gridFS`. The `id_list` argument holds the `id`s of those batch_size data items. The database stores binary data, so what is read back has to be wrapped in `io.BytesIO()` before `Image.open()` can open it. The result is returned as `ndarray`s.
    ```python=
    import io

    import numpy as np
    from gridfs import GridFS
    from PIL import Image

    def gridFS_coll_read_batch(self, coll_name, id_list):
        fs = GridFS(self.db, coll_name)
        find_query = {'ID': {"$in": id_list}}
        batch_images = []
        batch_labels = []
        for grid_out in fs.find(find_query):
            byte_data = grid_out.read()  # raw bytes of the stored image
            image = np.array(Image.open(io.BytesIO(byte_data)))
            batch_images.append(image)
            batch_labels.append(np.array(grid_out.label))
        return np.array(batch_images), np.array(batch_labels)
    ```
13. gridFS_coll_delete_all (coll_name)

    Deletes everything in the `collection` that was uploaded through `gridFS`.
14. gridFS_find_all (coll_name)

    Returns everything in the `collection` that was uploaded through `gridFS`.

## 3. Performance Test of the Traditional Centralized Training Environment

* Environment :
    1. Ubuntu 18.04
    2. Python 3.6

### 3-1. MNIST + LeNet ( lr = 0.01, image_size = ( 28, 28 ), 30 epochs )

1. acc ![](https://i.imgur.com/w7ha89y.png)
2. loss ![](https://i.imgur.com/P9pROlD.png)
3. confusion matrix ![](https://i.imgur.com/62j4ind.png)
4. confusion matrix ( normalized ) ![](https://i.imgur.com/q3KNAr6.png)
5. training time record

   Elapsed time : 5 minutes

   <font color='green'>Start time : Mon Oct 14 18:45:25 2019
   Epoch 1
   Train set: Average loss: 1.3772, Accuracy: 37885/60000 (63%)
   Test set: Average loss: 0.1867, Accuracy: 9426/10000 (94%)
   ... ...
   Epoch 30
   Train set: Average loss: 0.0535, Accuracy: 59148/60000 (98%)
   Test set: Average loss: 0.1321, Accuracy: 9775/10000 (97%)
   End time : Mon Oct 14 18:50:48 2019</font>

### 3-2. MC + AlexNet ( lr = 0.00001, images_size = ( 224, 224 ), 50 epochs )

1. acc ![](https://i.imgur.com/BVW5oec.png)
2. loss ![](https://i.imgur.com/G9CiHz5.png)
3. confusion matrix ![](https://i.imgur.com/Z8o6cjW.png)
4. confusion matrix ( normalized ) ![](https://i.imgur.com/KicBf8S.png)
5.
   training time record

   Elapsed time : **1 hour 6 minutes** ( using a 2060 GPU )

   <font color='green'>Start time : Fri Oct 25 12:35:55 2019
   Epoch 1
   Train set: Average loss: 0.6917, Accuracy: 13453/24812 (54.22%)
   Test set: Average loss: 0.6838, Accuracy: 1660/2754 (60.28%)
   ... ...
   Epoch 50
   Train set: Average loss: 0.1100, Accuracy: 23869/24812 (96.20%)
   Test set: Average loss: 0.1390, Accuracy: 2616/2754 (94.99%)
   End time : Fri Oct 25 13:41:59 2019</font>

### 3-3. OCT + VGG ( lr = 0.005, image_size = ( 256, 256 ), 50 epochs )

1. acc ![](https://i.imgur.com/Drclp7v.png)
2. loss ![](https://i.imgur.com/XXrX3N1.png)
3. confusion matrix ![](https://i.imgur.com/Yt0vBdD.png)
4. confusion matrix ( normalized ) ![](https://i.imgur.com/UATvKjc.png)
5. training time record

   Elapsed time : **5 hours 31 minutes**

   <font color='green'>Start time : Tue Oct 22 13:46:02 2019
   Epoch 1
   Train set: Average loss: 1.7210, Accuracy: 14086/32000 (44.02%)
   Test set: Average loss: 0.5840, Accuracy: 774/968 (79.96%)
   ... ...
   Epoch 50
   Train set: Average loss: 0.0384, Accuracy: 31618/32000 (98.81%)
   Test set: Average loss: 0.0256, Accuracy: 962/968 (99.38%)
   End time : Tue Oct 22 19:17:25 2019</font>

### 3-4. ECG + MLP ( lr = 0.001, data_size = 187, 30 epochs )

1. acc ![](https://i.imgur.com/N2poVqJ.png)
2. loss ![](https://i.imgur.com/65vMHaS.png)
3. confusion matrix ![](https://i.imgur.com/1ZeMk0b.png)
4. confusion matrix ( normalized ) ![](https://i.imgur.com/K3xS0So.png)
5. training time record

   Elapsed time : 7 minutes

   <font color='green'>Start time : Sun Oct 20 20:28:52 2019
   Epoch 1
   Train set: Average loss: 0.3917, Accuracy: 77853/87554 (88.92%)
   Test set: Average loss: 0.2543, Accuracy: 20317/21892 (92.81%)
   ... ...
   Epoch 30
   Train set: Average loss: 0.0308, Accuracy: 86687/87554 (99.01%)
   Test set: Average loss: 0.1088, Accuracy: 21407/21892 (97.78%)
   End time : Sun Oct 20 20:35:33 2019</font>

## 4. Performance Test of the Distributed Training Environment ( 2, 3, 4 agents )

* Environment :
    1. Ubuntu 18.04
    2. Python 3.6

### 4-1. MNIST + LeNet ( 30 epochs )

1. 2 agents
    1. acc ![](https://i.imgur.com/gf5qQpD.png)
    2.
       loss ![](https://i.imgur.com/4w8AA6g.png)
    3. confusion matrix ![](https://i.imgur.com/oLlSHPG.png)
    4. confusion matrix ( normalized ) ![](https://i.imgur.com/RiBnPm9.png)
    5. training time record

       Elapsed time : 44 minutes

       <font color='green'>Start time: Sat Oct 19 16:34:52 2019
       Agent_nums = 2
       Epoch 1
       Train set: Average loss: 0.0127, Accuracy: 48221/60000 (80.37%)
       Test set: Average loss: 0.0194, Accuracy: 9389/10000 (93.89%)
       ... ...
       Epoch 30
       Train set: Average loss: 0.0013, Accuracy: 58913/60000 (98.19%)
       Test set: Average loss: 0.0116, Accuracy: 9762/10000 (97.62%)
       End time: Sat Oct 19 17:18:54 2019</font>
2. 3 agents
    1. acc ![](https://i.imgur.com/GDaxRCY.png)
    2. loss ![](https://i.imgur.com/V2Xkqic.png)
    3. confusion matrix ![](https://i.imgur.com/VlczHf1.png)
    4. confusion matrix ( normalized ) ![](https://i.imgur.com/7cE9rMK.png)
    5. training time record

       Elapsed time : 46 minutes

       <font color='green'>Start time: Sat Oct 19 14:50:18 2019
       Agent_num = 3
       Epoch 1
       Train set: Average loss: 0.0557, Accuracy: 14311/60000 (23.85%)
       Test set: Average loss: 0.2187, Accuracy: 4646/10000 (46.46%)
       ... ...
       Epoch 30
       Train set: Average loss: 0.0018, Accuracy: 58850/60000 (98.08%)
       Test set: Average loss: 0.0151, Accuracy: 9775/10000 (97.75%)
       End time: Sat Oct 19 15:36:08 2019</font>
3. 4 agents
    1. acc ![](https://i.imgur.com/hZoQIQp.png)
    2. loss ![](https://i.imgur.com/iPrlnUn.png)
    3. confusion matrix ![](https://i.imgur.com/DHuF48L.png)
    4. confusion matrix ( normalized ) ![](https://i.imgur.com/IexMwkO.png)
    5.
       training time record

       Elapsed time : 64 minutes

       <font color='green'>Start time: Fri Nov 1 12:39:15 2019
       Agent_nums = 4
       Epoch 1
       agent_2 snapshot start : Fri Nov 1 12:39:58 2019
       agent_2 snapshot end : Fri Nov 1 12:39:58 2019
       agent_3 snapshot start : Fri Nov 1 12:40:30 2019
       agent_3 snapshot end : Fri Nov 1 12:40:30 2019
       agent_4 snapshot start : Fri Nov 1 12:40:59 2019
       agent_4 snapshot end : Fri Nov 1 12:40:59 2019
       Train set: Average loss:0.0292, Accuracy: 42328/60000 (70.55%)
       agent_1 snapshot start : Fri Nov 1 12:41:33 2019
       agent_1 snapshot end : Fri Nov 1 12:41:33 2019
       agent_2 snapshot start : Fri Nov 1 12:41:35 2019
       agent_2 snapshot end : Fri Nov 1 12:41:35 2019
       agent_3 snapshot start : Fri Nov 1 12:41:36 2019
       agent_3 snapshot end : Fri Nov 1 12:41:36 2019
       agent_4 snapshot start : Fri Nov 1 12:41:38 2019
       agent_4 snapshot end : Fri Nov 1 12:41:38 2019
       Test set: Average loss:0.0291, Accuracy: 9533/10000 (95.33%)
       ... ...
       Epoch 30
       agent_1 snapshot start : Fri Nov 1 13:41:54 2019
       agent_1 snapshot end : Fri Nov 1 13:41:54 2019
       agent_2 snapshot start : Fri Nov 1 13:42:20 2019
       agent_2 snapshot end : Fri Nov 1 13:42:20 2019
       agent_3 snapshot start : Fri Nov 1 13:42:47 2019
       agent_3 snapshot end : Fri Nov 1 13:42:47 2019
       agent_4 snapshot start : Fri Nov 1 13:43:13 2019
       agent_4 snapshot end : Fri Nov 1 13:43:14 2019
       Train set: Average loss:0.0024, Accuracy: 59006/60000 (98.34%)
       agent_1 snapshot start : Fri Nov 1 13:43:38 2019
       agent_1 snapshot end : Fri Nov 1 13:43:38 2019
       agent_2 snapshot start : Fri Nov 1 13:43:40 2019
       agent_2 snapshot end : Fri Nov 1 13:43:40 2019
       agent_3 snapshot start : Fri Nov 1 13:43:41 2019
       agent_3 snapshot end : Fri Nov 1 13:43:41 2019
       agent_4 snapshot start : Fri Nov 1 13:43:43 2019
       agent_4 snapshot end : Fri Nov 1 13:43:43 2019
       Test set: Average loss:0.0364, Accuracy: 9704/10000 (97.04%)
       End time: Fri Nov 1 13:43:46 2019</font>

### 4-2. MC + AlexNet ( 50 epochs )

1. 2 agents
    1. acc ![](https://i.imgur.com/55b9G8A.png)
    2. loss ![](https://i.imgur.com/NxsGSZz.png)
    3. confusion matrix ![](https://i.imgur.com/cMAsvvI.png)
    4.
confusion matrix ( normalized ) ![](https://i.imgur.com/9HNG6RV.png) 5. training time record 耗費時間 : 70 分鐘 <font color='green'> 開始時間: Wed Oct 30 18:48:41 2019 Agent_nums = 2 Epoch 1 agent_2 開始 snapshot : Wed Oct 30 18:49:19 2019 agent_2 結束 snapshot : Wed Oct 30 18:49:19 2019 Train set: Average loss:0.0207, Accuracy: 14456/24812 (58.26%) agent_1 開始 snapshot : Wed Oct 30 18:49:58 2019 agent_1 結束 snapshot : Wed Oct 30 18:49:58 2019 agent_2 開始 snapshot : Wed Oct 30 18:50:01 2019 agent_2 結束 snapshot : Wed Oct 30 18:50:02 2019 Test set: Average loss:0.1630, Accuracy: 1756/2754 (63.76%) ... ... Epoch 50 agent_1 開始 snapshot : Wed Oct 30 19:56:48 2019 agent_1 結束 snapshot : Wed Oct 30 19:56:48 2019 agent_2 開始 snapshot : Wed Oct 30 19:57:26 2019 agent_2 結束 snapshot : Wed Oct 30 19:57:26 2019 Train set: Average loss:0.0028, Accuracy: 24045/24812 (96.91%) agent_1 開始 snapshot : Wed Oct 30 19:58:04 2019 agent_1 結束 snapshot : Wed Oct 30 19:58:04 2019 agent_2 開始 snapshot : Wed Oct 30 19:58:08 2019 agent_2 結束 snapshot : Wed Oct 30 19:58:08 2019 Test set: Average loss:0.0336, Accuracy: 2616/2754 (94.99%) 結束時間 : Wed Oct 30 19:58:12 2019 </font> 2. 3 agents 1. acc ![](https://i.imgur.com/CnoDJr1.png) 2. loss ![](https://i.imgur.com/kU1PQHx.png) 3. confusion matrix ![](https://i.imgur.com/8hLy1Oq.png) 4. confusion matrix ( normalized ) ![](https://i.imgur.com/JR83NNk.png) 5. 
training time record 耗費時間 : 15小時 4分鐘 <font color='green'> 開始時間: Thu Oct 31 13:14:12 2019 Agent_nums = 3 Epoch 1 agent_2 開始 snapshot : Thu Oct 31 13:19:20 2019 agent_2 結束 snapshot : Thu Oct 31 13:19:22 2019 agent_3 開始 snapshot : Thu Oct 31 13:24:29 2019 agent_3 結束 snapshot : Thu Oct 31 13:24:31 2019 Train set: Average loss:0.0310, Accuracy: 14261/24812 (57.48%) agent_1 開始 snapshot : Thu Oct 31 13:29:23 2019 agent_1 結束 snapshot : Thu Oct 31 13:29:25 2019 agent_2 開始 snapshot : Thu Oct 31 13:29:36 2019 agent_2 結束 snapshot : Thu Oct 31 13:29:36 2019 agent_3 開始 snapshot : Thu Oct 31 13:29:47 2019 agent_3 結束 snapshot : Thu Oct 31 13:29:48 2019 Test set: Average loss:0.2184, Accuracy: 1661/2754 (60.31%) ... ... Epoch 50 agent_1 開始 snapshot : Fri Nov 1 04:02:26 2019 agent_1 結束 snapshot : Fri Nov 1 04:02:26 2019 agent_2 開始 snapshot : Fri Nov 1 04:06:58 2019 agent_2 結束 snapshot : Fri Nov 1 04:07:00 2019 agent_3 開始 snapshot : Fri Nov 1 04:12:21 2019 agent_3 結束 snapshot : Fri Nov 1 04:12:24 2019 Train set: Average loss:0.0041, Accuracy: 24069/24812 (97.01%) agent_1 開始 snapshot : Fri Nov 1 04:17:42 2019 agent_1 結束 snapshot : Fri Nov 1 04:17:43 2019 agent_2 開始 snapshot : Fri Nov 1 04:17:54 2019 agent_2 結束 snapshot : Fri Nov 1 04:17:54 2019 agent_3 開始 snapshot : Fri Nov 1 04:18:05 2019 agent_3 結束 snapshot : Fri Nov 1 04:18:06 2019 Test set: Average loss:0.0425, Accuracy: 2627/2754 (95.39%) 結束時間: Fri Nov 1 04:18:17 2019 </font> 3. 4 agents 1. acc ![](https://i.imgur.com/ChFmToO.png) 2. loss ![](https://i.imgur.com/RsCpMWL.png) 3. confusion matrix ![](https://i.imgur.com/bwSoYON.png) 4. confusion matrix ( normalized ) ![](https://i.imgur.com/NPgcoLv.png) 5. 
training time record 耗費時間 : 14小時 47分鐘 <font color='green'> 開始時間: Fri Nov 1 15:40:19 2019 Agent_nums = 4 Epoch 1 agent_2 開始 snapshot : Fri Nov 1 15:45:09 2019 agent_2 結束 snapshot : Fri Nov 1 15:45:11 2019 agent_3 開始 snapshot : Fri Nov 1 15:49:19 2019 agent_3 結束 snapshot : Fri Nov 1 15:49:21 2019 agent_4 開始 snapshot : Fri Nov 1 15:53:50 2019 agent_4 結束 snapshot : Fri Nov 1 15:53:52 2019 Train set: Average loss:0.0400, Accuracy: 14225/24812 (57.33%) agent_1 開始 snapshot : Fri Nov 1 15:58:05 2019 agent_1 結束 snapshot : Fri Nov 1 15:58:07 2019 agent_2 開始 snapshot : Fri Nov 1 15:58:16 2019 agent_2 結束 snapshot : Fri Nov 1 15:58:17 2019 agent_3 開始 snapshot : Fri Nov 1 15:58:25 2019 agent_3 結束 snapshot : Fri Nov 1 15:58:26 2019 agent_4 開始 snapshot : Fri Nov 1 15:58:35 2019 agent_4 結束 snapshot : Fri Nov 1 15:58:35 2019 Test set: Average loss:0.3196, Accuracy: 1734/2754 (62.96%) ... ... Epoch 50 agent_1 開始 snapshot : Sat Nov 2 06:11:03 2019 agent_1 結束 snapshot : Sat Nov 2 06:11:04 2019 agent_2 開始 snapshot : Sat Nov 2 06:14:30 2019 agent_2 結束 snapshot : Sat Nov 2 06:14:32 2019 agent_3 開始 snapshot : Sat Nov 2 06:18:46 2019 agent_3 結束 snapshot : Sat Nov 2 06:18:49 2019 agent_4 開始 snapshot : Sat Nov 2 06:22:49 2019 agent_4 結束 snapshot : Sat Nov 2 06:22:51 2019 Train set: Average loss:0.0055, Accuracy: 24010/24812 (96.77%) agent_1 開始 snapshot : Sat Nov 2 06:26:57 2019 agent_1 結束 snapshot : Sat Nov 2 06:26:59 2019 agent_2 開始 snapshot : Sat Nov 2 06:27:07 2019 agent_2 結束 snapshot : Sat Nov 2 06:27:07 2019 agent_3 開始 snapshot : Sat Nov 2 06:27:15 2019 agent_3 結束 snapshot : Sat Nov 2 06:27:15 2019 agent_4 開始 snapshot : Sat Nov 2 06:27:23 2019 agent_4 結束 snapshot : Sat Nov 2 06:27:23 2019 Test set: Average loss:0.0796, Accuracy: 2590/2754 (94.05%) 結束時間: Sat Nov 2 06:27:32 2019 </font> ### 4-3. OCT + VGG ( 50 epochs ) 1. 2 agents 1. acc ![](https://i.imgur.com/QbSxXbA.png) 2. loss ![](https://i.imgur.com/CLBWMwy.png) 3. confusion matrix ![](https://i.imgur.com/5v2mcBt.png) 4. 
confusion matrix ( normalized ) ![](https://i.imgur.com/lNygVSU.png) 5. training time record 耗費時間 : 2 天 7小時 5分鐘 (80 epochs) 耗費時間 : 34小時 22分鐘 (50 epochs) <font color='green'> 開始時間: Sun Oct 27 11:09:13 2019 Agent_nums = 2 Epoch 1 Train set: Average loss: 0.0012, Accuracy: 9014/32000 (28.17%) Test set: Average loss: 0.0277, Accuracy: 395/968 (40.81%) ... ... Epoch 50 Train set: Average loss: 0.0000, Accuracy: 31853/32000 (99.54%) Test set: Average loss: 0.0005, Accuracy: 964/968 (99.59%) ... ... Epoch 80 Train set: Average loss: 0.0000, Accuracy: 31899/32000 (99.68%) Test set: Average loss: 0.0006, Accuracy: 962/968 (99.38%) 結束時間: Tue Oct 29 18:14:05 2019 </font> 2. 3 agents 1. acc ![](https://i.imgur.com/PZRD0e6.png) 2. loss ![](https://i.imgur.com/zv7nUoD.png) 3. confusion matrix ![](https://i.imgur.com/xWu4Liz.png) 4. confusion matrix ( normalized ) ![](https://i.imgur.com/rmdGDUN.png) 5. training time record 耗費時間 : 26小時 25分鐘 (50 epochs) <font color='green'> 開始時間: Wed Oct 23 09:57:23 2019 Agent_nums = 3 Epoch 1 Train set: Average loss: 0.0015, Accuracy: 16337/32000 (51.05%) Test set: Average loss: 0.0133, Accuracy: 824/968 (85.12%) ... ... Epoch 50 Train set: Average loss: 0.0000, Accuracy: 31734/32000 (99.17%) Test set: Average loss: 0.0004, Accuracy: 965/968 (99.69%) 結束時間: Thu Oct 24 12:22:06 2019 </font> 3. 4 agents 1. acc 2. loss 3. confusion matrix 4. confusion matrix ( normalized ) 5. training time record <font color='green'> </font> ### 4-4. ECG + MLP ( 30 epochs ) 1. 2 agents 1. acc ![](https://i.imgur.com/H7qMU8w.png) 2. loss ![](https://i.imgur.com/Z3WOm2o.png) 3. confusion matrix ![](https://i.imgur.com/Lyb55L7.png) 4. confusion matrix ( normalized ) ![](https://i.imgur.com/gCbtudA.png) 5. training time record 耗費時間 : 10 分鐘 <font color='green'> 開始時間: Sun Oct 20 21:18:13 2019 Agent_nums = 2 Epoch 1 Train set: Average loss: 0.0047, Accuracy: 77583/87554 (88.61%) Test set: Average loss: 0.0108, Accuracy: 20438/21892 (93.36%) ... ... 
Epoch 30 Train set: Average loss: 0.0004, Accuracy: 86616/87554 (98.93%) Test set: Average loss: 0.0050, Accuracy: 21436/21892 (97.92%) 結束時間: Sun Oct 20 21:28:30 2019 </font>
2. 3 agents
    1. acc ![](https://i.imgur.com/RyGBJla.png)
    2. loss ![](https://i.imgur.com/1E6Jbf2.png)
    3. confusion matrix ![](https://i.imgur.com/A977zxT.png)
    4. confusion matrix ( normalized ) ![](https://i.imgur.com/rFJpYrO.png)
    5. training time record
       Time spent : 1 hr 36 mins ( using the 2060 as the server )
       <font color='green'> 開始時間: Mon Oct 21 11:14:29 2019 Agent_nums = 3 Epoch 1 Train set: Average loss: 0.0067, Accuracy: 77835/87554 (88.90%) Test set: Average loss: 0.0139, Accuracy: 20512/21892 (93.70%) ... ... Epoch 30 Train set: Average loss: 0.0005, Accuracy: 86668/87554 (98.99%) Test set: Average loss: 0.0076, Accuracy: 21419/21892 (97.84%) 結束時間: Mon Oct 21 12:50:42 2019 </font>
3. 4 agents
    1. acc ![](https://i.imgur.com/ShwzAqQ.png)
    2. loss ![](https://i.imgur.com/Smhedjh.png)
    3. confusion matrix ![](https://i.imgur.com/hwnTsLf.png)
    4. confusion matrix ( normalized ) ![](https://i.imgur.com/xV9D9pt.png)
    5. training time record
       Time spent : 1 hr 42 mins
       <font color='green'> 開始時間: Fri Nov 1 13:53:53 2019 Agent_nums = 4 Epoch 1 agent_2 開始 snapshot : Fri Nov 1 13:54:42 2019 agent_2 結束 snapshot : Fri Nov 1 13:54:42 2019 agent_3 開始 snapshot : Fri Nov 1 13:55:41 2019 agent_3 結束 snapshot : Fri Nov 1 13:55:42 2019 agent_4 開始 snapshot : Fri Nov 1 13:56:23 2019 agent_4 結束 snapshot : Fri Nov 1 13:56:24 2019 Train set: Average loss:0.0094, Accuracy: 77865/87554 (88.93%) agent_1 開始 snapshot : Fri Nov 1 13:57:14 2019 agent_1 結束 snapshot : Fri Nov 1 13:57:14 2019 agent_2 開始 snapshot : Fri Nov 1 13:57:18 2019 agent_2 結束 snapshot : Fri Nov 1 13:57:18 2019 agent_3 開始 snapshot : Fri Nov 1 13:57:22 2019 agent_3 結束 snapshot : Fri Nov 1 13:57:22 2019 agent_4 開始 snapshot : Fri Nov 1 13:57:25 2019 agent_4 結束 snapshot : Fri Nov 1 13:57:25 2019 Test set: Average loss:0.0183, Accuracy: 20662/21892 (94.38%) ... ... 
Epoch 30 agent_1 開始 snapshot : Fri Nov 1 15:31:18 2019 agent_1 結束 snapshot : Fri Nov 1 15:31:18 2019 agent_2 開始 snapshot : Fri Nov 1 15:32:16 2019 agent_2 結束 snapshot : Fri Nov 1 15:32:17 2019 agent_3 開始 snapshot : Fri Nov 1 15:33:04 2019 agent_3 結束 snapshot : Fri Nov 1 15:33:05 2019 agent_4 開始 snapshot : Fri Nov 1 15:33:50 2019 agent_4 結束 snapshot : Fri Nov 1 15:33:51 2019 Train set: Average loss:0.0007, Accuracy: 86760/87554 (99.09%) agent_1 開始 snapshot : Fri Nov 1 15:35:02 2019 agent_1 結束 snapshot : Fri Nov 1 15:35:03 2019 agent_2 開始 snapshot : Fri Nov 1 15:35:06 2019 agent_2 結束 snapshot : Fri Nov 1 15:35:06 2019 agent_3 開始 snapshot : Fri Nov 1 15:35:10 2019 agent_3 結束 snapshot : Fri Nov 1 15:35:10 2019 agent_4 開始 snapshot : Fri Nov 1 15:35:14 2019 agent_4 結束 snapshot : Fri Nov 1 15:35:14 2019 Test set: Average loss:0.0094, Accuracy: 21421/21892 (97.85%) 結束時間: Fri Nov 1 15:35:21 2019 </font>

## 4. Performance Test of the Distributed Training Environment ( 2, 3, 4 agents ) (Again)
* Environment :
    1. Ubuntu 18.04
    2. Python 3.6
* Preface : Because Wi-Fi affected the measured training times, our advisor purchased a switch so that all machines could be connected over a wired network. To keep the conditions consistent, every experiment was rerun with the code unchanged.
    1. 2 agents
        * server : 2080ti
        * agent_1 : 2060
        * agent_2 : 2060
    1. 3 agents
        * server : 2080ti
        * agent_1 : 2060
        * agent_2 : 2060
        * agent_3 : 2060
    1. 4 agents
        * server : 2080ti
        * agent_1 : 2060
        * agent_2 : 2060
        * agent_3 : 2060
        * agent_4 : 2080s
* [Cloud record : record ( no wifi )](https://drive.google.com/open?id=1ArM-0b6gU66MgQOVWBFoCuMbWcwaj-29)

### 4-1. MNIST + LeNet ( 30 epochs )
#### **2 agents**
1. acc ![](https://i.imgur.com/e2PWfUD.png)
2. loss ![](https://i.imgur.com/8EADC7I.png)
3. confusion matrix ![](https://i.imgur.com/Vbh0z0K.png)
4. confusion matrix ( normalized ) ![](https://i.imgur.com/LxIQUcU.png)
5. 
training time record 花費時間 : 7 mins <font color='green'> 開始時間: Tue Nov 5 16:18:51 2019 Agent_nums = 2 Epoch 1 agent_2 開始 snapshot : Tue Nov 5 16:18:58 2019 agent_2 結束 snapshot : Tue Nov 5 16:18:58 2019 Train set: Average loss:0.0093, Accuracy: 50740/60000 (84.57%) agent_1 開始 snapshot : Tue Nov 5 16:19:05 2019 agent_1 結束 snapshot : Tue Nov 5 16:19:05 2019 agent_2 開始 snapshot : Tue Nov 5 16:19:05 2019 agent_2 結束 snapshot : Tue Nov 5 16:19:05 2019 Test set: Average loss:0.0136, Accuracy: 9586/10000 (95.86%) ... ... Epoch 30 agent_1 開始 snapshot : Tue Nov 5 16:25:49 2019 agent_1 結束 snapshot : Tue Nov 5 16:25:49 2019 agent_2 開始 snapshot : Tue Nov 5 16:25:57 2019 agent_2 結束 snapshot : Tue Nov 5 16:25:57 2019 Train set: Average loss:0.0013, Accuracy: 58932/60000 (98.22%) agent_1 開始 snapshot : Tue Nov 5 16:26:04 2019 agent_1 結束 snapshot : Tue Nov 5 16:26:04 2019 agent_2 開始 snapshot : Tue Nov 5 16:26:04 2019 agent_2 結束 snapshot : Tue Nov 5 16:26:04 2019 Test set: Average loss:0.0152, Accuracy: 9717/10000 (97.17%) 結束時間: Tue Nov 5 16:26:06 2019 </font> #### **3 agents** 1. acc ![](https://i.imgur.com/QOyAzmp.png) 2. loss ![](https://i.imgur.com/xDKV8It.png) 3. confusion matrix ![](https://i.imgur.com/I8motaM.png) 4. confusion matrix ( normalized ) ![](https://i.imgur.com/7qC34hW.png) 5. training time record 花費時間 : 7 mins <font color='green'> 開始時間: Tue Nov 5 05:04:06 2019 Agent_nums = 3 Epoch 1 agent_2 開始 snapshot : Tue Nov 5 05:04:10 2019 agent_2 結束 snapshot : Tue Nov 5 05:04:10 2019 agent_3 開始 snapshot : Tue Nov 5 05:04:15 2019 agent_3 結束 snapshot : Tue Nov 5 05:04:15 2019 Train set: Average loss:0.0149, Accuracy: 49482/60000 (82.47%) agent_1 開始 snapshot : Tue Nov 5 05:04:19 2019 agent_1 結束 snapshot : Tue Nov 5 05:04:19 2019 agent_2 開始 snapshot : Tue Nov 5 05:04:20 2019 agent_2 結束 snapshot : Tue Nov 5 05:04:20 2019 agent_3 開始 snapshot : Tue Nov 5 05:04:20 2019 agent_3 結束 snapshot : Tue Nov 5 05:04:20 2019 Test set: Average loss:0.0198, Accuracy: 9573/10000 (95.73%) ... ... 
Epoch 30 agent_1 開始 snapshot : Tue Nov 5 05:11:04 2019 agent_1 結束 snapshot : Tue Nov 5 05:11:04 2019 agent_2 開始 snapshot : Tue Nov 5 05:11:09 2019 agent_2 結束 snapshot : Tue Nov 5 05:11:09 2019 agent_3 開始 snapshot : Tue Nov 5 05:11:13 2019 agent_3 結束 snapshot : Tue Nov 5 05:11:13 2019 Train set: Average loss:0.0022, Accuracy: 58876/60000 (98.13%) agent_1 開始 snapshot : Tue Nov 5 05:11:17 2019 agent_1 結束 snapshot : Tue Nov 5 05:11:17 2019 agent_2 開始 snapshot : Tue Nov 5 05:11:18 2019 agent_2 結束 snapshot : Tue Nov 5 05:11:18 2019 agent_3 開始 snapshot : Tue Nov 5 05:11:18 2019 agent_3 結束 snapshot : Tue Nov 5 05:11:18 2019 Test set: Average loss:0.0178, Accuracy: 9764/10000 (97.64%) 結束時間: Tue Nov 5 05:11:20 2019 </font> #### **4 agents** 1. acc ![](https://i.imgur.com/YamapJj.png) 2. loss ![](https://i.imgur.com/ma1NrDn.png) 3. confusion matrix ![](https://i.imgur.com/7uruJCH.png) 4. confusion matrix ( normalized ) ![](https://i.imgur.com/15S7WZR.png) 5. training time record 花費時間 : 7 mins <font color='green'> 開始時間: Mon Nov 4 18:31:00 2019 Agent_nums = 4 Epoch 1 agent_2 開始 snapshot : Mon Nov 4 18:31:04 2019 agent_2 結束 snapshot : Mon Nov 4 18:31:04 2019 agent_3 開始 snapshot : Mon Nov 4 18:31:08 2019 agent_3 結束 snapshot : Mon Nov 4 18:31:08 2019 agent_4 開始 snapshot : Mon Nov 4 18:31:12 2019 agent_4 結束 snapshot : Mon Nov 4 18:31:12 2019 Train set: Average loss:0.0769, Accuracy: 14943/60000 (24.91%) agent_1 開始 snapshot : Mon Nov 4 18:31:19 2019 agent_1 結束 snapshot : Mon Nov 4 18:31:19 2019 agent_2 開始 snapshot : Mon Nov 4 18:31:20 2019 agent_2 結束 snapshot : Mon Nov 4 18:31:20 2019 agent_3 開始 snapshot : Mon Nov 4 18:31:20 2019 agent_3 結束 snapshot : Mon Nov 4 18:31:20 2019 agent_4 開始 snapshot : Mon Nov 4 18:31:20 2019 agent_4 結束 snapshot : Mon Nov 4 18:31:20 2019 Test set: Average loss:0.1387, Accuracy: 7811/10000 (78.11%) ... ... 
Epoch 30 agent_1 開始 snapshot : Mon Nov 4 18:38:11 2019 agent_1 結束 snapshot : Mon Nov 4 18:38:11 2019 agent_2 開始 snapshot : Mon Nov 4 18:38:15 2019 agent_2 結束 snapshot : Mon Nov 4 18:38:15 2019 agent_3 開始 snapshot : Mon Nov 4 18:38:18 2019 agent_3 結束 snapshot : Mon Nov 4 18:38:18 2019 agent_4 開始 snapshot : Mon Nov 4 18:38:22 2019 agent_4 結束 snapshot : Mon Nov 4 18:38:22 2019 Train set: Average loss:0.0030, Accuracy: 58359/60000 (97.27%) agent_1 開始 snapshot : Mon Nov 4 18:38:25 2019 agent_1 結束 snapshot : Mon Nov 4 18:38:25 2019 agent_2 開始 snapshot : Mon Nov 4 18:38:26 2019 agent_2 結束 snapshot : Mon Nov 4 18:38:26 2019 agent_3 開始 snapshot : Mon Nov 4 18:38:26 2019 agent_3 結束 snapshot : Mon Nov 4 18:38:26 2019 agent_4 開始 snapshot : Mon Nov 4 18:38:26 2019 agent_4 結束 snapshot : Mon Nov 4 18:38:26 2019 Test set: Average loss:0.0303, Accuracy: 9605/10000 (96.05%) 結束時間: Mon Nov 4 18:38:27 2019 </font> ### 4-2. MC + AlexNet ( 50 epochs ) #### **2 agents** 1. acc ![](https://i.imgur.com/wd16U6k.png) 2. loss ![](https://i.imgur.com/f1IFqYz.png) 3. confusion matrix ![](https://i.imgur.com/i1P7T4t.png) 4. confusion matrix ( normalized ) ![](https://i.imgur.com/yRyTVCf.png) 5. training time record 花費時間 : 1 hr 7 mins <font color='green'> 開始時間: Tue Nov 5 16:39:32 2019 Agent_nums = 2 Epoch 1 agent_2 開始 snapshot : Tue Nov 5 16:40:08 2019 agent_2 結束 snapshot : Tue Nov 5 16:40:09 2019 Train set: Average loss:0.0208, Accuracy: 13947/24812 (56.21%) agent_1 開始 snapshot : Tue Nov 5 16:40:45 2019 agent_1 結束 snapshot : Tue Nov 5 16:40:45 2019 agent_2 開始 snapshot : Tue Nov 5 16:40:48 2019 agent_2 結束 snapshot : Tue Nov 5 16:40:49 2019 Test set: Average loss:0.1659, Accuracy: 1651/2754 (59.95%) ... ... 
Epoch 50 agent_1 開始 snapshot : Tue Nov 5 17:44:58 2019 agent_1 結束 snapshot : Tue Nov 5 17:44:59 2019 agent_2 開始 snapshot : Tue Nov 5 17:45:35 2019 agent_2 結束 snapshot : Tue Nov 5 17:45:35 2019 Train set: Average loss:0.0027, Accuracy: 24065/24812 (96.99%) agent_1 開始 snapshot : Tue Nov 5 17:46:12 2019 agent_1 結束 snapshot : Tue Nov 5 17:46:12 2019 agent_2 開始 snapshot : Tue Nov 5 17:46:15 2019 agent_2 結束 snapshot : Tue Nov 5 17:46:16 2019 Test set: Average loss:0.0361, Accuracy: 2626/2754 (95.35%) 結束時間: Tue Nov 5 17:46:20 2019 </font> #### **3 agents** 1. acc ![](https://i.imgur.com/wJXkAO2.png) 2. loss ![](https://i.imgur.com/lSTyWeC.png) 3. confusion matrix ![](https://i.imgur.com/xaPvQzD.png) 4. confusion matrix ( normalized ) ![](https://i.imgur.com/gfajCs9.png) 5. training time record 花費時間 : 1 hr 8 mins <font color='green'> 開始時間: Tue Nov 5 05:24:44 2019 Agent_nums = 3 Epoch 1 agent_2 開始 snapshot : Tue Nov 5 05:25:08 2019 agent_2 結束 snapshot : Tue Nov 5 05:25:09 2019 agent_3 開始 snapshot : Tue Nov 5 05:25:33 2019 agent_3 結束 snapshot : Tue Nov 5 05:25:33 2019 Train set: Average loss:0.0310, Accuracy: 14052/24812 (56.63%) agent_1 開始 snapshot : Tue Nov 5 05:25:57 2019 agent_1 結束 snapshot : Tue Nov 5 05:25:58 2019 agent_2 開始 snapshot : Tue Nov 5 05:26:00 2019 agent_2 結束 snapshot : Tue Nov 5 05:26:00 2019 agent_3 開始 snapshot : Tue Nov 5 05:26:02 2019 agent_3 結束 snapshot : Tue Nov 5 05:26:03 2019 Test set: Average loss:0.2186, Accuracy: 1758/2754 (63.83%) ... ... 
Epoch 50 agent_1 開始 snapshot : Tue Nov 5 06:30:58 2019 agent_1 結束 snapshot : Tue Nov 5 06:30:58 2019 agent_2 開始 snapshot : Tue Nov 5 06:31:22 2019 agent_2 結束 snapshot : Tue Nov 5 06:31:23 2019 agent_3 開始 snapshot : Tue Nov 5 06:31:47 2019 agent_3 結束 snapshot : Tue Nov 5 06:31:47 2019 Train set: Average loss:0.0040, Accuracy: 24086/24812 (97.07%) agent_1 開始 snapshot : Tue Nov 5 06:32:12 2019 agent_1 結束 snapshot : Tue Nov 5 06:32:12 2019 agent_2 開始 snapshot : Tue Nov 5 06:32:14 2019 agent_2 結束 snapshot : Tue Nov 5 06:32:15 2019 agent_3 開始 snapshot : Tue Nov 5 06:32:17 2019 agent_3 結束 snapshot : Tue Nov 5 06:32:17 2019 Test set: Average loss:0.0511, Accuracy: 2615/2754 (94.95%) 結束時間: Tue Nov 5 06:32:20 2019 </font> #### **4 agents** 1. acc ![](https://i.imgur.com/IPVN9kj.png) 2. loss ![](https://i.imgur.com/jwaRozi.png) 3. confusion matrix ![](https://i.imgur.com/JMvUCX3.png) 4. confusion matrix ( normalized ) ![](https://i.imgur.com/uDYpPby.png) 5. training time record 花費時間 : 1 hr 8 mins <font color='green'> 開始時間: Mon Nov 4 18:52:20 2019 Agent_nums = 4 Epoch 1 agent_2 開始 snapshot : Mon Nov 4 18:52:38 2019 agent_2 結束 snapshot : Mon Nov 4 18:52:39 2019 agent_3 開始 snapshot : Mon Nov 4 18:52:57 2019 agent_3 結束 snapshot : Mon Nov 4 18:52:57 2019 agent_4 開始 snapshot : Mon Nov 4 18:53:16 2019 agent_4 結束 snapshot : Mon Nov 4 18:53:16 2019 Train set: Average loss:0.0401, Accuracy: 14402/24812 (58.04%) agent_1 開始 snapshot : Mon Nov 4 18:53:35 2019 agent_1 結束 snapshot : Mon Nov 4 18:53:35 2019 agent_2 開始 snapshot : Mon Nov 4 18:53:37 2019 agent_2 結束 snapshot : Mon Nov 4 18:53:37 2019 agent_3 開始 snapshot : Mon Nov 4 18:53:39 2019 agent_3 結束 snapshot : Mon Nov 4 18:53:39 2019 agent_4 開始 snapshot : Mon Nov 4 18:53:41 2019 agent_4 結束 snapshot : Mon Nov 4 18:53:41 2019 Test set: Average loss:0.3213, Accuracy: 1754/2754 (63.69%) ... ... 
Epoch 50 agent_1 開始 snapshot : Mon Nov 4 19:58:47 2019 agent_1 結束 snapshot : Mon Nov 4 19:58:47 2019 agent_2 開始 snapshot : Mon Nov 4 19:59:06 2019 agent_2 結束 snapshot : Mon Nov 4 19:59:06 2019 agent_3 開始 snapshot : Mon Nov 4 19:59:24 2019 agent_3 結束 snapshot : Mon Nov 4 19:59:25 2019 agent_4 開始 snapshot : Mon Nov 4 19:59:43 2019 agent_4 結束 snapshot : Mon Nov 4 19:59:43 2019 Train set: Average loss:0.0059, Accuracy: 23924/24812 (96.42%) agent_1 開始 snapshot : Mon Nov 4 20:00:00 2019 agent_1 結束 snapshot : Mon Nov 4 20:00:00 2019 agent_2 開始 snapshot : Mon Nov 4 20:00:02 2019 agent_2 結束 snapshot : Mon Nov 4 20:00:03 2019 agent_3 開始 snapshot : Mon Nov 4 20:00:04 2019 agent_3 結束 snapshot : Mon Nov 4 20:00:05 2019 agent_4 開始 snapshot : Mon Nov 4 20:00:06 2019 agent_4 結束 snapshot : Mon Nov 4 20:00:07 2019 Test set: Average loss:0.0697, Accuracy: 2613/2754 (94.88%) 結束時間: Mon Nov 4 20:00:09 2019 </font>

### 4-3. OCT + VggNet16 ( 50 epochs )
#### **2 agents**
1. acc ![](https://i.imgur.com/y8ey5Ed.png)
2. loss ![](https://i.imgur.com/ubg3hGN.png)
3. confusion matrix ![](https://i.imgur.com/PpoCmnz.png)
4. confusion matrix ( normalized ) ![](https://i.imgur.com/MDMcmNN.png)
5. training time record
   Time spent : 9 hrs 44 mins
   <font color='green'> 開始時間: Tue Nov 5 17:47:59 2019 Agent_nums = 2 Epoch 1 agent_2 開始 snapshot : Tue Nov 5 17:53:45 2019 agent_2 結束 snapshot : Tue Nov 5 17:53:47 2019 Train set: Average loss:0.0013, Accuracy: 14223/32000 (44.45%) agent_1 開始 snapshot : Tue Nov 5 17:59:29 2019 agent_1 結束 snapshot : Tue Nov 5 17:59:31 2019 agent_2 開始 snapshot : Tue Nov 5 17:59:36 2019 agent_2 結束 snapshot : Tue Nov 5 17:59:37 2019 Test set: Average loss:0.0186, Accuracy: 610/968 (63.02%) ... ... 
Epoch 50 agent_1 開始 snapshot : Wed Nov 6 03:20:01 2019 agent_1 結束 snapshot : Wed Nov 6 03:20:02 2019 agent_2 開始 snapshot : Wed Nov 6 03:25:47 2019 agent_2 結束 snapshot : Wed Nov 6 03:25:49 2019 Train set: Average loss:0.0000, Accuracy: 31561/32000 (98.63%) agent_1 開始 snapshot : Wed Nov 6 03:31:29 2019 agent_1 結束 snapshot : Wed Nov 6 03:31:31 2019 agent_2 開始 snapshot : Wed Nov 6 03:31:35 2019 agent_2 結束 snapshot : Wed Nov 6 03:31:37 2019 Test set: Average loss:0.0002, Accuracy: 967/968 (99.90%) 結束時間: Wed Nov 6 03:31:42 2019 </font>

#### **3 agents**
1. acc ![](https://i.imgur.com/vNp8toG.png)
2. loss ![](https://i.imgur.com/ZbC2lZ0.png)
3. confusion matrix ![](https://i.imgur.com/5O2ykAs.png)
4. confusion matrix ( normalized ) ![](https://i.imgur.com/bx45Xko.png)
5. training time record
   Time spent : 9 hrs 43 mins
   <font color='green'> 開始時間: Tue Nov 5 06:34:00 2019 Agent_nums = 3 Epoch 1 agent_2 開始 snapshot : Tue Nov 5 06:37:48 2019 agent_2 結束 snapshot : Tue Nov 5 06:37:50 2019 agent_3 開始 snapshot : Tue Nov 5 06:41:38 2019 agent_3 結束 snapshot : Tue Nov 5 06:41:40 2019 Train set: Average loss:0.0020, Accuracy: 10487/32000 (32.77%) agent_1 開始 snapshot : Tue Nov 5 06:45:24 2019 agent_1 結束 snapshot : Tue Nov 5 06:45:26 2019 agent_2 開始 snapshot : Tue Nov 5 06:45:29 2019 agent_2 結束 snapshot : Tue Nov 5 06:45:31 2019 agent_3 開始 snapshot : Tue Nov 5 06:45:34 2019 agent_3 結束 snapshot : Tue Nov 5 06:45:35 2019 Test set: Average loss:0.0271, Accuracy: 613/968 (63.33%) ... ... 
Epoch 50 agent_1 開始 snapshot : Tue Nov 5 16:05:33 2019 agent_1 結束 snapshot : Tue Nov 5 16:05:35 2019 agent_2 開始 snapshot : Tue Nov 5 16:09:24 2019 agent_2 結束 snapshot : Tue Nov 5 16:09:26 2019 agent_3 開始 snapshot : Tue Nov 5 16:13:13 2019 agent_3 結束 snapshot : Tue Nov 5 16:13:15 2019 Train set: Average loss:0.0000, Accuracy: 31658/32000 (98.93%) agent_1 開始 snapshot : Tue Nov 5 16:17:00 2019 agent_1 結束 snapshot : Tue Nov 5 16:17:02 2019 agent_2 開始 snapshot : Tue Nov 5 16:17:04 2019 agent_2 結束 snapshot : Tue Nov 5 16:17:06 2019 agent_3 開始 snapshot : Tue Nov 5 16:17:09 2019 agent_3 結束 snapshot : Tue Nov 5 16:17:11 2019 Test set: Average loss:0.0002, Accuracy: 967/968 (99.90%) 結束時間: Tue Nov 5 16:17:14 2019 </font>

#### **4 agents**
1. acc ![](https://i.imgur.com/qxQyjOp.png)
2. loss ![](https://i.imgur.com/0otNVrQ.png)
3. confusion matrix ![](https://i.imgur.com/H0lh9ZP.png)
4. confusion matrix ( normalized ) ![](https://i.imgur.com/ml5tacd.png)
5. training time record
   Time spent : 9 hrs 1 min
   <font color='green'> 開始時間: Mon Nov 4 20:01:49 2019 Agent_nums = 4 Epoch 1 agent_2 開始 snapshot : Mon Nov 4 20:04:39 2019 agent_2 結束 snapshot : Mon Nov 4 20:04:41 2019 agent_3 開始 snapshot : Mon Nov 4 20:07:31 2019 agent_3 結束 snapshot : Mon Nov 4 20:07:33 2019 agent_4 開始 snapshot : Mon Nov 4 20:10:21 2019 agent_4 結束 snapshot : Mon Nov 4 20:10:23 2019 Train set: Average loss:0.0022, Accuracy: 16595/32000 (51.86%) agent_1 開始 snapshot : Mon Nov 4 20:12:21 2019 agent_1 結束 snapshot : Mon Nov 4 20:12:23 2019 agent_2 開始 snapshot : Mon Nov 4 20:12:26 2019 agent_2 結束 snapshot : Mon Nov 4 20:12:27 2019 agent_3 開始 snapshot : Mon Nov 4 20:12:30 2019 agent_3 結束 snapshot : Mon Nov 4 20:12:31 2019 agent_4 開始 snapshot : Mon Nov 4 20:12:34 2019 agent_4 結束 snapshot : Mon Nov 4 20:12:35 2019 Test set: Average loss:0.0199, Accuracy: 805/968 (83.16%) ... ... 
Epoch 50 agent_1 開始 snapshot : Tue Nov 5 04:51:39 2019 agent_1 結束 snapshot : Tue Nov 5 04:51:41 2019 agent_2 開始 snapshot : Tue Nov 5 04:54:31 2019 agent_2 結束 snapshot : Tue Nov 5 04:54:33 2019 agent_3 開始 snapshot : Tue Nov 5 04:57:22 2019 agent_3 結束 snapshot : Tue Nov 5 04:57:24 2019 agent_4 開始 snapshot : Tue Nov 5 05:00:13 2019 agent_4 結束 snapshot : Tue Nov 5 05:00:15 2019 Train set: Average loss:0.0000, Accuracy: 31627/32000 (98.83%) agent_1 開始 snapshot : Tue Nov 5 05:02:12 2019 agent_1 結束 snapshot : Tue Nov 5 05:02:13 2019 agent_2 開始 snapshot : Tue Nov 5 05:02:16 2019 agent_2 結束 snapshot : Tue Nov 5 05:02:17 2019 agent_3 開始 snapshot : Tue Nov 5 05:02:20 2019 agent_3 結束 snapshot : Tue Nov 5 05:02:21 2019 agent_4 開始 snapshot : Tue Nov 5 05:02:24 2019 agent_4 結束 snapshot : Tue Nov 5 05:02:25 2019 Test set: Average loss:0.0019, Accuracy: 954/968 (98.55%) 結束時間: Tue Nov 5 05:02:28 2019 </font> ### 4-4. ECG + MLP ( 30 epochs ) #### **2 agents** 1. acc ![](https://i.imgur.com/dzwWQE8.png) 2. loss ![](https://i.imgur.com/JCJZeVJ.png) 3. confusion matrix ![](https://i.imgur.com/NjI5X9i.png) 4. confusion matrix ( normalized ) ![](https://i.imgur.com/h4hbwMT.png) 5. training time record 花費時間 : 10 mins <font color='green'> 開始時間: Tue Nov 5 16:27:45 2019 Agent_nums = 2 Epoch 1 agent_2 開始 snapshot : Tue Nov 5 16:27:54 2019 agent_2 結束 snapshot : Tue Nov 5 16:27:54 2019 Train set: Average loss:0.0046, Accuracy: 78304/87554 (89.44%) agent_1 開始 snapshot : Tue Nov 5 16:28:03 2019 agent_1 結束 snapshot : Tue Nov 5 16:28:04 2019 agent_2 開始 snapshot : Tue Nov 5 16:28:04 2019 agent_2 結束 snapshot : Tue Nov 5 16:28:05 2019 Test set: Average loss:0.0095, Accuracy: 20593/21892 (94.07%) ... ... 
Epoch 30 agent_1 開始 snapshot : Tue Nov 5 16:37:30 2019 agent_1 結束 snapshot : Tue Nov 5 16:37:30 2019 agent_2 開始 snapshot : Tue Nov 5 16:37:40 2019 agent_2 結束 snapshot : Tue Nov 5 16:37:40 2019 Train set: Average loss:0.0003, Accuracy: 86735/87554 (99.06%) agent_1 開始 snapshot : Tue Nov 5 16:37:49 2019 agent_1 結束 snapshot : Tue Nov 5 16:37:50 2019 agent_2 開始 snapshot : Tue Nov 5 16:37:50 2019 agent_2 結束 snapshot : Tue Nov 5 16:37:50 2019 Test set: Average loss:0.0048, Accuracy: 21402/21892 (97.76%) 結束時間: Tue Nov 5 16:37:52 2019 </font> #### **3 agents** 1. acc ![](https://i.imgur.com/DpXDTKE.png) 2. loss ![](https://i.imgur.com/dv6dgYq.png) 3. confusion matrix ![](https://i.imgur.com/W1CEyw6.png) 4. confusion matrix ( normalized ) ![](https://i.imgur.com/JNjam3S.png) 5. training time record 花費時間 : 10 mins <font color='green'> 開始時間: Tue Nov 5 05:12:59 2019 Agent_nums = 3 Epoch 1 agent_2 開始 snapshot : Tue Nov 5 05:13:05 2019 agent_2 結束 snapshot : Tue Nov 5 05:13:05 2019 agent_3 開始 snapshot : Tue Nov 5 05:13:11 2019 agent_3 結束 snapshot : Tue Nov 5 05:13:11 2019 Train set: Average loss:0.0068, Accuracy: 78355/87554 (89.49%) agent_1 開始 snapshot : Tue Nov 5 05:13:17 2019 agent_1 結束 snapshot : Tue Nov 5 05:13:17 2019 agent_2 開始 snapshot : Tue Nov 5 05:13:18 2019 agent_2 結束 snapshot : Tue Nov 5 05:13:18 2019 agent_3 開始 snapshot : Tue Nov 5 05:13:18 2019 agent_3 結束 snapshot : Tue Nov 5 05:13:19 2019 Test set: Average loss:0.0138, Accuracy: 20573/21892 (93.97%) ... ... 
Epoch 30 agent_1 開始 snapshot : Tue Nov 5 05:22:43 2019 agent_1 結束 snapshot : Tue Nov 5 05:22:43 2019 agent_2 開始 snapshot : Tue Nov 5 05:22:49 2019 agent_2 結束 snapshot : Tue Nov 5 05:22:49 2019 agent_3 開始 snapshot : Tue Nov 5 05:22:55 2019 agent_3 結束 snapshot : Tue Nov 5 05:22:56 2019 Train set: Average loss:0.0005, Accuracy: 86712/87554 (99.04%) agent_1 開始 snapshot : Tue Nov 5 05:23:02 2019 agent_1 結束 snapshot : Tue Nov 5 05:23:02 2019 agent_2 開始 snapshot : Tue Nov 5 05:23:02 2019 agent_2 結束 snapshot : Tue Nov 5 05:23:02 2019 agent_3 開始 snapshot : Tue Nov 5 05:23:03 2019 agent_3 結束 snapshot : Tue Nov 5 05:23:03 2019 Test set: Average loss:0.0074, Accuracy: 21339/21892 (97.47%) 結束時間: Tue Nov 5 05:23:05 2019 </font> #### **4 agents** 1. acc ![](https://i.imgur.com/L0pkDP3.png) 2. loss ![](https://i.imgur.com/cqq8Bj9.png) 3. confusion matrix ![](https://i.imgur.com/SzlTD3B.png) 4. confusion matrix ( normalized ) ![](https://i.imgur.com/ygOTjx4.png) 5. training time record 花費時間 : 10 mins <font color='green'> 開始時間: Mon Nov 4 18:40:07 2019 Agent_nums = 4 Epoch 1 agent_2 開始 snapshot : Mon Nov 4 18:40:11 2019 agent_2 結束 snapshot : Mon Nov 4 18:40:11 2019 agent_3 開始 snapshot : Mon Nov 4 18:40:16 2019 agent_3 結束 snapshot : Mon Nov 4 18:40:16 2019 agent_4 開始 snapshot : Mon Nov 4 18:40:20 2019 agent_4 結束 snapshot : Mon Nov 4 18:40:21 2019 Train set: Average loss:0.0093, Accuracy: 78204/87554 (89.32%) agent_1 開始 snapshot : Mon Nov 4 18:40:27 2019 agent_1 結束 snapshot : Mon Nov 4 18:40:27 2019 agent_2 開始 snapshot : Mon Nov 4 18:40:28 2019 agent_2 結束 snapshot : Mon Nov 4 18:40:28 2019 agent_3 開始 snapshot : Mon Nov 4 18:40:28 2019 agent_3 結束 snapshot : Mon Nov 4 18:40:28 2019 agent_4 開始 snapshot : Mon Nov 4 18:40:29 2019 agent_4 結束 snapshot : Mon Nov 4 18:40:29 2019 Test set: Average loss:0.0193, Accuracy: 20534/21892 (93.80%) ... ... 
Epoch 30 agent_1 開始 snapshot : Mon Nov 4 18:50:17 2019 agent_1 結束 snapshot : Mon Nov 4 18:50:17 2019 agent_2 開始 snapshot : Mon Nov 4 18:50:22 2019 agent_2 結束 snapshot : Mon Nov 4 18:50:22 2019 agent_3 開始 snapshot : Mon Nov 4 18:50:27 2019 agent_3 結束 snapshot : Mon Nov 4 18:50:27 2019 agent_4 開始 snapshot : Mon Nov 4 18:50:31 2019 agent_4 結束 snapshot : Mon Nov 4 18:50:32 2019 Train set: Average loss:0.0007, Accuracy: 86668/87554 (98.99%) agent_1 開始 snapshot : Mon Nov 4 18:50:37 2019 agent_1 結束 snapshot : Mon Nov 4 18:50:37 2019 agent_2 開始 snapshot : Mon Nov 4 18:50:37 2019 agent_2 結束 snapshot : Mon Nov 4 18:50:38 2019 agent_3 開始 snapshot : Mon Nov 4 18:50:38 2019 agent_3 結束 snapshot : Mon Nov 4 18:50:38 2019 agent_4 開始 snapshot : Mon Nov 4 18:50:38 2019 agent_4 結束 snapshot : Mon Nov 4 18:50:39 2019 Test set: Average loss:0.0109, Accuracy: 21411/21892 (97.80%) 結束時間: Mon Nov 4 18:50:40 2019 </font>

## 5. Performance Comparison of Different Model-Split Designs ( AlexNet + 4 agents as the example )

----------------------

#### Preface

Goal: we split the model in two ways,

1. `Agent-side parameter count < Server-side parameter count`
2. `Agent-side parameter count > Server-side parameter count`

and we want the `Agent` : `Server` parameter ratios of the two splits to mirror each other as closely as possible. For example, if the first split gives a ratio of `1 : 9`, the second should give a ratio as close to `9 : 1` as possible.

Method: we print the `AlexNet model parameters -- cumulative percentage chart` and choose both split points from it. To read the chart, take `name: features.3.weight percentage: 0.58%` as an example: it means that `all parameters up to and including this entry` account for `0.58 %` of the `total model parameters`. The layers covered are `(0): Conv2d`, `(1): ReLU`, `(2): MaxPool2d`, and `(3): Conv2d` in `(features)`. With this chart we can then split the model.

AlexNet model parameters -- cumulative percentage chart:
![](https://i.imgur.com/rww6IqY.png)

`Agent-side parameter count < Server-side parameter count`: following the chart above, `features` goes to the `Agent` and `classifier` goes to the `Server`. The resulting `Agent` : `Server` ratio is `4.33 : 95.67`.

`Agent-side parameter count > Server-side parameter count`: in addition to `features`, the `Agent` also receives the `first fully connected layer`, i.e. `classifier.0.weight` and `classifier.0.bias` in the chart above, while the `remaining classifier layers` all go to the `Server`. The resulting `Agent` : `Server` ratio is `70.55 : 29.45`.

These two splits bring the `Server` and `Agent` parameter shares as close to swapped as we could manage. The experiments follow.

----------------------

### 5-1. Agent-side parameter count < Server-side parameter count
* Model split
    1. Agent : five convolutional layers ( features )
    2. Server : three fully connected layers ( classifier )
    3. 
Parameter ratio Agent : Server = 4 : 96
* Performance
    1. acc ![](https://i.imgur.com/IPVN9kj.png)
    2. loss ![](https://i.imgur.com/jwaRozi.png)
    3. confusion matrix ![](https://i.imgur.com/JMvUCX3.png)
    4. confusion matrix ( normalized ) ![](https://i.imgur.com/uDYpPby.png)
    5. training time record
       Time spent : 1 hr 8 mins
       <font color='green'> 開始時間: Mon Nov 4 18:52:20 2019 Agent_nums = 4 Epoch 1 agent_2 開始 snapshot : Mon Nov 4 18:52:38 2019 agent_2 結束 snapshot : Mon Nov 4 18:52:39 2019 agent_3 開始 snapshot : Mon Nov 4 18:52:57 2019 agent_3 結束 snapshot : Mon Nov 4 18:52:57 2019 agent_4 開始 snapshot : Mon Nov 4 18:53:16 2019 agent_4 結束 snapshot : Mon Nov 4 18:53:16 2019 Train set: Average loss:0.0401, Accuracy: 14402/24812 (58.04%) agent_1 開始 snapshot : Mon Nov 4 18:53:35 2019 agent_1 結束 snapshot : Mon Nov 4 18:53:35 2019 agent_2 開始 snapshot : Mon Nov 4 18:53:37 2019 agent_2 結束 snapshot : Mon Nov 4 18:53:37 2019 agent_3 開始 snapshot : Mon Nov 4 18:53:39 2019 agent_3 結束 snapshot : Mon Nov 4 18:53:39 2019 agent_4 開始 snapshot : Mon Nov 4 18:53:41 2019 agent_4 結束 snapshot : Mon Nov 4 18:53:41 2019 Test set: Average loss:0.3213, Accuracy: 1754/2754 (63.69%) ... ... 
Epoch 50
agent_1 start snapshot : Mon Nov 4 19:58:47 2019
agent_1 end snapshot : Mon Nov 4 19:58:47 2019
agent_2 start snapshot : Mon Nov 4 19:59:06 2019
agent_2 end snapshot : Mon Nov 4 19:59:06 2019
agent_3 start snapshot : Mon Nov 4 19:59:24 2019
agent_3 end snapshot : Mon Nov 4 19:59:25 2019
agent_4 start snapshot : Mon Nov 4 19:59:43 2019
agent_4 end snapshot : Mon Nov 4 19:59:43 2019
Train set: Average loss:0.0059, Accuracy: 23924/24812 (96.42%)
agent_1 start snapshot : Mon Nov 4 20:00:00 2019
agent_1 end snapshot : Mon Nov 4 20:00:00 2019
agent_2 start snapshot : Mon Nov 4 20:00:02 2019
agent_2 end snapshot : Mon Nov 4 20:00:03 2019
agent_3 start snapshot : Mon Nov 4 20:00:04 2019
agent_3 end snapshot : Mon Nov 4 20:00:05 2019
agent_4 start snapshot : Mon Nov 4 20:00:06 2019
agent_4 end snapshot : Mon Nov 4 20:00:07 2019
Test set: Average loss:0.0697, Accuracy: 2613/2754 (94.88%)
End time: Mon Nov 4 20:00:09 2019
</font>

### 5-2. Agent-side parameter count > Server-side parameter count
* Model split
    1. Agent : five convolutional layers + one fully connected layer
    2. Server : two fully connected layers
    3. Parameter ratio Agent : Server = 70 : 30
* Performance
    1. acc
    ![](https://i.imgur.com/BufAq2p.png)
    2. loss
    ![](https://i.imgur.com/KBggZgZ.png)
    3. confusion matrix
    ![](https://i.imgur.com/TDV8dFP.png)
    4. confusion matrix ( normalized )
    ![](https://i.imgur.com/Jgb9ti6.png)
    5.
       training time record
    Time taken : 1 hr 28 mins
<font color='green'>
Start time: Wed Nov 6 12:54:46 2019
Agent_nums = 4
Epoch 1
agent_2 start snapshot : Wed Nov 6 12:55:01 2019
agent_2 end snapshot : Wed Nov 6 12:55:07 2019
agent_3 start snapshot : Wed Nov 6 12:55:22 2019
agent_3 end snapshot : Wed Nov 6 12:55:27 2019
agent_4 start snapshot : Wed Nov 6 12:55:42 2019
agent_4 end snapshot : Wed Nov 6 12:55:48 2019
Train set: Average loss:0.0403, Accuracy: 14423/24812 (58.13%)
agent_1 start snapshot : Wed Nov 6 12:56:03 2019
agent_1 end snapshot : Wed Nov 6 12:56:08 2019
agent_2 start snapshot : Wed Nov 6 12:56:10 2019
agent_2 end snapshot : Wed Nov 6 12:56:15 2019
agent_3 start snapshot : Wed Nov 6 12:56:17 2019
agent_3 end snapshot : Wed Nov 6 12:56:22 2019
agent_4 start snapshot : Wed Nov 6 12:56:23 2019
agent_4 end snapshot : Wed Nov 6 12:56:29 2019
Test set: Average loss:0.3301, Accuracy: 1770/2754 (64.27%)
... ...
Epoch 50
agent_1 start snapshot : Wed Nov 6 14:21:02 2019
agent_1 end snapshot : Wed Nov 6 14:21:07 2019
agent_2 start snapshot : Wed Nov 6 14:21:22 2019
agent_2 end snapshot : Wed Nov 6 14:21:27 2019
agent_3 start snapshot : Wed Nov 6 14:21:42 2019
agent_3 end snapshot : Wed Nov 6 14:21:47 2019
agent_4 start snapshot : Wed Nov 6 14:22:02 2019
agent_4 end snapshot : Wed Nov 6 14:22:07 2019
Train set: Average loss:0.0054, Accuracy: 24027/24812 (96.84%)
agent_1 start snapshot : Wed Nov 6 14:22:21 2019
agent_1 end snapshot : Wed Nov 6 14:22:26 2019
agent_2 start snapshot : Wed Nov 6 14:22:27 2019
agent_2 end snapshot : Wed Nov 6 14:22:33 2019
agent_3 start snapshot : Wed Nov 6 14:22:34 2019
agent_3 end snapshot : Wed Nov 6 14:22:40 2019
agent_4 start snapshot : Wed Nov 6 14:22:41 2019
agent_4 end snapshot : Wed Nov 6 14:22:47 2019
Test set: Average loss:0.0795, Accuracy: 2581/2754 (93.72%)
End time: Wed Nov 6 14:22:48 2019
</font>

## 6. Performance Analysis

### 6-1. Centralized vs. Distributed

![](https://i.imgur.com/oClYIIG.png)

* Note : The table records the result of the last epoch, not the best result reached during training. For fairness, every run on the same dataset uses the same number of epochs and no early stopping is applied. To see the best performance, refer to the attached accuracy plots.

1.
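   Performance : Distributed training shows no significant drop in performance. If the best results were compared instead, the difference should be almost nil.
2. Efficiency : The training code is largely the same in both settings, so the loss of efficiency comes from network latency. Its main sources are : snapshot ( between agents ), send feature ( from agent to server ), and send gradient ( from server to agent ). For MNIST, MC, and ECG, distributed training adds little time, which can be attributed to unavoidable network latency. For OCT, however, the increase from centralized to distributed training is very large, while adding more agents in the distributed OCT setup adds no significant time, so the impact of snapshot is relatively minor. We infer two causes :
    1. According to 1-1, the OCT training set has 32000 images of size ( 256, 256 ), the largest of the four datasets, so the latency from send feature and send gradient is correspondingly higher.
    2. According to 1-2, the agent side is assigned the first 16 layers ( features ), which hold almost all of the model's parameters ( 98.9% ), while the server side holds far fewer ( 1.1% ). This is also why snapshot is not significant. As shown in Part 4, the agents use a 2060 GPU and the server a 2080ti, so distributed training runs at nearly the speed of the 2060.

### 6-2. Parameter Split in Distributed Training ( MC + AlexNet, 4 agents )

![](https://i.imgur.com/GDUGrjg.png)

1. Train accuracy : Comparable.
2. Test accuracy : Accuracy looks lower when the agent holds more parameters, but compare the accuracy plots of the two runs :
![](https://i.imgur.com/hLHvuQQ.png)
The peaks are nearly identical, so the two can also be regarded as comparable.
3. Time : When the agent holds more parameters, training takes 20 minutes longer. Compare the snapshot records of the two runs :
![](https://i.imgur.com/Y4cOIVF.png)
According to 1-4, every agent takes a snapshot twice per epoch. This test uses 4 agents and 50 epochs, for a total of 4 ( agents ) x 50 ( epochs ) x 2 ( snapshots per epoch ) = 400 snapshots. If each snapshot takes 4 sec longer, the total increase is 1600 sec, or about 26 mins, which is close to the measured result.
The increase is only 20 mins rather than 26 mins. We infer that part of it is offset by less time spent on send feature ( from agent to server ) and send gradient ( from server to agent ) : the feature dimension was originally batch_size * 9216, and with the new split it drops to batch_size * 4096.

## Appendix: Source Code for the Performance Tests

* dataSet/dataset_import.py
* train/central_train/central_train.py
* train/distributed_train/server_and_agents/distributed_agent_train.py
* train/distributed_train/server_and_agents/distributed_server_train.py
* ( optional ) train/distributed_train/agent.py
* ( optional ) train/distributed_train/server.py

## The Other Item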
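
As a rough illustration of the cumulative percentage chart used in Section 5 to pick the split points, the sketch below recomputes such a table in plain Python. It assumes the layer shapes of the standard AlexNet; the model in this note may differ. `NUM_CLASSES` is a placeholder, and the names `classifier.fc2` and `classifier.fc3` are hypothetical (only `features.3` and `classifier.0` are named in the note), so the printed ratios are illustrative, not a reproduction of the reported figures.

```python
from math import prod

# Placeholder: the number of output classes is not stated in this note.
NUM_CLASSES = 10

# (name, weight shape) per trainable layer, assuming standard AlexNet shapes.
LAYERS = [
    ("features.0", (64, 3, 11, 11)),
    ("features.3", (192, 64, 5, 5)),
    ("features.6", (384, 192, 3, 3)),
    ("features.8", (256, 384, 3, 3)),
    ("features.10", (256, 256, 3, 3)),
    ("classifier.0", (4096, 9216)),            # first fully connected layer
    ("classifier.fc2", (4096, 4096)),          # hypothetical name
    ("classifier.fc3", (NUM_CLASSES, 4096)),   # hypothetical name
]


def named_param_counts(layers):
    """Yield (tensor name, element count) for every weight and bias, in order."""
    for name, shape in layers:
        yield f"{name}.weight", prod(shape)
        yield f"{name}.bias", shape[0]


def cumulative_percentages(layers):
    """Return [(tensor name, cumulative share of all parameters in %)]."""
    counts = list(named_param_counts(layers))
    total = sum(count for _, count in counts)
    running = 0
    table = []
    for name, count in counts:
        running += count
        table.append((name, 100.0 * running / total))
    return table


def split_ratio(layers, last_agent_tensor):
    """Agent and server shares (%) when the agent keeps every tensor up to
    and including last_agent_tensor."""
    agent = dict(cumulative_percentages(layers))[last_agent_tensor]
    return agent, 100.0 - agent


if __name__ == "__main__":
    for name, pct in cumulative_percentages(LAYERS):
        print(f"name: {name} percentage: {pct:.2f}%")
    # Split after features (5-1) and after the first FC layer (5-2).
    print("agent : server = %.2f : %.2f" % split_ratio(LAYERS, "features.10.bias"))
    print("agent : server = %.2f : %.2f" % split_ratio(LAYERS, "classifier.0.bias"))
```

Moving the split point one layer later only requires passing a different tensor name to `split_ratio`, which makes it easy to check how close two candidate splits come to mirroring each other.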

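
The timing argument in Section 6-2 can be checked with a few lines of arithmetic. The sketch below takes the start and end timestamps from the two training logs in Sections 5-1 and 5-2, computes the measured difference, and compares it with the snapshot estimate of 4 agents x 50 epochs x 2 snapshots x 4 sec. The function and variable names are illustrative and not taken from the project's source code.

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Y"  # timestamp layout used in the training logs


def elapsed_seconds(start, end):
    """Seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


def snapshot_overhead_seconds(agents, epochs, snapshots_per_epoch, extra_seconds):
    """Extra time if every snapshot takes extra_seconds longer."""
    return agents * epochs * snapshots_per_epoch * extra_seconds


# Run 5-1: agent holds fewer parameters than the server.
run_small_agent = elapsed_seconds("Mon Nov 4 18:52:20 2019", "Mon Nov 4 20:00:09 2019")
# Run 5-2: agent holds more parameters than the server.
run_large_agent = elapsed_seconds("Wed Nov 6 12:54:46 2019", "Wed Nov 6 14:22:48 2019")

measured_extra = run_large_agent - run_small_agent           # 1213 s, about 20 min
estimated_extra = snapshot_overhead_seconds(4, 50, 2, 4)     # 1600 s, about 26.7 min

# Per-sample feature size sent to the server shrinks from 9216 to 4096 values.
feature_ratio = 4096 / 9216

print(f"measured extra time : {measured_extra} s ({measured_extra / 60:.1f} min)")
print(f"snapshot estimate   : {estimated_extra} s ({estimated_extra / 60:.1f} min)")
print(f"gap to explain      : {estimated_extra - measured_extra} s")
print(f"feature size ratio  : {feature_ratio:.3f}")
```

The gap of 387 s between the estimate and the measurement is the amount the note attributes to cheaper send feature and send gradient steps. The 4 sec per snapshot is itself an approximation read from the snapshot records, so the gap should be treated as a rough figure.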