---
title: AI BERT
tags: APAC HPC-AI competition
---
# AI BERT pre-knowledge
[TOC]
---
[Benchmark](https://www.hpcadvisorycouncil.com/events/2020/APAC-AI-HPC/pdf/HPC-AI_Competition_BERT-LARGE_Benchmark_guidelines.pdf)
[libnccl-dev_2.5.6-1+cuda10.2_amd64.deb](https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnccl-dev_2.5.6-1+cuda10.2_amd64.deb)
## GLUE Benchmark (General Language Understanding Evaluation)
### Step 1: Download the benchmark data script
[download_glue_data.py (gist)](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
### Step 2: Run the script to fetch the GLUE data
run `# python download_glue_data.py --data_dir glue_data --tasks all`
### Step 3: Download the benchmark code
run `# git clone https://github.com/NVIDIA/DeepLearningExamples.git`
### Step 4: Download the BERT-Large model file
run `# wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip`
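If you prefer to script the download, here is a minimal Python sketch using only the standard library; the target directory `bert_large` is an arbitrary name chosen for this example.
```python
# Fetch and unpack the BERT-Large (whole-word-masking, uncased) checkpoint from Step 4.
import urllib.request
import zipfile

url = ("https://storage.googleapis.com/bert_models/2019_05_30/"
       "wwm_uncased_L-24_H-1024_A-16.zip")
archive = "wwm_uncased_L-24_H-1024_A-16.zip"

urllib.request.urlretrieve(url, archive)   # download the checkpoint archive
with zipfile.ZipFile(archive) as zf:
    zf.extractall("bert_large")            # vocab.txt, bert_config.json, checkpoint files
```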
---
[Install MOFED](https://community.mellanox.com/s/article/howto-install-mlnx-ofed-driver)

- [x] TensorFlow-GPU installed, with support of NCCL Horovod
- [x] The BERT Large model file
- [ ] The MNLI dataset for fine-tuning tasks
- [ ] MLNX OFED installed
- [ ] A working copy of OpenMPI 4 at /usr/mpi/gcc/openmpi-4.0.2rc3/bin/mpirun, installed by the OFED driver
---
## Three types of machine learning
<table>
<thead>
<tr>
<th>Criteria</th>
<th>Supervised ML</th>
<th>Unsupervised ML</th>
<th>Reinforcement ML</th>
</tr>
</thead>
<tbody>
<tr>
<td>Definition</td>
<td>Learns by using labelled data</td>
<td>Trained using unlabelled data without any guidance.</td>
<td>Works on interacting with the environment</td>
</tr>
<tr>
<td>Type of data</td>
<td>Labelled data</td>
<td>Unlabelled data</td>
<td>No predefined data</td>
</tr>
<tr>
<td>Type of problems</td>
<td>Regression and classification</td>
<td>Association and Clustering</td>
<td>Exploitation or Exploration</td>
</tr>
<tr>
<td>Supervision</td>
<td>Extra supervision</td>
<td>No supervision</td>
<td>No supervision</td>
</tr>
<tr>
<td>Algorithms</td>
<td>Linear Regression, Logistic Regression, SVM, KNN, etc.</td>
<td>K-Means, C-Means, Apriori</td>
<td>Q-Learning, SARSA</td>
</tr>
<tr>
<td>Aim</td>
<td>Calculate outcomes</td>
<td>Discover underlying patterns</td>
<td>Learn a series of actions</td>
</tr>
<tr>
<td>Application</td>
<td>Risk Evaluation, Forecast Sales</td>
<td>Recommendation System, Anomaly Detection</td>
<td>Self Driving Cars, Gaming, Healthcare</td>
</tr>
</tbody>
</table>
Goal: recognize an apple.
- Supervised:
    - Training input: many fruit photos, each with a label saying which fruit it is.
    - Inference input: ask for an apple.
    - Inference output: a picture of an apple.
- Unsupervised:
    - Training input: a pile of fruit photos (no labels).
    - During training: the model tries to output "apple", we tell it whether it is right, and it adjusts its own weights and keeps training.
    - In the end we still hope it can recognize apples.
- Reinforcement (I don't fully understand this one yet):
    - Basically no training data is given. The machine is left to explore, each exploration produces an outcome, and the machine learns on its own from how good or bad that outcome is. Eventually, in ==any== state, the machine can judge which action will produce the best result.

During training it seems we do not tell the model whether it is correct; I think the machine extracts features on its own and uses them as the basis for its judgment, and whether those features are right depends on the data it is given. From what I have read, the case where a human steps in and tells the machine the answer should be supervised learning.
My guess is that unsupervised learning relies on the training data itself to ensure the trained result is what we want.
For example, to decide whether a plant is a flower or a leaf, the machine learns from the given images that if the feature is green -> the result is a leaf, and if the feature is red -> the result is a flower.
If the training data is not comprehensive enough, the machine may misjudge.
The main difference between the two is whether the data is labelled; the real issue is that labelling becomes very tedious once the dataset is large.
Nowadays semi-supervised learning also seems common: a small part of the data is labelled, most of it is not, and the existing labels (patterns) are used to extract features from the unlabelled data.
p.s. "label" and "tag" mean roughly the same thing here; either term works.
---
## NLP
- Goal: teach machines to understand the semantics of human language, so that large amounts of text data can be processed.
---
## Transformer Network
### Sequence-to-Sequence Learning (Seq2Seq)
Using a neural network, a sequence (for example a sentence) is taken as input and transformed by the model into another sequence as output.
A Seq2Seq model consists of an Encoder and a Decoder.
The Encoder transforms the input into a vector space (an n-dimensional vector).
The Decoder then takes that vector and transforms it into the output.
Example: translating Chinese into English.
The Encoder transforms the Chinese into an abstract, made-up intermediate language.
The Decoder takes that made-up language and transforms it into English.
How do the Encoder and Decoder know how to produce and read this made-up language? That is exactly what has to be trained.
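To make the Encoder/Decoder split concrete, here is a minimal Keras sketch of a classic LSTM Seq2Seq model. The vocabulary sizes and hidden size are made-up values, and this is a generic illustration of the idea rather than the Transformer used by BERT.
```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_in, vocab_out, dim = 8000, 8000, 256     # assumed vocabulary sizes / hidden size

# Encoder: input tokens -> fixed-size state vectors (the "abstract intermediate language")
enc_tokens = layers.Input(shape=(None,), dtype="int32")
enc_emb = layers.Embedding(vocab_in, dim)(enc_tokens)
_, state_h, state_c = layers.LSTM(dim, return_state=True)(enc_emb)

# Decoder: starts from the encoder states and generates the output sequence
dec_tokens = layers.Input(shape=(None,), dtype="int32")
dec_emb = layers.Embedding(vocab_out, dim)(dec_tokens)
dec_out, _, _ = layers.LSTM(dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
logits = layers.Dense(vocab_out)(dec_out)      # one score per target-vocabulary word

model = tf.keras.Model([enc_tokens, dec_tokens], logits)
model.summary()
```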
### Attention
> The attention-mechanism looks at an input sequence and decides at each step which other parts of the sequence are important.
Simply put: when reading an article you focus on the word in front of you, but your brain keeps comparing and combining it with the earlier words, and that is how you grasp the "meaning" of the passage.
Applied to Seq2Seq, when the Encoder converts a sequence into the made-up language it does not just convert the sequence itself; it also carries over the "keywords" that represent the meaning of that sequence (for example as weighting values), so the Decoder can use this information to reconstruct the sequence that best matches the meaning.
- Attention mechanism: when the Decoder produces an output, it attends to the Encoder's outputs to obtain context information.
- Self-attention mechanism: when the Encoder and Decoder produce outputs, they attend to the other elements of their own sequence to obtain contextual information.
    - Use Q to do attention against every K (there are many possible attention functions), producing an attention weight vector.
    - Pass the attention weight vector through a softmax function for normalization.
    - Take the weighted sum (dot product) of the normalized weight vector and the value vectors to produce the output.
- Multi-Head Attention:
    - Q, K and V are each split into n parts, and each part only does self-attention with its corresponding part. The idea is that each head focuses on something different, so the model can attend to the details it cares about more precisely.

> $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left( \dfrac{QK^T}{\sqrt{d_k}} \right)V$
- `Q`: matrix containing the queries
    - vector representation of one word in the sequence
- `K`: the keys
    - vector representations of all the words in the sequence
- `V`: the values
    - again the vector representations of all the words in the sequence
- `d_k`: the dimension of the query/key vectors, used to scale the dot products (see the NumPy sketch below)
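A small NumPy sketch of the formula above, with made-up shapes (4 tokens, $d_k = 8$), to show how the softmax weights combine the value vectors:
```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity of each query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax -> attention weights
    return weights @ V                                # weighted sum of the value vectors

Q = np.random.randn(4, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```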
### The Transformer

- The self-attention mechanism lets the intermediate computation run in parallel, while every output still sees all of the inputs.
- It replaces the RNN, because an RNN cannot be parallelized (it has to consume the sequence step by step).
- Seq2Seq model + self-attention
- `Nx`: the Encoder/Decoder blocks can be stacked N layers deep
    - i.e., the block is repeated N times
- `Pos encoding`: since there is no RNN, positions are used to number the elements of the sequence (see the sketch after this list).
    - These positions are added to the embedded representation (n-dimensional vector) of each word
- `Add & Norm`:
    - `Add`: attention input + attention output (a residual connection)
    - `Norm`: layer normalization applied after the addition
- `Masked Multi-Head Attention`: attention over the sequence generated so far (the decoder cannot peek at future positions)
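As a concrete example of the `Pos encoding` item above, here is a NumPy sketch of the sinusoidal positional encoding from "Attention is All You Need"; the sequence length and model dimension are arbitrary values for illustration.
```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1) positions
    i = np.arange(d_model)[None, :]                   # (1, d_model) dimension indices
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions use sin
    pe[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions use cos
    return pe                                         # added to each word's embedding

print(positional_encoding(50, 512).shape)             # (50, 512)
```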
[Not yet finished organizing: What is a Transformer?](https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04)
[Transformer (in Chinese)](https://leemeng.tw/neural-machine-translation-with-transformer-and-tensorflow2.html)
[BERT (in Chinese)](https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html)
[Attention is All You Need](https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)
---
## About BERT
**BERT**, or **B**idirectional **E**ncoder **R**epresentations from **T**ransformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
### About BERT Large
- The developers denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. They primarily report results on two model sizes (checked against `bert_config.json` in the sketch below):
- `BERT-BASE (L=12, H=768, A=12, Total Parameters=110M)` and
- `BERT-LARGE (L=24, H=1024, A=16, Total Parameters=340M)`.
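As a quick sanity check, the L/H/A values above can be read back from the `bert_config.json` inside the model archive from Step 4; the path below assumes the archive was unpacked into `bert_large/` and is only a guess at your layout.
```python
import json

# Path is an assumption; adjust it to wherever the Step 4 zip was extracted.
with open("bert_large/wwm_uncased_L-24_H-1024_A-16/bert_config.json") as f:
    cfg = json.load(f)

print(cfg["num_hidden_layers"], cfg["hidden_size"], cfg["num_attention_heads"])
# BERT-Large should report: 24 1024 16
```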
## Terms
- Corpus n. a language database (a collection of texts)
- Inference n. conclusion; in ML, running a trained model to make predictions
---
## Reference
1. [Supervised vs Unsupervised vs Reinforcement](https://www.aitude.com/supervised-vs-unsupervised-vs-reinforcement/)
2. [What’s the Difference Between Supervised, Unsupervised, Semi-Supervised and Reinforcement Learning?](https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/)
---
### Use `2>&1`
- Use it when you want to redirect both `stdout` and `stderr` to the same file.
- Essentially you are saying: redirect `stderr` to the same place `stdout` is being redirected.
```shell
$ cat foo.txt > output.txt 2>&1
```
> Use `&1` to reference the value of file descriptor 1 (`stdout`).
> - This command sends both the `stdout` and `stderr` of `cat foo.txt` to output.txt (a Python equivalent is shown after the table below).
<table>
<thead>
<tr>
<th>file descriptor</th>
<th>std</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<th>stdin</th>
</tr>
<tr>
<th>1</th>
<th>stdout</th>
</tr>
<tr>
<th>2</th>
<th>stderr</th>
</tr>
</tbody>
</table>
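For comparison, the same redirection can be expressed from Python with the `subprocess` module, where `stderr=subprocess.STDOUT` plays the role of `2>&1`:
```python
import subprocess

# Run `cat foo.txt`, sending both stdout and stderr into output.txt.
with open("output.txt", "w") as out:
    subprocess.run(["cat", "foo.txt"], stdout=out, stderr=subprocess.STDOUT)
```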
[Understanding 2>&1](https://www.brianstorti.com/understanding-shell-script-idiom-redirect/)
## NCCL (NVIDIA Collective Communications Library)
* [NVIDIA](https://developer.nvidia.com/nccl)
* [introduction](https://on-demand.gputechconf.com/gtc/2018/video/S8462/)
### [Horovod](https://horovod.readthedocs.io/en/stable/)
* Horovod combines NCCL and MPI into a wrapper for distributed deep learning in frameworks such as TensorFlow (a minimal sketch follows this list).
* It can detect whether GPUDirect RDMA makes sense in the current hardware topology and use it transparently.
* [on github](https://github.com/horovod/horovod)
* [Horovod with TensorFlow](https://horovod.readthedocs.io/en/stable/tensorflow.html)
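A minimal sketch of the Horovod-with-Keras pattern from the docs linked above; the model and learning rate are placeholders rather than the BERT fine-tuning setup, and the script is meant to be launched with `mpirun` (one process per GPU).
```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                            # one Horovod process per GPU

# Pin each process to its own GPU
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across workers (NCCL allreduce under the hood).
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]   # sync initial weights
# model.fit(dataset, callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```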
### MPI
* Message Passing Interface
* The Message Passing Interface is an API for parallel computing, commonly used to program non-shared-memory environments such as supercomputers and compute clusters (see the sketch below).
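A tiny message-passing example using `mpi4py` (an assumption here, chosen only to stay in Python); run it with something like `mpirun -np 2 python demo.py`:
```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"msg": "hello from rank 0"}, dest=1, tag=0)   # point-to-point send
elif rank == 1:
    data = comm.recv(source=0, tag=0)                        # matching receive
    print(data)
```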
## BERT Tutorial
[BERT Tutorial](https://drive.google.com/drive/u/0/folders/1aD7GDY2MCq_ZVGk9Rc40QMFpQ_WC06IL)
**BERT** is a Transformer model that:
- Is designed to learn powerful ways of encoding representations from text
- Has demonstrated state-of-the-art results on many NLP problems in many languages