# concept of NN

###### tags: `notes` `tips` `ML` `NN` `basic`

[toc]

## todo

- [QA1](https://blog.csdn.net/guanxs/article/details/105377319)
- [QA2](https://zhuanlan.zhihu.com/p/165225012)

## data loading

1. training instances
2. batches

## forward pass

1. generate a prediction
2. compare it with the target
   - loss
   - mutual information

## backward pass

1. determine how each parameter should change
   - gradient
   - back propagation

## update

1. update the parameters according to the gradient (or another scoring rule)
2. learning rate
3. step, epoch

(a minimal PyTorch loop covering these four stages is sketched in the *code sketches* section below)

## parallel execution

1. four dilemmas:
   - complicated setup instructions for DDP (Distributed Data Parallel) packages
   - performance differences between GPUs
   - restrictions of the training platform or environment (Docker, Open MPI, ...)
   - special formats of well-performing released pre-trained models
2. a solution: the **hetseq** package (a plain-DDP sketch is also given in *code sketches* below)
   - [github](https://github.com/yifding/hetseq)
   - [documentation](https://hetseq.readthedocs.io/)
   - [paper](https://arxiv.org/pdf/2009.14783)
   - basic steps: define a new Task with the corresponding Model, Dataset, Optimizer and Learning Rate Scheduler.
   ![](https://i.imgur.com/Z1stZEu.png)
   - [an example with a brief introduction](https://towardsdatascience.com/training-bert-at-a-university-eedcf940c754)

## related topics

1. pre-trained models for NLP tasks:
   - not only the BERT encoder is pre-trained; models for downstream prediction tasks are also released in pre-trained form: [bert-related-models](https://huggingface.co/transformers/model_doc/bert.html) (see the loading sketch in *code sketches* below)

## Q&A

- what are DQN, policy gradient, and A3C?
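
## code sketches

A minimal single-process PyTorch training loop tying together the four stages above (data loading → forward pass → backward pass → update). This is a sketch, not taken from any of the sources above; the toy data, model shape, and hyperparameters are placeholder assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# toy data: 1000 training instances with 10 features and binary labels (stand-in for a real dataset)
X = torch.randn(1000, 10)
y = torch.randint(0, 2, (1000,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)   # data loading: instances -> batches

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()                                        # loss: compares prediction with target
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)                  # learning rate

for epoch in range(5):                        # epoch: one pass over the whole dataset
    for batch_x, batch_y in loader:           # step: one batch
        optimizer.zero_grad()
        logits = model(batch_x)               # forward pass: generate prediction
        loss = criterion(logits, batch_y)     # compare the difference
        loss.backward()                       # backward pass: back propagation computes gradients
        optimizer.step()                      # update parameters according to the gradients
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```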
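
For the *parallel execution* section: a sketch of plain PyTorch `DistributedDataParallel` (not the hetseq API), to illustrate the kind of boilerplate that packages like hetseq aim to hide. The model, data shapes, and launch command are illustrative assumptions.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each spawned process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 2).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])       # gradients are all-reduced across GPUs
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    # each process would normally read its own shard of the data (e.g. via DistributedSampler)
    x = torch.randn(32, 10).cuda(local_rank)
    y = torch.randint(0, 2, (32,)).cuda(local_rank)

    loss = nn.functional.cross_entropy(ddp_model(x), y)   # forward pass
    loss.backward()                                        # backward pass + gradient all-reduce
    optimizer.step()                                       # identical update on every replica

    dist.destroy_process_group()

if __name__ == "__main__":
    main()   # launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
```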
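
For the *related topics* section: a sketch of loading a pre-trained BERT encoder together with a downstream-task model via Hugging Face `transformers`. The checkpoint name and label count are illustrative; note that for `bert-base-uncased` only the encoder weights are pre-trained, while the classification head is randomly initialized unless a fine-tuned checkpoint is loaded.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# encoder weights come pre-trained from the hub; the sequence-classification head is task-specific
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("HetSeq trains BERT on heterogeneous GPU systems.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)   # torch.Size([1, 2])
```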