# concepts of NN
###### tags: `notes` `tips` `ML` `NN` `basic`
[toc]
## todo
- [QA1](https://blog.csdn.net/guanxs/article/details/105377319)
- [QA2](https://zhuanlan.zhihu.com/p/165225012)
## data loading
1. training instances: individual (input, label) examples
2. batches: groups of instances processed together in one update (see the sketch below)
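
A minimal PyTorch sketch of the two ideas above, using hypothetical toy tensors: individual instances live in a `Dataset`, and a `DataLoader` groups them into batches:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# toy data (hypothetical): 100 training instances, 10 features each
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))

dataset = TensorDataset(X, y)          # one (instance, label) pair per index
loader = DataLoader(dataset, batch_size=16, shuffle=True)  # group into batches

for xb, yb in loader:                  # each iteration yields one batch
    print(xb.shape, yb.shape)          # torch.Size([16, 10]) torch.Size([16])
    break
```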
## forward pass
1. generate a prediction from the model
2. compare the prediction against the target (see the sketch below)
    - loss (e.g. cross-entropy)
    - mutual information
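
A minimal sketch of a forward pass, assuming a stand-in `nn.Linear` model and cross-entropy as the comparison measure (cross-entropy is the usual choice; information-theoretic quantities such as mutual information show up in some other objectives):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)               # minimal stand-in model (assumption)
loss_fn = nn.CrossEntropyLoss()

xb = torch.randn(16, 10)               # one batch of inputs
yb = torch.randint(0, 2, (16,))        # targets

logits = model(xb)                     # 1. generate a prediction
loss = loss_fn(logits, yb)             # 2. compare prediction with target
print(loss.item())
```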
## backward pass
1. determine how each parameter should change to reduce the loss (see the sketch below)
    - gradient
    - backpropagation
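
A minimal sketch of the backward pass, reusing the toy model above: calling `loss.backward()` runs backpropagation and stores in each parameter's `.grad` the gradient of the loss with respect to that parameter:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                        # same stand-in model as above
loss = nn.CrossEntropyLoss()(model(torch.randn(16, 10)),
                             torch.randint(0, 2, (16,)))

loss.backward()                                 # backprop fills .grad on every parameter
for name, p in model.named_parameters():
    # each gradient says how the loss changes as that parameter changes
    print(name, p.grad.shape)
```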
## update
1. update the parameters according to the gradient (or another scoring signal)
2. learning rate: scales the size of each update
3. step vs. epoch: one step = one update from one batch; one epoch = one full pass over the training set (see the sketch below)
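
A minimal sketch tying the pieces together, assuming plain SGD and a simple `StepLR` decay schedule as one possible learning-rate policy; note where a step and an epoch fall in the loop:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)          # learning rate
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

X, y = torch.randn(100, 10), torch.randint(0, 2, (100,))
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=16, shuffle=True)

for epoch in range(3):                  # one epoch = one full pass over the data
    for xb, yb in loader:               # one step  = one update from one batch
        optimizer.zero_grad()           # clear gradients from the previous step
        loss_fn(model(xb), yb).backward()
        optimizer.step()                # move parameters against the gradient
    scheduler.step()                    # decay the learning rate once per epoch
```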
## parallel execution
1. four common dilemmas:
    - complicated instructions in DDP (Distributed Data Parallel) packages (see the sketch at the end of this section)
    - performance differences between heterogeneous GPUs
    - restrictions of the training platform or environment (Docker, OpenMPI, ...)
    - special formats of well-performing released pre-trained models
2. solution: the **hetseq** package
    - [github](https://github.com/yifding/hetseq)
    - [document](https://hetseq.readthedocs.io/)
    - [paper](https://arxiv.org/pdf/2009.14783)
    - basic steps: define a new Task with the corresponding Model, Dataset, Optimizer, and Learning Rate Scheduler.

- [an example and a brief introduction](https://towardsdatascience.com/training-bert-at-a-university-eedcf940c754)
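
For contrast with hetseq, a minimal sketch of the plain PyTorch DDP setup that the first dilemma refers to; it uses the `gloo` backend so it also runs on CPU, and simulates two workers on one machine (address and port are hypothetical local defaults):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # one process per worker; gloo backend so the sketch works without GPUs
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(10, 2))   # gradients are all-reduced across ranks
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    xb, yb = torch.randn(16, 10), torch.randint(0, 2, (16,))
    optimizer.zero_grad()
    torch.nn.functional.cross_entropy(model(xb), yb).backward()
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)  # simulate 2 workers on one machine
```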
## related topics
1. pre-trained models for NLP tasks:
    - not only the base BERT encoder is released pre-trained; models with down-stream prediction heads are available too (see the sketch below): [bert-related-models](https://huggingface.co/transformers/model_doc/bert.html)
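
A minimal sketch with the `transformers` library (recent versions) showing such a down-stream model: the BERT encoder weights load pre-trained, while the classification head is newly initialized and still needs fine-tuning:

```python
# pip install transformers
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# pre-trained BERT encoder + a sequence-classification head on top;
# the head itself is randomly initialized until fine-tuned
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

inputs = tokenizer("a toy example sentence", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)                    # torch.Size([1, 2])
```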
## Q&A
- what are DQN, policy gradient, and A3C?