# CoTAI Internship - Bao L. Q. Nguyen
## 1. Personal Work Journal
https://paperswithcode.com/paper/internvideo2-scaling-video-foundation-models
https://arxiv.org/pdf/2403.15377v3
https://arxiv.org/pdf/2008.01334
https://svdbase.github.io/files/ICCV19_SVD.pdf
https://github.com/4ML-platform/ndvr
https://katna.readthedocs.io/en/latest/index.html
https://github.com/fastcatai/frame-extraction
https://github.com/titania7777/FrameExtractor
https://arxiv.org/pdf/2012.03457
https://ai.meta.com/research/publications/revisiting-feature-prediction-for-learning-visual-representations-from-video/
https://www.algorithm-archive.org/contents/affine_transformations/affine_transformations.html
https://pytorchvideo.org/docs/tutorial_classification
https://pytorchvideo.readthedocs.io/en/latest/index.html
https://github.com/facebookresearch/pytorchvideo
https://arxiv.org/pdf/1711.11248v3
https://arxiv.org/pdf/1812.03982
https://github.com/facebookresearch/SlowFast
https://arxiv.org/pdf/2004.04981
https://arxiv.org/pdf/1412.0767
https://arxiv.org/pdf/1611.02155
https://arxiv.org/pdf/1406.2199
https://arxiv.org/pdf/1212.0402
https://github.com/jeffreyyihuang/two-stream-action-recognition/tree/master
### July 30th
[defusion gpt](https://www.reddit.com/r/StableDiffusion/comments/16p8w3y/dalle_3_better_at_complex_prompts_than_stable/)
Async concept: https://fastapi.tiangolo.com/async/
Outlier Detection & Data Drifting
[Quantization](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization)
Evidently
fiftyone
https://github.com/SeldonIO/alibi-detect
free-time: bytebytego book & youtube
###
Object Detection [Car license](https://datasetninja.com)
XCiT: Cross-Covariance Image Transformers - [paper](https://arxiv.org/pdf/2106.09681) - [code](https://github.com/facebookresearch/xcit/tree/main)
Training data-efficient image transformers & distillation - Deit - [paper](https://arxiv.org/pdf/2012.12877) - [code](https://github.com/facebookresearch/deit)
[ConvNeXt](https://github.com/facebookresearch/ConvNeXt)
Scaling Up Kernels to 31x31: RepLK Net - [paper](https://arxiv.org/pdf/2203.06717) - [code](https://github.com/DingXiaoH/RepLKNet-pytorch/tree/main)
[mmDetection](https://github.com/open-mmlab/mmcv)
### July 4th
[Triton Inference Server Documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html)
[Images distribution](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver)
[Nvidia Triton Server](https://github.com/triton-inference-server/server)
[Nvidia Triton Client](https://github.com/triton-inference-server/client)
[GRPC protocol](https://grpc.io/docs/what-is-grpc/)
[Grafana](https://grafana.com)
[Prometheus](https://prometheus.io)
### July 3rd
[Non-maximum suppression](https://viblo.asia/p/tim-hieu-va-trien-khai-thuat-toan-non-maximum-suppression-bJzKmr66Z9N)
[Non-maximum suppression tricks by YOLOv5](https://github.com/ultralytics/yolov5/tree/master)
Tokenizer techniques:
- [Hugging Face Tutorials](https://huggingface.co/docs/transformers/en/main_classes/tokenizer)
- [Tokenizer Andrej Karpathy](https://youtu.be/zduSFxRajkE?si=o6VlJaGkIIAmGo-P)
- [ViSoBERT Hugging Face](https://huggingface.co/uitnlp/visobert) - [Paper](https://aclanthology.org/2023.emnlp-main.315.pdf)
- [UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image Captioning](https://arxiv.org/pdf/2002.00175)
- [UIT NLP](https://nlp.uit.edu.vn/research)
### July 2nd
[D2L object detection](https://d2l.ai/chapter_computer-vision/bounding-box.html)
[Anchor Boxes](https://d2l.ai/chapter_computer-vision/anchor.html)
[Single Shot Detector](https://developers.arcgis.com/python/guide/how-ssd-works/)
[Deep Learning Bible - 4. Object Detection - Eng.](https://wikidocs.net/book/8119)
[O'Reilly object detection](https://learning.oreilly.com/course/computer-vision-python/9781800567481/)
### July 1st
[FeatUp: A Model-Agnostic Framework for Features at Any Resolution](https://mhamilton.net/featup.html)
[NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis](https://arxiv.org/pdf/2003.08934) $\rightarrow$ [3D Deep Learning](http://3ddl.stanford.edu)
[Joint Bilateral Upsampler](https://www.researchgate.net/publication/220184258_Joint_bilateral_upsampling)
[Mark T. Hamilton](https://mhamilton.net)
[Gaussian Likelihood Loss](https://arxiv.org/pdf/2007.06059)
[Multiscale Object Detection](https://d2l.ai/chapter_computer-vision/multiscale-object-detection.html)
[Multiscale Anchor Boxes](https://www.oreilly.com/library/view/practical-machine-learning/9781098102357/ch04.html#a_feature_pyramid_network_in_detaildot_f)
### June 21st
[Detectron](https://github.com/facebookresearch/Detectron/tree/main)
[mmdetection](https://github.com/open-mmlab/mmdetection)
[PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.7)
[SSD Single Shot Multibox Detector](https://arxiv.org/pdf/1512.02325)
[DETRs Beat YOLOs on Real-time Object Detection](https://arxiv.org/pdf/2304.08069)
[You Only Look Once](https://arxiv.org/pdf/1506.02640)
[SSD Slide by Caffe](http://www.cs.unc.edu/~wliu/papers/ssd_eccv2016_slide.pdf)
[Faster R-CNN](https://arxiv.org/pdf/1506.01497v3)
[ConvNeXt (A ConvNet for the 2020s)](https://arxiv.org/pdf/2201.03545)
### June 20th
- [Review: Pre-Activation ResNet with Identity Mapping — Over 1000 Layers Reached (Image Classification)](https://freedium.cfd/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fresnet-with-identity-mapping-over-1000-layers-reached-image-classification-bb50a42af03e)
- [Identity Mappings in Deep Residual Networks](https://arxiv.org/pdf/1603.05027) $\rightarrow$ math proof
- Identity Trick
- [VAE](https://github.com/MatchLab-Imperial/deep-learning-course/blob/master/07_VAE_GAN.ipynb)
- [VAE celebA](https://colab.research.google.com/github/goodboychan/goodboychan.github.io/blob/main/_notebooks/2021-09-14-03-Variational-AutoEncoder-Celeb-A.ipynb)
- [Goodboychan courses](https://goodboychan.github.io)
- [ResNet Strikes Back (timm)](https://arxiv.org/pdf/2110.00476)
- [GMVAE Pytorch](https://colab.research.google.com/drive/1jGOAgwleppSMtUsr7XaldRNBbiwBMhxd#scrollTo=rs2-BLGkfp8m)
### June 19th
[PaletteNet: Image Recolorization with Given Color Palette](https://www.researchgate.net/publication/319277684_PaletteNet_Image_Recolorization_with_Given_Color_Palette)
[Open Model DB](https://openmodeldb.info/?t=arch%3Aesrgan)
Task: Code a simple autoencoder for MNIST reconstruction.
[Deconvolution](https://www.matthewzeiler.com/mattzeiler/deconvolutionalnetworks.pdf)
- Separable convolution is not always faster when the input resolution is large and the depth (number of features) is small.
- Write out the math and prove the BatchNorm + convolution fusion (the reparameterization used in RepVGG).
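
The BatchNorm + convolution fusion can be checked numerically before writing the proof. The sketch below is a minimal illustration, not the RepVGG code: it assumes a 1x1 convolution evaluated at a single pixel (so the convolution is just a matrix-vector product) and inference-mode BatchNorm with fixed running statistics. All function names and toy values are made up for this example.

```python
import math

def conv1x1(x, weight, bias):
    """1x1 convolution at a single pixel: a matrix-vector product plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def batchnorm_eval(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BatchNorm using fixed running statistics per channel."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]

def fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the conv: W' = s * W, b' = s * (b - mean) + beta,
    with the per-channel scale s = gamma / sqrt(var + eps)."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    fused_weight = [[s * w for w in row] for s, row in zip(scale, weight)]
    fused_bias = [s * (b - m) + bt
                  for s, b, m, bt in zip(scale, bias, mean, beta)]
    return fused_weight, fused_bias

# Toy values (made up for illustration): 3 input channels, 2 output channels.
x = [0.5, -1.0, 2.0]
weight = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5]]
bias = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.05, -0.3]
mean, var = [0.3, -0.1], [0.9, 1.6]

reference = batchnorm_eval(conv1x1(x, weight, bias), gamma, beta, mean, var)
fused_weight, fused_bias = fuse_conv_bn(weight, bias, gamma, beta, mean, var)
fused = conv1x1(x, fused_weight, fused_bias)
max_gap = max(abs(r - f) for r, f in zip(reference, fused))
print(f"max |BN(conv(x)) - fused_conv(x)| = {max_gap:.2e}")
```

The same per-channel scaling applies to larger kernels, because BatchNorm acts independently on each output channel.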
### June 18th
- [Encouraging Categorical Meaning in the Latent Space of a VAE](https://www.nathanblair.me/pdfs/Encouraging_categorical_meaning_in_the_latent_space_of_a_VAE.pdf)
- [Gaussian mixture variational autoencoders](https://arxiv.org/pdf/1611.02648)
- [GMVAE](https://github.com/jariasf/GMVAE/tree/master)
### June 15th
- [Memory Access Cost](https://arxiv.org/pdf/1807.11164)
- [MobileOne](https://arxiv.org/pdf/2206.04040): An Improved One millisecond Mobile Backbone
- Learn [Probabilistic Deep Learning](https://www.coursera.org/learn/probabilistic-deep-learning-with-tensorflow2/lecture/ULMEk/welcome-to-probabilistic-deep-learning-with-tensorflow-2)
- Survey the general approaches for making models fast and lightweight for mobile.
- [NYU Deep Learning](https://atcold.github.io/NYU-DLSP21/)
- [Gaussian mixture variational autoencoders](https://arxiv.org/pdf/1611.02648)
- [ShuffleNetV2](https://arxiv.org/pdf/1807.11164)
- Finetuning $\rightarrow$ consider dropout, batchnorm, and which modules to freeze/unfreeze.
- Divergence loss
### June 14th
Visualization: [DeepInsight](https://github.com/deepinsight/insightface) [Deep visualization toolbox](https://www.youtube.com/watch?v=AgkfIQ4IGaM&t=2s) application ([DeepVis](https://yosinski.com/deepvis)), [Grad-CAM](https://keras.io/examples/vision/grad_cam/)
- [Mobilenet](https://arxiv.org/pdf/1704.04861) (inverted bottleneck, new activation, depthwise separable conv)
- [Shufflenet](https://arxiv.org/pdf/1707.01083) (what is pointwise conv, channel shuffle)
- [Squeeze & excitation](https://arxiv.org/pdf/1709.01507) (attention mechanism)
- [Efficientnet](https://arxiv.org/pdf/1905.11946) (scaling the network)
- [Ghostnet](https://arxiv.org/pdf/1911.11907) (what is strong feature)
- [RepVGG](https://arxiv.org/pdf/2101.03697): MobileOne (reparameterization, best practice of MobileOne, similar to MobileNetV4)
- [Decoupled Weight Decay Regularization](https://arxiv.org/pdf/1711.05101)
- [Generative Adversarial Nets](https://arxiv.org/pdf/1406.2661): [Keras tutorial](https://www.analyticsvidhya.com/blog/2021/06/a-detailed-explanation-of-gan-with-implementation-using-tensorflow-and-keras/) - [Pytorch tutorial](https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html) - [Gradio experiment](https://www.gradio.app/guides/create-your-own-friends-with-a-gan) - [NotExist](https://thispersondoesnotexist.com)
- [Emojize human face](https://arxiv.org/pdf/1611.02200) + [3D Generative-Adversarial Modeling](https://arxiv.org/pdf/1610.07584)
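
To make the depthwise separable convolution from the MobileNet item concrete, here is a rough multiply-add count. It is a sketch under simple assumptions (stride 1, "same" padding so the output keeps the input's spatial size, no bias), and the layer sizes are made up. Note that multiply-adds are only a proxy: as the ShuffleNetV2 paper on memory access cost argues, a lower count does not guarantee lower latency.

```python
def standard_conv_madds(h, w, c_in, c_out, k):
    """Multiply-adds of a standard k x k convolution on an h x w feature map."""
    return h * w * c_in * c_out * k * k

def separable_conv_madds(h, w, c_in, c_out, k):
    """Multiply-adds of a depthwise k x k conv followed by a pointwise 1x1 conv."""
    depthwise = h * w * c_in * k * k      # one k x k filter per input channel
    pointwise = h * w * c_in * c_out      # 1x1 conv mixes the channels
    return depthwise + pointwise

# Hypothetical layer: 56 x 56 feature map, 128 -> 128 channels, 3 x 3 kernel.
h, w, c_in, c_out, k = 56, 56, 128, 128, 3
standard = standard_conv_madds(h, w, c_in, c_out, k)
separable = separable_conv_madds(h, w, c_in, c_out, k)
ratio = separable / standard

print(f"standard:  {standard:,} multiply-adds")
print(f"separable: {separable:,} multiply-adds")
print(f"ratio:     {ratio:.4f} (equals 1/c_out + 1/k^2)")
```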
### June 13th
I finished implementing the Inception module, GoogLeNet, ResNet, and DenseNet from scratch in simple code that beginners can follow, and compared them with the torchvision implementations in this notebook. I also noticed that the standard DenseNet variants have $\geq 100$ layers, yet the model size stays very lightweight. Additionally, I found a technique for [memory-efficient DenseNet](https://arxiv.org/pdf/1711.09224) and a technique called [gradient checkpointing](https://freedium.cfd/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9).
### June 12th
I was introduced to the definition of effective receptive field by reading the paper [Understanding the Effective Receptive Field in Deep Convolutional Neural Networks](https://arxiv.org/pdf/1701.04128), which is full of mathematical formulations, and this [explanation blog](https://freedium.cfd/understanding-the-effective-receptive-field-in-deep-convolutional-neural-networks-b2642297927e). Then, I implemented [this notebook](https://colab.research.google.com/drive/1p3w5jiaEBqVr6Rwi5X70lP4YAGic0FzT?usp=sharing) to visualize the receptive fields of given pretrained models to see which input pixel $x_i$ in the image $X \in \mathbb{R}^{c \times h \times w}$ affects the output the most (which region determines the output result). Additionally, I learned to follow well-known repositories (e.g., [timm](https://github.com/huggingface/pytorch-image-models)).
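
As a companion to the visualization, the theoretical receptive field of a plain stack of convolution/pooling layers can be computed with a short recurrence. This is a sketch that assumes no dilation and a purely sequential network; the layer lists are hypothetical. It gives the theoretical field only; the paper's point is that the effective receptive field is smaller and roughly Gaussian-shaped inside it.

```python
def theoretical_receptive_field(layers):
    """Receptive field of one output unit for a sequential stack of layers.

    layers: list of (kernel_size, stride) pairs, first layer first.
    rf grows by (kernel_size - 1) * jump, where jump is the distance in
    input pixels between two neighbouring units of the current feature map.
    """
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

three_3x3 = [(3, 1), (3, 1), (3, 1)]   # three stacked 3x3 stride-1 convs
stem = [(7, 2), (3, 2)]                # 7x7 stride-2 conv, then 3x3 stride-2 pooling

print(theoretical_receptive_field(three_3x3))  # 7
print(theoretical_receptive_field(stem))       # 11
```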
### June 11th
I learned about Knowledge Distillation in deep learning training, where a student model (a smaller version) is trained to mimic a teacher model (a large model), resulting in a lightweight model with accuracy comparable to the large one. I was introduced to this [documentation from Neural Compressor](https://github.com/intel/neural-compressor/blob/master/docs/source/distillation.md). Additionally, I was introduced to [DistilBERT](https://arxiv.org/pdf/1910.01108) ([code](https://huggingface.co/docs/transformers/en/model_doc/distilbert)) and [MEAL V2 (boosting Vanilla ResNet 50 to 80%)](https://arxiv.org/pdf/2009.08453) ([code](https://github.com/szq0214/MEAL-V2)).
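
A minimal sketch of a distillation loss, assuming the common soft-target formulation (temperature-softened teacher and student distributions compared with KL divergence, mixed with the usual cross-entropy on the hard label). This is an assumption for illustration, not necessarily what Neural Compressor, DistilBERT, or MEAL V2 implement; the function names, `temperature`, and `alpha` are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """alpha * hard-label cross-entropy + (1 - alpha) * T^2 * soft-target KL."""
    hard = -math.log(softmax(student_logits)[label])
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft

# Toy logits for a 3-class problem (made up for illustration).
teacher = [4.0, 1.0, -2.0]
student = [2.0, 1.5, 0.0]
print(f"distillation loss: {distillation_loss(student, teacher, label=0):.4f}")
```

The `temperature ** 2` factor keeps the gradient magnitude of the soft term comparable across temperatures.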
### June 10th
I had the chance to review what I had learned about model acceleration and prepare for future learning about monitoring/MLOps and deep learning on mobile. I was assigned the task of writing documentation (both theory and code) on [model acceleration](https://hackmd.io/@bao-internship/model_acceleration) for reusable purposes. I was also introduced to the simple monitoring library [W&B](https://colab.research.google.com/drive/1aGdTNoeRUzyKiFw3dVXBY_V1QQWpZhW_?usp=sharing).
### June 9th
I learned about Docker on Linux/WSL and installed CUDA/cuDNN for the TensorRT backend. I encountered numerous bugs since the Docker image I used did not automatically install cuDNN compatible with TensorRT's requirements. Finally, after manually installing cuDNN and copying the necessary `*.h` files into the environment path, I successfully ran TensorRT. Then, I learned to implement the `Dockerfile` to build the Docker image and the `docker-compose.yml` file for building the container. It was an exhausting process to identify and fix the bugs. See more at [this documentation](https://hackmd.io/@bao-internship/Docker-linux) written by me.
### June 8th
I was assigned a task to convert my model into TorchScript and compile it with different backends (e.g., OpenVINO, Torch-TensorRT) for faster inference. I also learned about Triton from OpenAI, a language and compiler for writing custom GPU kernels that can accelerate matmul operations in Transformers with a large number of tokens, which can be applied to my model. The cross-entropy trick ([Log-Sum-Exp](https://www.youtube.com/watch?v=MZ2VM32h37g)) in the Triton implementation is also intriguing.
[Torch Script](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html), [Torch Script Medium](https://freedium.cfd/hihuaweizhu/key-points-to-grasp-for-torchscript-beginners-c02cf94aaa50), [Torch Compiler](https://github.com/pytorch/pytorch), [Torch TensorRT](https://github.com/pytorch/TensorRT), [Triton OpenAI](https://openai.com/research/triton), [Triton Code](https://github.com/triton-lang/triton), [Reddit Triton](https://www.reddit.com/r/OpenAI/comments/18nf310/openai_triton_coursetutorial_recommendations/), [Triton-lang Tutorial](https://triton-lang.org/main/getting-started/tutorials/index.html), [Optuna](https://optuna.org)
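
The Log-Sum-Exp trick mentioned above can be shown in a few lines of plain Python, independent of Triton. This is a sketch of the numerical idea only (subtract the maximum logit before exponentiating), not the Triton kernel; the function names are made up.

```python
import math

def logsumexp(logits):
    """log(sum(exp(z))) computed stably by factoring out the largest logit."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log softmax(logits)[target]."""
    return logsumexp(logits) - logits[target]

# Large logits: the naive exp(1000.0) would overflow a float.
large_logits = [1000.0, 1001.0, 1002.0]
try:
    math.exp(large_logits[0])
    naive_overflows = False
except OverflowError:
    naive_overflows = True

print("naive exp overflows:", naive_overflows)
print(f"stable cross-entropy: {cross_entropy(large_logits, 2):.5f}")
```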
### June 7th
I was tasked with preparing the solution for [assignment 1 (DL4AI)](https://colab.research.google.com/drive/167aJsLd98Xht-p1QV-tCHNZiAWucb1yE?usp=sharing). The requirements included following the code format of Trainer classes written in Hugging Face. I mimicked the way torchmetrics implement metric computation to maintain clean code in the Trainer. Additionally, I realized that while complex coding is not difficult, simplifying complex concepts for learners is really difficult. Here is the [simplified version](https://colab.research.google.com/drive/1GqCyFUPNe_EwSzznTTzqM6HYDHPZRzjt?usp=sharing) with text explanation suitable for beginners.
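
The torchmetrics-style pattern referred to above separates accumulating state per batch (`update`) from producing the final value (`compute`), with `reset` between epochs. The sketch below mirrors that pattern in plain Python; `RunningAccuracy` is a hypothetical class written for illustration, not the assignment solution or a torchmetrics class.

```python
class RunningAccuracy:
    """Accumulates correct/total counts across batches, torchmetrics-style."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Clear the accumulated state, e.g. at the start of each epoch."""
        self.correct = 0
        self.total = 0

    def update(self, preds, targets):
        """Add one batch of predicted and true class indices."""
        self.correct += sum(int(p == t) for p, t in zip(preds, targets))
        self.total += len(targets)

    def compute(self):
        """Return accuracy over everything seen since the last reset."""
        return self.correct / self.total if self.total else 0.0

# Toy "epoch" of two batches (made-up predictions and labels).
batches = [([1, 0, 1], [1, 1, 1]), ([0, 0], [0, 0])]
metric = RunningAccuracy()
for preds, targets in batches:
    metric.update(preds, targets)
epoch_accuracy = metric.compute()
print(f"epoch accuracy: {epoch_accuracy:.2f}")
```

Keeping the counting inside the metric object means the training loop only calls `update` and `compute`, which is what keeps the Trainer code short.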
### June 6th
I was introduced to a wealth of materials, tools, and skills frameworks to enhance my engineering abilities. I also explored two papers in computer vision, one of them on deploying a semantic segmentation model on mobile, and both involving customized Transformer architectures. The DINO paper is particularly useful for understanding how high-norm tokens create artifacts in feature maps. The technique of adding additional tokens like $\text{[CLS]}$ or $\text{[REG]}$ is remarkable. I was also introduced to [PEFT](https://github.com/huggingface/peft) for finetuning.
[Hugging Face](https://huggingface.co), [Timm](https://github.com/rwightman/pytorch-image-models), [PyTorch Image Models](https://github.com/huggingface/pytorch-image-models), [W&B](https://wandb.ai/site), [Torch Segmentation](https://segmentation-models-pytorch.readthedocs.io/en/latest/index.html), [Albumentations](https://albumentations.ai), [SegFormer](https://arxiv.org/pdf/2105.15203), [SeaFormer](https://arxiv.org/pdf/2301.13156), [Detectron2](https://github.com/facebookresearch/detectron2), [Gradient Accumulation](https://aman.ai/primers/ai/grad-accum-checkpoint/#:~:text=Gradient%20accumulation%20is%20a%20technique,after%20each%20batch%20of%20data.), [Fine-tune LLMs](https://lightning.ai/blog/gradient-accumulation/), [Torch Metrics](https://lightning.ai/docs/torchmetrics/stable/)
### June 5th
I learned about half/mixed precision, quantization, and how LoRA works. The idea behind LoRA (low-rank decomposition for better efficiency) is similar to the convolution factorization methods in the Inception paper. I now understand modern fine-tuning techniques, especially prefix-tuning with prefix tokens for LLMs. I gained a broader perspective on the computer vision field, especially how ConvNets re-emerged after ViT. The BackProp blog and the Recipe for Training NNs have prepared me to maintain a more professional engineering codebase.
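
A small sketch of the LoRA idea, following the formulation in the LoRA paper: the frozen weight $W$ is left untouched and a low-rank update $\frac{\alpha}{r} B A$ is added, with $B$ initialized to zero so training starts from the pretrained behavior. The code uses plain Python lists and toy dimensions; all names and sizes are chosen for illustration and this is not the PEFT implementation.

```python
import random

def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(x, w, a, b, alpha):
    """h = W x + (alpha / r) * B (A x), where r is the number of rows of A."""
    rank = len(a)
    base = matvec(w, x)
    low_rank = matvec(b, matvec(a, x))
    return [h + (alpha / rank) * d for h, d in zip(base, low_rank)]

def lora_param_counts(d_in, d_out, rank):
    """(parameters in the full weight, trainable parameters in A and B)."""
    return d_in * d_out, rank * (d_in + d_out)

random.seed(0)
d_in, d_out, rank = 6, 4, 2
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
b = [[0.0] * rank for _ in range(d_out)]                                 # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B = 0 the adapted layer reproduces the frozen layer exactly.
unchanged = lora_forward(x, w, a, b, alpha=16) == matvec(w, x)
print("output unchanged at initialization:", unchanged)
print(lora_param_counts(4096, 4096, 8))  # (16777216, 65536)
```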
### June 4th
I delved deeper into the process of training a PyTorch deep learning model and converting it to the ONNX format for improved inference time and size efficiency. The system design book from ByteByteGo is fascinating. Additionally, I discovered the field of model compression with numerous papers, including LoRA.
### June 3rd
I gained a deep understanding of the process: math → code → principles → mindmap → visualization → tinkering. This approach enhances my math learning experience, making it more engaging. I also learned about the backend processing components and how to write a professional application using a layered architecture.
## 2. Jotted-down Key Points
- [ ] Read + Explore 5 papers
- [x] Inception, DenseNet, ResNet model [coding from scratch](https://colab.research.google.com/drive/1HLPsbr4AbUXTcfpIOgDET80ioIuyVLHJ?usp=sharing) from diagrams.
- [x] Write [model acceleration documentation](https://hackmd.io/@bao-internship/model_acceleration)
- [ ] Model deployment with Triton Nvidia
- [x] Compile [Torch Script](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html), [Torch Compiler OpenVINO](https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html), [Torch-TensorRT](https://github.com/pytorch/TensorRT), and [NVIDIA benchmark](https://developer.nvidia.com/blog/accelerating-inference-up-to-6x-faster-in-pytorch-with-torch-tensorrt/).
- [ ] Optimize the Transformer block in my model with [OpenAI/Triton](https://openai.com/index/triton/)
- [x] Create the structure/guideline/code of the [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer) for DL4AI.
- [x] Revise Linear Algebra and Multivariable Calculus.
- [x] Tinkering with math ([formulation](https://bit.ly/4aO8zoQ)) ([tinkering](https://github.com/kyle-paul/math_foundation))
- [x] Revisualize the rotation by an orthogonal matrix $U$ with respect to a set of vectors.
- [x] Implement a process in which the matrix $U$ can be modified to create different effects (e.g., flipping, rotation by $45^\circ$, ...) $\rightarrow$ know which matrix $U$ achieves a certain change.
- [x] Check what happens if $U$ is not an orthogonal matrix.
- [x] Explore applications of orthogonal matrices: linear transformation, principal component analysis, etc.
- [ ] Explore the D3 JavaScript library for data visualization.
- [x] Read 3 blogs about backend processing & get familiar with some dev terminology (e.g., router, DI (dependency injection), DTO (data transfer object), etc.).
- [x] [Layered Architecture and Design Patterns](https://freedium.cfd/https://levelup.gitconnected.com/write-python-apps-using-layered-architecture-and-design-patterns-75cb29b20c99)
- [x] [Repository Pattern](https://freedium.cfd/global-identity-2?redirectUrl=https%3A%2F%2Fpython.plainenglish.io%2Frepository-pattern-is-insane-if-you-know-how-to-use-it-properly-python-88a05f03a50c)
- [x] [Validator Pattern](https://freedium.cfd/global-identity-2?redirectUrl=https%3A%2F%2Flevelup.gitconnected.com%2Fvalidator-pattern-do-you-know-how-to-validate-your-data-properly-50edc5b3c6c6)
- [ ] All about ONNX:
- [x] Reimplement/retrain my model RotCAtt-TransUNet++ for compatible ONNX conversion.
- [x] Compress ONNX model with [neural-compressor](https://github.com/intel/neural-compressor/tree/master) (Intel)
- [x] Compress ONNX model with [onnxconverter-common](https://github.com/microsoft/onnxconverter-common/tree/master) (Microsoft)
- [x] Run inference on the ONNX model with [OpenVINO](https://docs.openvino.ai/2024/home.html)
- [ ] Install and run inference with the [ONNX Runtime TensorRT execution provider](https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html) (reinstall CUDA/cuDNN globally)
- [ ] Learn and practice [System Design Theory](https://bytebytego.com/courses/system-design-interview/foreword)
- [ ] Explore documents and code base of above libs & [Napari](https://github.com/napari/napari).
- [ ] Read papers & [take notes](https://hackmd.io/@BouBou/BkjTYH6ER) & explore code of:
- [x] [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) $\rightarrow$ [PEFT](https://huggingface.co/docs/peft/main/en/developer_guides/lora)
- [x] [Parameter-Efficient Transfer Learning for NLP](https://arxiv.org/pdf/1902.00751)
- [x] [Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning](https://arxiv.org/pdf/2012.13255)
- [ ] [MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning](https://arxiv.org/pdf/2405.12130)
- [ ] [QLORA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314)
- [x] [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/pdf/2101.00190)
- [ ] [Visual Instruction Tuning](https://arxiv.org/pdf/2304.08485)
- [ ] [Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs](https://arxiv.org/pdf/2401.06209)
- [x] [Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/pdf/1512.00567)
- [ ] [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/pdf/1905.11946)
- [x] [Vision Transformers Need Registers](https://arxiv.org/pdf/2309.16588)
- [x] [CLIP: Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/pdf/2103.00020)
- [x] [A ConvNet for the 2020s](https://arxiv.org/pdf/2201.03545)
- [ ] [(OpenCLIP) Reproducible scaling laws for contrastive language-image learning](https://arxiv.org/pdf/2212.07143)
- [x] [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)
- [x] Understand [BackProp](https://freedium.cfd/@karpathy/yes-you-should-understand-backprop-e2f06eab496b#.8ao38h4o1) to debug my model.
- [x] Read "[A Recipe for Training Neural Networks](https://karpathy.github.io/2019/04/25/recipe/)"
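
For the orthogonal-matrix items in the checklist above, here is a small 2D sketch of what was tinkered with: a rotation and a flip are orthogonal ($U^\top U = I$) and preserve vector lengths, while a non-orthogonal matrix such as a shear does not. This is an independent illustration in plain Python, not the code from the linked repository; the helper names are made up.

```python
import math

def rotation(theta_deg):
    """2x2 counter-clockwise rotation matrix for an angle in degrees."""
    t = math.radians(theta_deg)
    return [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]

def apply(u, v):
    """Multiply the 2x2 matrix u with the 2D vector v."""
    return [u[0][0] * v[0] + u[0][1] * v[1], u[1][0] * v[0] + u[1][1] * v[1]]

def is_orthogonal(u, tol=1e-9):
    """Check U^T U = I by comparing dot products of the columns of U."""
    cols = [[u[0][j], u[1][j]] for j in range(2)]
    for i in range(2):
        for j in range(2):
            dot = cols[i][0] * cols[j][0] + cols[i][1] * cols[j][1]
            if abs(dot - (1.0 if i == j else 0.0)) > tol:
                return False
    return True

def norm(v):
    return math.hypot(v[0], v[1])

flip = [[1.0, 0.0], [0.0, -1.0]]    # reflection across the x-axis
shear = [[1.0, 1.0], [0.0, 1.0]]    # not orthogonal
v = [3.0, 4.0]                      # length 5

for name, u in [("rotate 45", rotation(45)), ("flip", flip), ("shear", shear)]:
    print(f"{name:10s} orthogonal={is_orthogonal(u)} "
          f"|Uv|={norm(apply(u, v)):.4f}")
```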