# CoTAI Internship - Bao L. Q. Nguyen

## 1. Personal Work Journal

- https://paperswithcode.com/paper/internvideo2-scaling-video-foundation-models
- https://arxiv.org/pdf/2403.15377v3
- https://arxiv.org/pdf/2008.01334
- https://svdbase.github.io/files/ICCV19_SVD.pdf
- https://github.com/4ML-platform/ndvr
- https://katna.readthedocs.io/en/latest/index.html
- https://github.com/fastcatai/frame-extraction
- https://github.com/titania7777/FrameExtractor
- https://arxiv.org/pdf/2012.03457
- https://ai.meta.com/research/publications/revisiting-feature-prediction-for-learning-visual-representations-from-video/
- https://www.algorithm-archive.org/contents/affine_transformations/affine_transformations.html
- https://pytorchvideo.org/docs/tutorial_classification
- https://pytorchvideo.readthedocs.io/en/latest/index.html
- https://github.com/facebookresearch/pytorchvideo
- https://arxiv.org/pdf/1711.11248v3
- https://arxiv.org/pdf/1812.03982
- https://github.com/facebookresearch/SlowFast
- https://arxiv.org/pdf/2004.04981
- https://arxiv.org/pdf/1412.0767
- https://arxiv.org/pdf/1611.02155
- https://arxiv.org/pdf/1406.2199
- https://arxiv.org/pdf/1212.0402
- https://github.com/jeffreyyihuang/two-stream-action-recognition/tree/master

### July 30th

- [diffusion gpt](https://www.reddit.com/r/StableDiffusion/comments/16p8w3y/dalle_3_better_at_complex_prompts_than_stable/)
- Async concept: https://fastapi.tiangolo.com/async/
- Outlier Detection & Data Drifting
- [Quantization](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization)
- Evidently
- fiftyone
- https://github.com/SeldonIO/alibi-detect
- Free time: bytebytego book & youtube

### Object Detection

- [Car license](https://datasetninja.com)
- XCiT: Cross-Covariance Image Transformers
  - [paper](https://arxiv.org/pdf/2106.09681)
  - [code](https://github.com/facebookresearch/xcit/tree/main)
- Training data-efficient image transformers & distillation - DeiT
  - [paper](https://arxiv.org/pdf/2012.12877)
  - [code](https://github.com/facebookresearch/deit)
- [ConvNeXt](https://github.com/facebookresearch/ConvNeXt)
- Scaling Up Kernels to 31x31: RepLKNet
  - [paper](https://arxiv.org/pdf/2203.06717)
  - [code](https://github.com/DingXiaoH/RepLKNet-pytorch/tree/main)
- [mmDetection](https://github.com/open-mmlab/mmcv)

### July 4th

- [Triton Inference Server Documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html)
- [Images distribution](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver)
- [NVIDIA Triton Server](https://github.com/triton-inference-server/server)
- [NVIDIA Triton Client](https://github.com/triton-inference-server/client)
- [gRPC protocol](https://grpc.io/docs/what-is-grpc/)
- [Grafana](https://grafana.com)
- [Prometheus](https://prometheus.io)

### July 3rd

- [Non-maximum suppression](https://viblo.asia/p/tim-hieu-va-trien-khai-thuat-toan-non-maximum-suppression-bJzKmr66Z9N)
- [Non-maximum suppression tricks by YOLOv5](https://github.com/ultralytics/yolov5/tree/master)
- Tokenizer techniques:
  - [Hugging Face Tutorials](https://huggingface.co/docs/transformers/en/main_classes/tokenizer)
  - [Tokenizer Andrej Karpathy](https://youtu.be/zduSFxRajkE?si=o6VlJaGkIIAmGo-P)
  - [ViSoBERT Hugging Face](https://huggingface.co/uitnlp/visobert)
  - [Paper](https://aclanthology.org/2023.emnlp-main.315.pdf)
  - [UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image Captioning](https://arxiv.org/pdf/2002.00175)
  - [UIT NLP](https://nlp.uit.edu.vn/research)

### July 2nd

- [D2L object detection](https://d2l.ai/chapter_computer-vision/bounding-box.html)
- [Anchor Boxes](https://d2l.ai/chapter_computer-vision/anchor.html)
- [Single Shot Detector](https://developers.arcgis.com/python/guide/how-ssd-works/)
- [Deep Learning Bible - 4. Object Detection - Eng.](https://wikidocs.net/book/8119)
- [O'Reilly object detection](https://learning.oreilly.com/course/computer-vision-python/9781800567481/)

### July 1st

- [FeatUp: A Model-Agnostic Framework for Features at Any Resolution](https://mhamilton.net/featup.html)
- [NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis](https://arxiv.org/pdf/2003.08934) $\rightarrow$ [3D Deep Learning](http://3ddl.stanford.edu)
- [Joint Bilateral Upsampler](https://www.researchgate.net/publication/220184258_Joint_bilateral_upsampling)
- [Mark T. Hamilton](https://mhamilton.net)
- [Gaussian Likelihood Loss](https://arxiv.org/pdf/2007.06059)
- [Multiscale Object Detection](https://d2l.ai/chapter_computer-vision/multiscale-object-detection.html)
- [Multiscale Anchor Boxes](https://www.oreilly.com/library/view/practical-machine-learning/9781098102357/ch04.html#a_feature_pyramid_network_in_detaildot_f)

### June 21st

- [Detectron](https://github.com/facebookresearch/Detectron/tree/main)
- [mmdetection](https://github.com/open-mmlab/mmdetection)
- [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.7)
- [SSD: Single Shot MultiBox Detector](https://arxiv.org/pdf/1512.02325)
- [DETRs Beat YOLOs on Real-time Object Detection](https://arxiv.org/pdf/2304.08069)
- [You Only Look Once](https://arxiv.org/pdf/1506.02640)
- [SSD Slide by Caffe](http://www.cs.unc.edu/~wliu/papers/ssd_eccv2016_slide.pdf)
- [Faster R-CNN](https://arxiv.org/pdf/1506.01497v3)
- [ConvNeXt](https://arxiv.org/pdf/2201.03545)

### June 20th

- [Review: Pre-Activation ResNet with Identity Mapping — Over 1000 Layers Reached (Image Classification)](https://freedium.cfd/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fresnet-with-identity-mapping-over-1000-layers-reached-image-classification-bb50a42af03e)
- [Identity Mappings in Deep Residual Networks](https://arxiv.org/pdf/1603.05027) $\rightarrow$ math proof
- Identity Trick
- [VAE](https://github.com/MatchLab-Imperial/deep-learning-course/blob/master/07_VAE_GAN.ipynb)
- [VAE celebA](https://colab.research.google.com/github/goodboychan/goodboychan.github.io/blob/main/_notebooks/2021-09-14-03-Variational-AutoEncoder-Celeb-A.ipynb)
- [Goodboychan courses](https://goodboychan.github.io)
- [ResNet strikes back (timm)](https://arxiv.org/pdf/2110.00476)
- [GMVAE Pytorch](https://colab.research.google.com/drive/1jGOAgwleppSMtUsr7XaldRNBbiwBMhxd#scrollTo=rs2-BLGkfp8m)

### June 19th

- [PaletteNet: Image Recolorization with Given Color Palette](https://www.researchgate.net/publication/319277684_PaletteNet_Image_Recolorization_with_Given_Color_Palette)
- [Open Model DB](https://openmodeldb.info/?t=arch%3Aesrgan)

Task: Code a simple autoencoder for MNIST reconstruction.

- [Deconvolution](https://www.matthewzeiler.com/mattzeiler/deconvolutionalnetworks.pdf)
- Separable convolution is not always faster when the input resolution is large and the depth (number of features) is small.
- Write out the math and prove the batchnorm + convolution fusion (the reparameterization used in RepVGG).

### June 18th

- [Encouraging Categorical Meaning in the Latent Space of a VAE](https://www.nathanblair.me/pdfs/Encouraging_categorical_meaning_in_the_latent_space_of_a_VAE.pdf)
- [Gaussian mixture variational autoencoders](https://arxiv.org/pdf/1611.02648)
- [GMVAE](https://github.com/jariasf/GMVAE/tree/master)

### June 15th

- [Memory Access Cost](https://arxiv.org/pdf/1807.11164)
- [MobileOne](https://arxiv.org/pdf/2206.04040): An Improved One millisecond Mobile Backbone
- Learn [Probabilistic Deep Learning](https://www.coursera.org/learn/probabilistic-deep-learning-with-tensorflow2/lecture/ULMEk/welcome-to-probabilistic-deep-learning-with-tensorflow-2)
- Compile an overview of the ways to make models fast and lightweight for mobile.
- [NYU Deep Learning](https://atcold.github.io/NYU-DLSP21/)
- [Gaussian mixture variational autoencoders](https://arxiv.org/pdf/1611.02648)
- [ShuffleNetV2](https://arxiv.org/pdf/1807.11164)
- Finetuning $\rightarrow$ consider dropout, batchnorm, and which modules to freeze/unfreeze.
- Divergence loss

### June 14th

Visualization: [DeepInsight](https://github.com/deepinsight/insightface), [Deep visualization toolbox](https://www.youtube.com/watch?v=AgkfIQ4IGaM&t=2s) application ([DeepVis](https://yosinski.com/deepvis)), [Grad-CAM](https://keras.io/examples/vision/grad_cam/)

- [MobileNet](https://arxiv.org/pdf/1704.04861) (inverted bottleneck, new activation, depthwise separable conv)
- [ShuffleNet](https://arxiv.org/pdf/1707.01083) (what is pointwise conv, channel shuffle)
- [Squeeze & excitation](https://arxiv.org/pdf/1709.01507) (attention mechanism)
- [EfficientNet](https://arxiv.org/pdf/1905.11946) (scaling the network)
- [GhostNet](https://arxiv.org/pdf/1911.11907) (what is a strong feature)
- [RepVGG](https://arxiv.org/pdf/2101.03697): MobileOne (reparameterize, best practice of MobileOne - similar to MobileNetV4)
- [Decoupled Weight Decay Regularization](https://arxiv.org/pdf/1711.05101)
- [Generative Adversarial Nets](https://arxiv.org/pdf/1406.2661):
  - [Keras tutorial](https://www.analyticsvidhya.com/blog/2021/06/a-detailed-explanation-of-gan-with-implementation-using-tensorflow-and-keras/)
  - [Pytorch tutorial](https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html)
  - [Gradio experiment](https://www.gradio.app/guides/create-your-own-friends-with-a-gan)
  - [NotExist](https://thispersondoesnotexist.com)
  - [Emojize human face](https://arxiv.org/pdf/1611.02200) + [3D Generative-Adversarial Modeling](https://arxiv.org/pdf/1610.07584)

### June 13th

I completed the task of implementing the Inception module, GoogLeNet, ResNet, and DenseNet from scratch in very basic code that beginners can understand, and compared them with the torchvision implementations in this notebook. I also noticed that the torchvision DenseNet variants all have $\geq 100$ layers, yet the model size is still very lightweight. Additionally, I found a technique for [efficient memory in DenseNet](https://arxiv.org/pdf/1711.09224) and a technique called [gradient checkpointing](https://freedium.cfd/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9).

### June 12th

I was introduced to the definition of the effective receptive field by reading the paper [Understanding the Effective Receptive Field in Deep Convolutional Neural Networks](https://arxiv.org/pdf/1701.04128), which is full of mathematical formulations, and this [explanation blog](https://freedium.cfd/understanding-the-effective-receptive-field-in-deep-convolutional-neural-networks-b2642297927e). Then I implemented [this notebook](https://colab.research.google.com/drive/1p3w5jiaEBqVr6Rwi5X70lP4YAGic0FzT?usp=sharing) to visualize the receptive fields of given pretrained models and see which input pixels $x_i$ of the image $X \in \mathbb{R}^{c \times h \times w}$ affect the output the most (that is, which region determines the output). Additionally, I learned to follow well-known repositories (e.g., [timm](https://github.com/huggingface/pytorch-image-models)).

### June 11th

I learned about the Knowledge Distillation technique in Deep Learning training, where a student model (a smaller version) is trained to mimic a teacher model (a large model), resulting in a lightweight model with accuracy comparable to the large one. I was introduced to this [documentation from Neural Compressor](https://github.com/intel/neural-compressor/blob/master/docs/source/distillation.md). Additionally, I was introduced to [DistilBERT](https://arxiv.org/pdf/1910.01108) ([code](https://huggingface.co/docs/transformers/en/model_doc/distilbert)) and [MEAL V2 (boosting Vanilla ResNet 50 to 80%)](https://arxiv.org/pdf/2009.08453) ([code](https://github.com/szq0214/MEAL-V2)).
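
The student-mimics-teacher idea above can be sketched as a loss function. This is a minimal sketch, assuming the common soft-target formulation: temperature-softened teacher and student distributions are compared with KL divergence and blended with ordinary cross entropy on the true label. The names `distillation_loss`, `temperature`, and `alpha`, and their default values, are illustrative assumptions and are not taken from Neural Compressor, DistilBERT, or MEAL V2. Plain Python is used so the arithmetic is visible without a framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label).

    temperature and alpha are hypothetical defaults chosen for illustration.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-target term: how far the student's softened distribution is
    # from the teacher's softened distribution.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Hard-target term: ordinary cross entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Made-up logits for a 3-class problem, true class index 2.
teacher = [1.0, 2.0, 3.0]
print(distillation_loss([1.0, 2.0, 3.0], teacher, label=2))  # student agrees
print(distillation_loss([3.0, 2.0, 1.0], teacher, label=2))  # student disagrees
```

The $T^2$ factor is the usual scaling that keeps the soft-target gradients comparable in magnitude when the temperature changes; a student that matches the teacher exactly contributes zero to the KL term.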
### June 10th

I had the chance to review what I had learned about model acceleration and prepare for future learning about monitoring/MLOps and deep learning on mobile. I was assigned the task of writing documentation (both theory and code) on [model acceleration](https://hackmd.io/@bao-internship/model_acceleration) so it can be reused later. I was also introduced to the simple monitoring library [W&B](https://colab.research.google.com/drive/1aGdTNoeRUzyKiFw3dVXBY_V1QQWpZhW_?usp=sharing).

### June 9th

I learned about Docker on Linux/WSL and installed CUDA/cuDNN for the TensorRT backend. I encountered numerous bugs because the Docker image I used did not automatically install a cuDNN version compatible with TensorRT's requirements. Finally, after manually installing cuDNN and copying the necessary `*.h` files into the environment path, I successfully ran TensorRT. Then I learned to write the `Dockerfile` to build the Docker image and the `docker-compose.yml` file to run the container. Identifying and fixing the bugs was an exhausting process. See [this documentation](https://hackmd.io/@bao-internship/Docker-linux), which I wrote, for more details.

### June 8th

I was assigned a task to convert my model into TorchScript and compile it with different backends (e.g., OpenVINO, Torch-TensorRT) for better inference time. I also learned about Triton from OpenAI, a language and compiler for writing custom GPU kernels that can accelerate matmul operations in Transformers with a large number of tokens, which can be applied to my model. The Cross Entropy trick ([Log-Sum-Exp](https://www.youtube.com/watch?v=MZ2VM32h37g)) in the Triton implementation is also intriguing.
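
The Log-Sum-Exp trick mentioned above can be illustrated in a few lines. This is a minimal sketch in plain Python, not the Triton kernel itself; it assumes raw logits and an integer target index, and the function names `logsumexp` and `cross_entropy` are illustrative. It shows why subtracting the maximum logit before exponentiating keeps cross entropy finite.

```python
import math


def logsumexp(xs):
    """log(sum(exp(x))) computed stably by factoring out the max."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def cross_entropy(logits, target):
    """-log softmax(logits)[target] = logsumexp(logits) - logits[target]."""
    return logsumexp(logits) - logits[target]


large_logits = [1000.0, 1001.0, 1002.0]

# The naive form overflows: math.exp(1000.0) is out of range for a float.
try:
    naive = math.log(sum(math.exp(x) for x in large_logits)) - large_logits[2]
except OverflowError:
    naive = None

stable = cross_entropy(large_logits, target=2)
print("naive:", naive, "stable:", stable)
```

Because softmax is invariant to adding a constant to every logit, the stable result for `[1000, 1001, 1002]` equals the result for `[0, 1, 2]`.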
[TorchScript](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html), [TorchScript Medium](https://freedium.cfd/hihuaweizhu/key-points-to-grasp-for-torchscript-beginners-c02cf94aaa50), [Torch Compiler](https://github.com/pytorch/pytorch), [Torch TensorRT](https://github.com/pytorch/TensorRT), [Triton OpenAI](https://openai.com/research/triton), [Triton Code](https://github.com/triton-lang/triton), [Reddit Triton](https://www.reddit.com/r/OpenAI/comments/18nf310/openai_triton_coursetutorial_recommendations/), [Triton-lang Tutorial](https://triton-lang.org/main/getting-started/tutorials/index.html), [Optuna](https://optuna.org)

### June 7th

I was tasked with preparing the solution for [assignment 1 (DL4AI)](https://colab.research.google.com/drive/167aJsLd98Xht-p1QV-tCHNZiAWucb1yE?usp=sharing). The requirements included following the code format of the Trainer classes in Hugging Face. I mimicked the way torchmetrics implements metric computation to keep the Trainer code clean. Additionally, I realized that while writing complex code is not difficult, simplifying complex concepts for learners really is. Here is the [simplified version](https://colab.research.google.com/drive/1GqCyFUPNe_EwSzznTTzqM6HYDHPZRzjt?usp=sharing) with text explanations suitable for beginners.

### June 6th

I was introduced to a wealth of materials, tools, and skills frameworks to enhance my engineering abilities. I also explored two computer vision papers, one of them on deploying a semantic segmentation model on mobile, both involving customized Transformer architectures. The DINO paper is particularly useful for understanding how high-norm tokens create artifacts in feature maps. The technique of adding additional tokens like $\text{[CLS]}$ or $\text{[REG]}$ is remarkable. I was also introduced to [PEFT](https://github.com/huggingface/peft) for finetuning.
[Hugging Face](https://huggingface.co), [Timm](https://github.com/rwightman/pytorch-image-models), [PyTorch Image Models](https://github.com/huggingface/pytorch-image-models), [WandB](https://wandb.ai/site), [Torch Segmentation](https://segmentation-models-pytorch.readthedocs.io/en/latest/index.html), [Albumentations](https://albumentations.ai), [Segformer](https://arxiv.org/pdf/2105.15203), [SeaFormer](https://arxiv.org/pdf/2301.13156), [Detectron2](https://github.com/facebookresearch/detectron2), [Gradient Accumulation](https://aman.ai/primers/ai/grad-accum-checkpoint/#:~:text=Gradient%20accumulation%20is%20a%20technique,after%20each%20batch%20of%20data.), [Fine-tune LLMs](https://lightning.ai/blog/gradient-accumulation/), [Torch Metrics](https://lightning.ai/docs/torchmetrics/stable/)

### June 5th

I learned about half/mixed precision, quantization, and how LoRA works. The concept of LoRA (decomposition for better efficiency) is similar to many of the convolutional decomposition methods in the Inception paper. I now understand modern fine-tuning techniques, especially prefix-tuning with prefix tokens for LLMs. I gained a broader perspective on the computer vision field, especially how ConvNets re-emerged after ViT. The BackProp blog and the Recipe for Training NNs have prepared me to maintain a more professional engineering codebase.

### June 4th

I delved deeper into the process of training a PyTorch deep learning model and converting it to the ONNX format for improved inference time and size efficiency. The system design book from ByteByteGo is fascinating. Additionally, I discovered the field of model compression with numerous papers, including LoRA.

### June 3rd

I gained a deep understanding of the process: math → code → principles → mindmap → visualization → tinkering. This approach enhances my math learning experience, making it more engaging. I also learned about the backend processing components and how to write a professional application using a layered architecture.

## 2. Jotted-down Key Points

- [ ] Read + Explore 5 papers
- [x] Inception, DenseNet, ResNet model [coding from scratch](https://colab.research.google.com/drive/1HLPsbr4AbUXTcfpIOgDET80ioIuyVLHJ?usp=sharing) from diagrams.
- [x] Write [model acceleration documentation](https://hackmd.io/@bao-internship/model_acceleration)
- [ ] Model deployment with NVIDIA Triton
- [x] Compile [TorchScript](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html), [Torch Compiler OpenVINO](https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html), [Torch-TensorRT](https://github.com/pytorch/TensorRT), and [NVIDIA benchmark](https://developer.nvidia.com/blog/accelerating-inference-up-to-6x-faster-in-pytorch-with-torch-tensorrt/).
- [ ] Optimize the Transformer block in my model with [OpenAI/Triton](https://openai.com/index/triton/)
- [x] Create structure/guideline/code of [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer) for DL4AI.
- [x] Revise Linear Algebra and Multivariable Calculus.
- [x] Tinkering with math ([formulation](https://bit.ly/4aO8zoQ)) ([tinkering](https://github.com/kyle-paul/math_foundation))
  - [x] Revisualize the rotation by an orthogonal matrix $U$ with respect to a set of vectors.
  - [x] Implement a process in which the matrix $U$ can be modified to create different effects (e.g., flipping, $45^\circ$ rotation, ...) $\rightarrow$ know which matrix $U$ achieves a certain change.
  - [x] Check what happens when $U$ is not an orthogonal matrix.
  - [x] Explore the applications of orthogonal matrices: linear transformation, principal component analysis, etc.
- [ ] Explore the D3 JavaScript library for data visualization.
- [x] Read 3 blogs about backend processing & get familiar with some dev notation (e.g., router, DI (dependency injection), DTO (data transfer object), etc.).
  - [x] [Layered Architecture and Design Patterns](https://freedium.cfd/https://levelup.gitconnected.com/write-python-apps-using-layered-architecture-and-design-patterns-75cb29b20c99)
  - [x] [Repository Pattern](https://freedium.cfd/global-identity-2?redirectUrl=https%3A%2F%2Fpython.plainenglish.io%2Frepository-pattern-is-insane-if-you-know-how-to-use-it-properly-python-88a05f03a50c)
  - [x] [Validator Pattern](https://freedium.cfd/global-identity-2?redirectUrl=https%3A%2F%2Flevelup.gitconnected.com%2Fvalidator-pattern-do-you-know-how-to-validate-your-data-properly-50edc5b3c6c6)
- [ ] All about ONNX:
  - [x] Reimplement/retrain my model RotCAtt-TransUNet++ so it is compatible with ONNX conversion.
  - [x] Compress the ONNX model with [neural-compressor](https://github.com/intel/neural-compressor/tree/master) (Intel)
  - [x] Compress the ONNX model with [onnxconverter-common](https://github.com/microsoft/onnxconverter-common/tree/master) (Microsoft)
  - [x] Run inference on the ONNX model with [OpenVINO](https://docs.openvino.ai/2024/home.html)
  - [ ] Install and run inference with [ONNX Runtime TensorRT](https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html) (reinstall CUDA/cuDNN globally)
- [ ] Learn and practice [System Design Theory](https://bytebytego.com/courses/system-design-interview/foreword)
- [ ] Explore the documentation and code base of the libraries above & [Napari](https://github.com/napari/napari).
- [ ] Read papers & [take notes](https://hackmd.io/@BouBou/BkjTYH6ER) & explore code of:
  - [x] [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) $\rightarrow$ [PEFT](https://huggingface.co/docs/peft/main/en/developer_guides/lora)
  - [x] [Parameter-Efficient Transfer Learning for NLP](https://arxiv.org/pdf/1902.00751)
  - [x] [Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning](https://arxiv.org/pdf/2012.13255)
  - [ ] [MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning](https://arxiv.org/pdf/2405.12130)
  - [ ] [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314)
  - [x] [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/pdf/2101.00190)
  - [ ] [Visual Instruction Tuning](https://arxiv.org/pdf/2304.08485)
  - [ ] [Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs](https://arxiv.org/pdf/2401.06209)
  - [x] [Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/pdf/1512.00567)
  - [ ] [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/pdf/1905.11946)
  - [x] [Vision Transformers Need Registers](https://arxiv.org/pdf/2309.16588)
  - [x] [CLIP: Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/pdf/2103.00020)
  - [x] [A ConvNet for the 2020s](https://arxiv.org/pdf/2201.03545)
  - [ ] [(OpenCLIP) Reproducible scaling laws for contrastive language-image learning](https://arxiv.org/pdf/2212.07143)
  - [x] [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)
- [x] Understand [BackProp](https://freedium.cfd/@karpathy/yes-you-should-understand-backprop-e2f06eab496b#.8ao38h4o1) to debug my model.
- [x] Read "[A Recipe for Training Neural Networks](https://karpathy.github.io/2019/04/25/recipe/)"
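
The orthogonal-matrix items in the checklist above (rotating, flipping, and checking what happens when $U$ is not orthogonal) can be sketched in a few lines. This is a minimal 2D sketch in plain Python, not the code from the linked tinkering repository; the helper names (`matmul`, `is_orthogonal`, `transform`) and the example matrices are illustrative assumptions. It checks $U^\top U = I$ and compares vector lengths before and after the transformation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def is_orthogonal(u, tol=1e-9):
    """U is orthogonal when U^T U equals the identity (within tol)."""
    prod = matmul(transpose(u), u)
    n = len(u)
    return all(abs(prod[i][j] - (1.0 if i == j else 0.0)) <= tol
               for i in range(n) for j in range(n))


def transform(u, v):
    """Apply matrix u to vector v."""
    return [sum(u[i][k] * v[k] for k in range(len(v))) for i in range(len(u))]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


theta = math.radians(45)
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]  # 45-degree rotation
flip_x = [[1.0, 0.0], [0.0, -1.0]]               # reflection across the x-axis
shear = [[1.0, 1.0], [0.0, 1.0]]                 # not orthogonal

v = [3.0, 4.0]
for name, u in [("rotation", rotation), ("flip", flip_x), ("shear", shear)]:
    print(name, is_orthogonal(u), norm(v), norm(transform(u, v)))
```

The rotation and the flip preserve the length of `v`, while the shear changes it, which is one concrete way to see what is lost when $U$ is not orthogonal.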