# 1. Materials

- [Overview of latest LLMs, benchmark on diverse metrics](https://artificialanalysis.ai/)
- https://boramorka.github.io/LLM-Book/
- https://github.com/stars/chiphuyen/lists/cool-llm-repos
- https://medium.com/@sujathamudadla1213/difference-between-qlora-and-lora-for-fine-tuning-llms-0ea35a195535
- https://towardsdatascience.com/deep-dive-into-llama-3-by-hand-%EF%B8%8F-6c6b23dc92b2
- https://medium.com/intel-tech/four-data-cleaning-techniques-to-improve-large-language-model-llm-performance-77bee9003625
- https://medium.com/@haifengl/a-tutorial-to-llm-f78dd4e82efc
- https://medium.com/@alexmriggio/lora-low-rank-adaptation-from-scratch-code-and-theory-f31509106650
- https://medium.com/@vipra_singh/building-llm-applications-introduction-part-1-1c90294b155b
- https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues
- https://ai.gopubby.com/run-the-strongest-open-source-llm-model-llama3-70b-with-just-a-single-4gb-gpu-7e0ea2ad8ba2
- https://jalammar.github.io/illustrated-bert/
- https://blog.gopenai.com/introduction-to-llm-agents-how-to-build-a-simple-reasoning-and-acting-agent-from-scratch-part-1-843e14686be7
- https://github.com/KruxAI/ragbuilder

# 2. Workflow

## 1. Data Processing 🧹

- https://medium.com/@elias.tarnaras/unlocking-the-secrets-of-pdf-parsing-a-comparative-analysis-of-python-libraries-79064bf12174
- https://blog.gopenai.com/simple-ways-to-parse-pdfs-for-better-rag-systems-82ec68c9d8cd
- https://medium.com/intel-tech/four-data-cleaning-techniques-to-improve-large-language-model-llm-performance-77bee9003625
- https://developer.nvidia.com/blog/processing-high-quality-vietnamese-language-data-with-nvidia-nemo-curator/

...

## 2. Training/Finetuning

## 3. Deployment

- https://aws.amazon.com/blogs/machine-learning/orchestrate-generative-ai-workflows-with-amazon-bedrock-and-aws-step-functions/

## 4. Model Optimization

- https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/
- https://neuralmagic.com/blog/24-sparse-llama-smaller-models-for-efficient-gpu-inference
- https://developer.nvidia.com/blog/5x-faster-time-to-first-token-with-nvidia-tensorrt-llm-kv-cache-early-reuse

# 3. Retrieval Augmented Generation (RAG)

- https://medium.com/@vipra_singh/building-llm-applications-introduction-part-1-1c90294b155b
- https://medium.com/@bijit211987/rag-vs-vectordb-2c8cb3e0ee52
- https://levelup.gitconnected.com/vector-databases-a-hands-on-tutorial-8d41445ea253
- https://deepnote.com/app/abid/spaCy-Resume-Analysis-81ba1e4b-7fa8-45fe-ac7a-0b7bf3da7826?Job_Category=AVIATION
- https://github.com/DataTurks-Engg/Entity-Recognition-In-Resumes-SpaCy/tree/master?tab=readme-ov-file
- https://www.kaggle.com/code/sanikamal/spacy-advanced-nlp-in-python#spaCy:-Advanced-NLP-in-Python-%F0%9F%90%8D
- https://medium.com/@thakermadhav/build-your-own-rag-with-mistral-7b-and-langchain-97d0c92fa146
- https://github.com/langchain-ai/langchain/blob/master/cookbook/retrieval_in_sql.ipynb

## 3.1 LLM Application Design Patterns [[Ref](https://freedium.cfd/https://medium.com/@code.brain/llm-application-design-patterns-8d20d1ab9b7a)]

LLM application design patterns are structured approaches to building applications that leverage large language models (LLMs). These patterns provide a framework for developers to efficiently integrate LLMs into their systems and products.

Here are some of the key design patterns:

### In-context Learning

- Description: Utilizes LLMs off the shelf, controlling behavior through prompting and contextual data.
- Components: LLMs, prompt templates, few-shot examples, external APIs, vector databases.
- Benefits: Reduces AI problems to data engineering problems, real-time data incorporation.
- Real-world Examples: Chatbots, legal document analysis.
- Significance: Simplifies AI development, outperforms fine-tuning for small datasets.

```mermaid
graph TD
    A[User Query] -->|Input| B[Prompt Construction/Retrieval]
    B -->|Determines relevance| D[Vector Database]
    E[External APIs] -->|Provides data| B
    F[Embedding Model] -->|Processes data| D
    G[Data Preprocessing/Embedding] -->|Stores data| D
    B -->|Compiled Prompt| H[Prompt Execution/Inference]
    H -->|Submits to LLM| I[Pre-trained LLM]
    I -->|Inference| J[Operational Systems]
    J -->|Logging, Caching, Validation| K[User Response]
```

![image](https://hackmd.io/_uploads/ByrCeaZcA.png)
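To make the pattern concrete, here is a minimal runnable sketch of in-context learning: retrieve context, compile a prompt from a template, and hand it to the model. The toy corpus, the keyword-overlap retriever, and the template are illustrative assumptions; a production system would use the embedding model and vector database from the diagram above.

```python
# Minimal sketch of the In-context Learning pattern: retrieve context,
# compile a prompt, and (in a real system) submit it to an LLM.
# CORPUS and the keyword retriever are illustrative stand-ins.

CORPUS = [
    "Invoices are archived for seven years under policy FIN-12.",
    "Refunds above $500 require manager approval.",
    "The on-call rotation changes every Monday at 09:00 UTC.",
]

TEMPLATE = """Answer using only the context below. If it is insufficient, say "I don't know".

Context:
{context}

Question: {question}
Answer:"""

def retrieve(question: str, k: int = 2) -> list[str]:
    """Toy retriever: rank chunks by word overlap (a vector DB in practice)."""
    q = set(question.lower().split())
    return sorted(CORPUS, key=lambda c: -len(q & set(c.lower().split())))[:k]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return TEMPLATE.format(context=context, question=question)

# The compiled prompt is what would be submitted to the LLM.
print(build_prompt("Who has to approve a $900 refund?"))
```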
### Data Preprocessing/Embedding

- Description: Involves storing private data to be retrieved later, breaking documents into chunks, and storing them in a vector database.
- Components: Embedding models, vector databases.
- Benefits: Efficient data retrieval for LLM processing.
- Real-world Examples: Data-sensitive applications requiring privacy.
- Significance: Enables efficient handling of large datasets.

```mermaid
graph TD
    A[Raw Data] -->|Input| B[Data Preprocessing]
    B -->|Chunks| C[Embedding Model]
    C -->|Embeddings| D[Vector Database]
    D -->|Stored for Retrieval| E[LLM Application]
    E -->|Retrieves relevant data| F[Prompt Construction]
```

![image](https://hackmd.io/_uploads/SkjlWpbqA.png)
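A short sketch of this pattern, assuming the sentence-transformers package and an in-memory store; the model name, chunk sizes, and sample document are arbitrary choices, and a real deployment would persist the vectors in a dedicated vector database.

```python
# Sketch of the Data Preprocessing/Embedding pattern: chunk a document,
# embed the chunks, and search them by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

DOC = (
    "Refunds above $500 require manager approval. "
    "Invoices are archived for seven years under policy FIN-12. "
    "The on-call rotation changes every Monday at 09:00 UTC."
)

def chunk(text: str, size: int = 80, overlap: int = 20) -> list[str]:
    """Fixed-size character chunks with overlap so sentences aren't cut cold."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")   # one possible embedding model
chunks = chunk(DOC)
vectors = model.encode(chunks, normalize_embeddings=True)  # rows are unit vectors

def search(query: str, k: int = 2) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q          # cosine similarity via dot product of unit vectors
    return [chunks[i] for i in np.argsort(-scores)[:k]]

print(search("How long are invoices kept?"))
```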
### Prompt Construction/Retrieval

- Description: Constructs prompts from a template, few-shot examples, external API data, and relevant documents.
- Components: Prompt templates, few-shot examples, APIs, vector databases.
- Benefits: Tailored prompts for specific queries, improved LLM responses.
- Real-world Examples: Customized user queries in applications.
- Significance: Enhances the relevance and accuracy of LLM outputs.

```mermaid
graph TD
    A[User Query] -->|Input| B[Prompt Template]
    B -->|Base for Prompt| C[Prompt Construction]
    D[Vector Database] -->|Relevant Documents| C
    E[External APIs] -->|Dynamic Data| C
    F[Few-Shot Examples] -->|Guidance| C
    C -->|Compiled Prompt| G[LLM Inference]
    G -->|Response| H[User Response]
```

![image](https://hackmd.io/_uploads/HkOMW6ZcR.png)

### Prompt Execution/Inference

- Description: Compiled prompts are submitted to LLMs for inference, with added systems like logging and validation.
- Components: LLMs, operational systems.
- Benefits: Real-time LLM inference, operational oversight.
- Real-world Examples: Dynamic response generation in chatbots.
- Significance: Streamlines the execution of complex LLM tasks.

```mermaid
graph TD
    A[User Query] -->|Input| B[Data Preprocessing/Embedding]
    B -->|Retrieve Relevant Data| C[Prompt Construction/Retrieval]
    C -->|Compile Prompt| D[Prompt Execution/Inference]
    D -->|Submit to LLM| E[LLM Inference]
    E -->|Generate Response| F[Operational Systems]
    F -->|Logging, Caching, Validation| G[User Response]
```

![image](https://hackmd.io/_uploads/BkcIWp-q0.png)

### AI Agent Frameworks

- Description: Frameworks that give AI apps new capabilities like complex problem-solving and learning from experience.
- Components: Reasoning/planning tools, memory systems.
- Benefits: Advanced AI capabilities, post-deployment learning.
- Real-world Examples: Intelligent virtual assistants, adaptive systems.
- Significance: Introduces advanced AI functionalities to applications.

```mermaid
graph TD
    A[User Input] -->|Triggers Agent| B[AI Agent Framework]
    B -->|Reasoning/Planning| C[Task Execution]
    C -->|Interacts with Tools/APIs| D[External Tools/APIs]
    D -->|Perform Actions| E[World Interaction]
    E -->|Feedback Loop| F[Memory/Recursion]
    F -->|Learn from Experience| B
    B -->|Generate Response| G[User Response]
```

![image](https://hackmd.io/_uploads/Hkj_Wa-9A.png)

---

# 4. [Interview] LLM Questions ❓

:::info
- https://medium.com/@masteringllm
- https://medium.com/data-science-at-microsoft/evaluating-llm-systems-metrics-challenges-and-best-practices-664ac25be7e5
- https://masteringllm.medium.com/how-to-prepare-for-large-language-models-llms-interview-a578e703b209
:::

## 4.1 Modern NLP (LLM)

__1. What is a token in a large language model context?__

__Answer:__

1. __Tokenization:__ Think of tokenization as a way to chop text into small pieces. These pieces can be as short as a single character or as long as a whole word. We call these pieces "sub-word tokens." It's like cutting a cake into slices.
2. __Types of Tokens:__ Tokens can represent entire words or just parts of them. For example:
    > The word "hamburger" is sliced into three tokens: "ham," "bur," and "ger," while a simple word like "pear" stays as one token.
3. __Starting with a Space:__ Some tokens begin with a space, like " hello" or " bye." The space is considered part of the token.
4. __Model's Skills:__ These models are great at understanding how tokens relate to each other. They're like word detectives that figure out what comes next in a sequence of tokens.
5. __Token Count:__ The number of tokens the model works with depends on how long your input and output text is. A helpful rule of thumb is that one token generally corresponds to ~4 characters of common English text. This translates to roughly ¾ of a word (so 100 tokens ≈ 75 words), as per [OpenAI](https://www.linkedin.com/company/openai/).
    ![image](https://hackmd.io/_uploads/Hy7Frp02T.png)
    Use the link below to understand more about how OpenAI counts tokens and to see visually how text is split into tokens: https://platform.openai.com/tokenizer?source=post_page
6. The number of tokens __varies with each model__. The best way to get the token count is to use the __specific tokenizer__ of that model: pass in the text, then take the length of the tokenized output. For example, the __number of tokens for the FlanT5 LLM will differ from Llama 2 or Mistral__.

__2. What are your strategies to calculate the cost of running LLMs?__

__Answer:__

The cost of running LLMs can be divided into two parts:

- 🌟 __Private LLMs (like GPT 3.5 or 4 models):__
    - Private LLMs usually bill by counting either the __number of tokens__ (GPT 3.5 or 4) or the __number of characters__ (PaLM). You can divide the cost into two parts:
        - 📝 Prompt or input tokens/characters
        - 📤 Completion or output tokens/characters
    - __Strategy:__
        1. Prompt tokens are usually easy to calculate. In the case of GPT 3.5 or 4, you can use the __tiktoken__ library to accurately find the number of tokens (see the sketch after this answer). A detailed notebook on counting tokens for different OpenAI models:
            ![image](https://hackmd.io/_uploads/SyFoPpA2T.png)
            https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb?source=post_page
        2. Since output tokens depend on the specific task, there are several strategies to approximate them:
            a. Take a __statistically significant sample and calculate output tokens__ to find the average number of output tokens.
            b. Limit __max output tokens__ in the API response.
            c. __Restrict the output tokens__ instead of allowing free text; for example, the output can be restricted to specific JSON key-value pairs.
- 🚀 __Open Source LLMs:__
    - If the open-source model is available for commercial use without any restrictions:
        a. __Create a benchmark__ by running parallel requests on a GPU machine.
        b. Measure the __number of tokens and the time__ used by the process.
        c. This tells you the throughput in __X tokens/min__.
        d. From that, calculate how much time is required to process all your tokens.
        e. Look up the __cost of running that instance__ on the cloud.
        f. You can find a detailed example of the __cost of Mistral AI vs GPT-4__ in this article: https://medium.com/@masteringllm/mistral-7b-is-187x-cheaper-compared-to-gpt-4-b8e5ee1c9fc2?source=post_page
    - If the open-source model has a __restricted commercial license__, you might want to weigh the running cost against the revenue its output generates. This gives an approximate cost of running the LLM.
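A small sketch of the input-side estimate with tiktoken (the library mentioned above); the per-token prices below are placeholders, not current rates — check the provider's pricing page.

```python
# Counting prompt tokens with tiktoken and estimating a worst-case bill.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt = "Summarize the attached contract in three bullet points."
n_input = len(enc.encode(prompt))

# Placeholder prices in $ per token -- NOT real rates.
PRICE_IN, PRICE_OUT = 0.50 / 1e6, 1.50 / 1e6
max_output = 300          # strategy (b): cap max output tokens in the API call
print(f"{n_input} input tokens, worst-case cost ~ "
      f"${n_input * PRICE_IN + max_output * PRICE_OUT:.6f}")
```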
__3. Can you provide a high-level overview of the training process of ChatGPT?__

__Answer:__

ChatGPT is trained in 3 steps:

1. 📚 __Pre-training:__
    - ChatGPT undergoes an initial phase called __pre-training__.
    - During this phase, large language models (LLMs) such as GPT-3, the base of ChatGPT, are trained on an __extensive dataset sourced from the internet__.
    - The data is subjected to cleaning, preprocessing, and tokenization.
    - Transformer architectures, the standard in natural language processing, are used during this phase.
    - The __primary objective here is to enable the model to predict the next word__ in a given sequence of text (see the sketch after this answer).
    - This phase equips the model with the capability to understand language patterns, but it does not yet provide the ability to comprehend instructions or questions.
2. 🛠️ __Supervised Fine-Tuning or Instruction Tuning:__
    - The next step is __supervised fine-tuning or instruction tuning__.
    - During this stage, the model is exposed to __user messages as input and AI trainer responses as targets__.
    - The model learns to __generate responses by minimizing the difference between its predictions and the provided responses__.
    - This phase marks the transition of the model from merely understanding language patterns to __understanding and responding to instructions__.
3. 🔄 __Reinforcement Learning from Human Feedback (RLHF):__
    - Reinforcement Learning from Human Feedback (RLHF) is employed as a __subsequent fine-tuning step__.
    - RLHF aims to align the model's behaviour with human preferences, with a focus on being __helpful, honest, and harmless (HHH)__.
    - RLHF consists of two crucial sub-steps:
        - __Training a Reward Model Using Human Feedback:__ In this sub-step, multiple __model outputs for the same prompt are generated and ranked by human labelers to create a reward model.__ This model learns human preferences for HHH content.
        - __Replacing Humans with the Reward Model for Large-Scale Training:__ Once the reward model is trained, it can replace __humans in labeling data, streamlining the feedback loop__. Feedback from the reward model is used to further fine-tune the LLM at a large scale.
    - RLHF plays a pivotal role in enhancing the __model's behaviour and ensuring alignment with human values__, helping guarantee useful, truthful, and safe responses.

![](https://miro.medium.com/v2/resize:fit:720/format:webp/1*PJc0XATu7BdpcebVzJO85A.gif)
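Steps 1 and 2 share the same underlying objective, next-token prediction; only the data differs (web text vs. instruction/response pairs). A minimal sketch with Hugging Face transformers, using GPT-2 purely as a small stand-in model: passing `labels=input_ids` makes the library shift targets by one position and compute the cross-entropy loss internally.

```python
# The next-token prediction objective behind pre-training and SFT.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "User: What is 2+2?\nAssistant: 4"      # an SFT pair looks like prompt+target
batch = tok(text, return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])  # labels are shifted internally
print(out.loss)       # average next-token cross-entropy
out.loss.backward()   # one gradient step of the same objective used at scale
```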
๐Ÿ”„ ๐—ฅ๐—ฒ๐—ถ๐—ป๐—ณ๐—ผ๐—ฟ๐—ฐ๐—ฒ๐—บ๐—ฒ๐—ป๐˜ ๐—Ÿ๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ถ๐—ป๐—ด ๐—ณ๐—ฟ๐—ผ๐—บ ๐—›๐˜‚๐—บ๐—ฎ๐—ป ๐—™๐—ฒ๐—ฒ๐—ฑ๐—ฏ๐—ฎ๐—ฐ๐—ธ (๐—ฅ๐—›๐—™๐—Ÿ): - Reinforcement Learning from Human Feedback (RLHF) is employed as a __subsequent fine-tuning step.__ - RLHF aims to align the model's behaviour with human preferences, with a focus on being __helpful, honest, and harmless (HHH)__. - RLHF consists of two crucial sub-steps: - __Training Reward Model Using Human Feedback__: In this sub-step, mutiple __model outputs for the same prompt are generated and ranked by human labelers to create a reward model.__ This model learns human preferences for HHH content. - __Replacing Humans with Reward Model for Large-Scale Training:__ Once the reward model is trained, it can replace __humans in labeling data, streamlining the feedback loop__. Feedback from the reward model is used to further fine-tune the LLM at a large scale. - RLHF plays pivotal role in enhancing the __model's behaviour and ensuring alignment with human values__, thereby guaranteeing useful, truthful, and safe responses. ![](https://miro.medium.com/v2/resize:fit:720/format:webp/1*PJc0XATu7BdpcebVzJO85A.gif) __4. What are some of the strategies to ๐—ฟ๐—ฒ๐—ฑ๐˜‚๐—ฐ๐—ฒ ๐—ต๐—ฎ๐—น๐—น๐˜‚๐—ฐ๐—ถ๐—ป๐—ฎ๐˜๐—ถ๐—ผ๐—ป in large language models (LLMs)?__ __Answer:__ Hallucinations can be detected in LLM at different levels: - Prompt Level - Model Level - Self-check __Prompt Level:__ 1. __Include instructions to the model:__ Include instruction to the model not to make up stuff on its own. For e.g. "__Do not make up stuff outside of given context__" 2. __Repeat:__ Repeat most important instruction multiple times. 3. __Position:__ Position most important instruction at the end, making use of latency effect. 4. __Parameter:__ Keep temperatute to __0__. 5. __Restrict:__ Restrict output to a confined list instead of free float text. 6. __Add CoT type instructions:__ For GPT models __"Let's think step by step"__ works for reasoning tasks, For PaLM - __"Take a deep break and work on this problem step by step"__ outperforms. 7. __Use Few shot examples__ - Use domain/use case specfic few shot examples, also there is a recent study on new techniques called __"Analogical Prompting"__ where model generates its own examples internally which out performs few shot prompting. ![image](https://hackmd.io/_uploads/ByKsDg166.png) 8. __In Context Learning:__ Use In-Context learning to provide better context to the model. 10. __Self-consistency/Voting:__ Generating multiple answers from the model and selecting the most frequent answers. __Model Level:__ 1. __DoLa:__ Decoding by Contrasting Layers Improves Factuality in Large Language Models: Simple decoding strategy in large pre-trained LLMs to reduce hallucinations. https://github.com/voidism/DoLa 2. __Fine-Tuned model on good quality data__ - Fine-tuning small LLM model on goof quality data has shown promising results as well as help reduce hallucinations. __Self-check:__ Methods like chain of verification can help reduce the hallucinations to a great extent, read more here: https://medium.com/@masteringllm/chain-of-verification-to-reduce-hallucinations-db85e6a68645 ### 4.1.1 Prompt engineering & basics of LLM 1. What is the difference between Predictive/ Discriminative AI and generative AI? 2. What is LLM & how LLMs are trained? 3. What is a token in the language model? 4. How to estimate the cost of running a SaaS-based & Open source LLM model? 5. 
### 4.1.1 Prompt engineering & basics of LLM

1. What is the difference between Predictive/Discriminative AI and Generative AI?
2. What is an LLM, and how are LLMs trained?
3. What is a token in the language model?
4. How to estimate the cost of running a SaaS-based & open-source LLM model?
5. Explain the temperature parameter and how to set it.
6. What are different decoding strategies for picking output tokens? (See the sampling sketch after this list.)
7. What are the different ways you can define stopping criteria in a large language model?
8. How to use stop sequences in LLMs?
9. Explain the basic structure of prompt engineering.
10. Explain the types of prompt engineering.
11. Explain in-context learning.
12. What are some of the aspects to keep in mind while using few-shot prompting?
13. What are certain strategies to write good prompts?
14. What is hallucination, and how can it be controlled using prompt engineering?
15. How do I improve the reasoning ability of my LLM through prompt engineering?
16. How to improve LLM reasoning if your CoT prompt fails?
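For questions 5 and 6, a from-scratch sketch of temperature, top-k, and top-p (nucleus) sampling over a single logits vector; the toy logits are arbitrary.

```python
# Decoding strategies: temperature rescales logits; top-k keeps the k best
# tokens; top-p keeps the smallest set whose probability mass reaches p.
import numpy as np

def sample(logits: np.ndarray, temperature=1.0, top_k=0, top_p=1.0) -> int:
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    if top_k:                                    # zero out everything but the top k
        probs[np.argsort(probs)[:-top_k]] = 0.0
    if top_p < 1.0:                              # nucleus: truncate the sorted tail
        order = np.argsort(-probs)
        csum = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(csum, top_p)) + 1
        probs[order[cutoff:]] = 0.0
    probs /= probs.sum()                         # renormalize survivors
    return int(np.random.choice(len(probs), p=probs))

print(sample(np.array([2.0, 1.0, 0.5, -1.0]), temperature=0.8, top_k=3, top_p=0.9))
```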
### 4.1.2 Retrieval augmented generation (RAG)

1. How to increase accuracy and reliability and make answers verifiable in an LLM?
2. How does Retrieval Augmented Generation (RAG) work?
3. What are some of the benefits of using the RAG system?
4. What are the architecture patterns you see when you want to customize your LLM with proprietary data?
5. When should I use fine-tuning instead of RAG?

### 4.1.3 Chunking strategies

1. What is chunking, and why do we chunk our data?
2. What factors influence chunk size?
3. What are the different types of chunking methods available?
4. How to find the ideal chunk size?

### 4.1.4 Embedding Models

1. What are vector embeddings? And what is an embedding model?
2. How is an embedding model used in the context of an LLM application?
3. What is the difference between embedding short and long content?
4. How to benchmark embedding models on your data?
5. Walk me through the steps of improving a sentence-transformer model used for embedding.

### 4.1.5 Internal working of vector DB

1. What is a vector DB?
2. How is a vector DB different from traditional databases?
3. How does a vector database work?
4. Explain the difference between a vector index, a vector DB & vector plugins.
5. What are different vector search strategies?
6. How does clustering reduce search space? When does it fail, and how can we mitigate these failures?
7. Explain the random projection index.
8. Explain the locality-sensitive hashing (LSH) indexing method.
9. Explain the product quantization (PQ) indexing method.
10. Compare different vector indexes; given a scenario, which vector index would you use for a project?
11. How would you decide on ideal search similarity metrics for the use case?
12. Explain the different types and challenges associated with filtering in vector DBs.
13. How do you determine the best vector database for your needs?

### 4.1.6 Advanced search algorithms

1. Why it's important to have very good search.
2. What are the architecture patterns for information retrieval & semantic search, and their use cases?
3. How can you achieve efficient and accurate search results in large-scale datasets?
4. Explain the keyword-based retrieval method.
5. How to fine-tune re-ranking models?
6. Explain the most common metric used in information retrieval and when it fails.
7. I have a recommendation system; which metric should I use to evaluate the system?
8. Compare different information retrieval metrics and which one to use when.

### 4.1.7 Language models internal working

1. Detailed understanding of the concept of self-attention.
2. Overcoming the disadvantages of the self-attention mechanism.
3. Understanding positional encoding.
4. Detailed explanation of Transformer architecture.
5. Advantages of using a transformer instead of an LSTM.
6. Difference between local attention and global attention.
7. Understanding the computational and memory demands of transformers.
8. Increasing the context length of an LLM.
9. How to optimize transformer architecture for large vocabularies.
10. What is a mixture-of-experts model?

### 4.1.8 Supervised finetuning of LLM

1. What is fine-tuning, and why is it needed for LLMs?
2. In which scenarios do we need to fine-tune an LLM?
3. How to make the decision to fine-tune?
4. How do you create a fine-tuning dataset for Q&A?
5. How do you improve the model so it answers only if there is sufficient context for doing so?
6. How to set hyperparameters for fine-tuning?
7. How to estimate infra requirements for fine-tuning an LLM?
8. How do you fine-tune an LLM on consumer hardware?
9. What are the different categories of PEFT methods?
10. Explain different reparameterized methods for fine-tuning LLMs.
11. What is catastrophic forgetting in the context of LLMs?

### 4.1.9 Preference Alignment (RLHF/DPO)

1. At which stage would you decide to go for a preference-alignment method rather than SFT?
2. Explain different preference alignment methods.
3. What is RLHF, and how is it used?
4. Explain the reward hacking issue in RLHF.

### 4.1.10 Evaluation of LLM system

1. How do you evaluate the best LLM model for your use case?
2. How to evaluate a RAG-based system?
3. What are the different metrics that can be used to evaluate an LLM?
4. Explain the chain of verification.

### 4.1.11 Hallucination control techniques

1. What are the different forms of hallucinations?
2. How do you control hallucinations at different levels?

### 4.1.12 Deployment of LLM

1. Why does quantization not decrease the accuracy of an LLM?

### 4.1.13 Agent-based system

1. Explain the basic concepts of an agent and the types of strategies available to implement agents.
2. Why do we need agents, and what are some common strategies to implement them?
3. Explain ReAct prompting with a code example and its advantages.
4. Explain the Plan-and-Execute prompting strategy.
5. Explain OpenAI functions with code examples.
6. Explain the difference between OpenAI functions and LangChain agents.

### 4.1.14 Prompt Hacking

1. What is prompt hacking, and why should we bother about it?
2. What are the different types of prompt hacking?
3. What are the different defense tactics against prompt hacking?

### 4.1.15 Case study & scenario-based questions

1. How to optimize the cost of the overall LLM system?

## 4.2 Classic NLP

- https://www.analyticsvidhya.com/blog/2021/06/part-11-step-by-step-guide-to-master-nlp-syntactic-analysis/

### 4.2.1 TF-IDF & ML (8)

1. Write TF-IDF from scratch. (A from-scratch sketch follows this list.)
2. What is normalization in TF-IDF?
3. Why do you need to know about TF-IDF in our time, and how can you use it in complex models?
4. Explain how Naive Bayes works. What can you use it for?
5. How can SVM be prone to overfitting?
6. Explain possible methods for text preprocessing (lemmatization and stemming). What algorithms do you know for this, and in what cases would you use them?
7. What metrics for text similarity do you know?
8. Explain the difference between cosine similarity and cosine distance. Which of these values can be negative? How would you use them?
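For question 1 above, a from-scratch sketch using the smoothed IDF that scikit-learn applies by default, followed by L2 normalization:

```python
# TF-IDF from scratch: tf(w, d) * (log((1 + N) / (1 + df(w))) + 1), L2-normalized.
import math
from collections import Counter

def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))  # document frequency
    out = []
    for doc in docs:
        tf = Counter(doc)
        vec = {w: (tf[w] / len(doc)) * (math.log((1 + n) / (1 + df[w])) + 1)
               for w in tf}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        out.append({w: v / norm for w, v in vec.items()})   # L2 normalization
    return out

print(tfidf([["sweet", "red", "apple"], ["red", "red", "car"]]))
```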
### 4.2.2 METRICS (7)

9. Explain precision and recall in simple words. What would you look at in the absence of an F1 score?
10. In what case would you observe changes in specificity?
11. When would you look at macro, and when at micro metrics? Why does the weighted metric exist?
12. What is perplexity? What can we compute it with?
13. What is the BLEU metric?
14. Explain the difference between different types of ROUGE metrics.
15. What is the difference between BLEU and ROUGE?

### 4.2.3 WORD2VEC (9)

16. Explain how Word2Vec learns. What is the loss function? What is maximized?
17. What methods of obtaining embeddings do you know? When will each be better?
18. What is the difference between static and contextual embeddings?
19. What are the two main architectures you know, and which one learns faster?
20. What is the difference between GloVe, ELMo, FastText, and Word2Vec?
21. What is negative sampling, and why is it needed? What other tricks for Word2Vec do you know, and how can you apply them?
22. What are dense and sparse embeddings? Provide examples.
23. Why might the dimensionality of embeddings be important?
24. What problems can arise when training Word2Vec on short textual data, and how can you deal with them?

### 4.2.4 RNN & CNN (7)

25. How many training parameters are there in a simple 1-layer RNN?
26. How does RNN training occur?
27. What problems exist in RNNs?
28. What types of RNN networks do you know? Explain the difference between GRU and LSTM.
29. What parameters can we tune in such networks? (Stacking, number of layers.)
30. What are vanishing gradients for an RNN? How do you solve this problem?
31. Why use a convolutional neural network in NLP, and how can you use it? How can you compare CNNs within the attention paradigm?

### 4.2.5 ATTENTION AND TRANSFORMER ARCHITECTURE (15)

32. How do you compute attention? (Additional: for what task was it proposed, and why?)
33. What is the complexity of attention? Compare it with the complexity in an RNN.
34. Compare RNN and attention. In what cases would you use attention, and when an RNN?
35. Write attention from scratch. (See the sketch after this list.)
36. Explain masking in attention.
37. What is the dimensionality of the self-attention matrix?
38. What is the difference between BERT and GPT in terms of attention calculation?
39. What is the dimensionality of the embedding layer in the transformer?
40. Why are embeddings called contextual? How does it work?
41. What is used in transformers, layer norm or batch norm, and why?
42. Why do transformers have PreNorm and PostNorm?
43. Explain the difference between soft and hard (local/global) attention.
44. Explain multi-head attention.
45. What other types of attention mechanisms do you know? What are the purposes of these modifications?
46. How does self-attention become more complex with an increase in the number of heads?
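For questions 35-36, a from-scratch sketch of scaled dot-product attention with optional causal masking; the tensor sizes are arbitrary toy values.

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=False):
    # q, k, v: (batch, seq_len, d_k); scores: (batch, seq_len, seq_len)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if causal:                       # forbid attending to future positions
        t = scores.size(-1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(2, 5, 16)            # toy self-attention: q = k = v = x
print(attention(x, x, x, causal=True).shape)   # torch.Size([2, 5, 16])
```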
### 4.2.6 TRANSFORMER MODEL TYPES (7)

47. Why does BERT largely lag behind RoBERTa, and what can you take from RoBERTa?
48. What are the T5 and BART models? How do they differ?
49. What are task-agnostic models? Provide examples.
50. Explain transformer models by comparing BERT, GPT, and T5.
51. What major problem exists in BERT, GPT, etc., regarding model knowledge? How can this be addressed?
52. How does a decoder model like GPT work during training and inference? What is the difference?
53. Explain the difference between heads and layers in transformer models.

### 4.2.7 POSITIONAL ENCODING (6)

54. Why is information about positions lost in embeddings of transformer models with attention?
55. Explain approaches to positional embeddings and their pros and cons.
56. Why can't we simply add an embedding with the token index?
57. Why don't we train positional embeddings?
58. What are relative and absolute positional encoding?
59. Explain in detail the working principle of rotary positional embeddings.

### 4.2.8 PRETRAINING (4)

60. How does causal language modeling work?
61. When do we use a pretrained model?
62. How do you train a transformer from scratch? Explain your pipeline, and in what cases would you do this?
63. What models, besides BERT and GPT, do you know for various pretraining tasks?

### 4.2.9 TOKENIZERS (9)

64. What types of tokenizers do you know? Compare them.
65. Can you extend a tokenizer? If yes, in what case would you do this? When would you retrain a tokenizer? What needs to be done when adding new tokens?
66. How do regular tokens differ from special tokens?
67. Why is lemmatization not used in transformers? And why do we need tokens?
68. How is a tokenizer trained? Explain with the examples of WordPiece and BPE.
69. What position does the CLS vector occupy? Why?
70. What tokenizer is used in BERT, and which one in GPT?
71. Explain how modern tokenizers handle out-of-vocabulary words.
72. What does the tokenizer vocab size affect? How would you choose it in the case of new training?

### 4.2.10 TRAINING (14)

73. What is class imbalance? How can it be identified? Name all approaches to solving this problem.
74. Can dropout be used during inference, and why?
75. What is the difference between the Adam optimizer and AdamW?
76. How do consumed resources change with gradient accumulation?
77. How to optimize resource consumption during training?
78. What ways of distributed training do you know?
79. What is textual augmentation? Name all methods you know.
80. Why is padding less frequently used? What is done instead?
81. Explain how warm-up works.
82. Explain the concept of gradient clipping.
83. How does teacher forcing work? Provide examples.
84. Why and how should skip connections be used?
85. What are adapters? Where and how can we use them?
86. Explain the concept of metric learning. What approaches do you know?

### 4.2.11 INFERENCE (4)

87. What does the temperature in softmax control? What value would you set?
88. Explain the types of sampling in generation: top-k, top-p, nucleus sampling.
89. What is the complexity of beam search, and how does it work?
90. What is a sentence embedding? What are the ways you can obtain it?

### 4.2.12 LLM (10)

91. How does LoRA work? How would you choose its parameters? Imagine that we want to fine-tune a large language model and apply LoRA with a small r, but the model still doesn't fit in memory. What else can be done? (A minimal LoRA sketch follows this list.)
92. What is the difference between prefix tuning, p-tuning, and prompt tuning?
93. Explain the scaling law.
94. Explain all stages of LLM training. From which stages can we abstain, and in what cases?
95. How does RAG work? How does it differ from few-shot KNN?
96. What quantization methods do you know? Can we fine-tune quantized models?
97. How can you prevent catastrophic forgetting in LLMs?
98. Explain the working principles of the KV cache, Grouped-Query Attention, and Multi-Query Attention.
99. Explain the technology behind Mixtral. What are its pros and cons?
100. How are you? How are things going?
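For question 91, a minimal LoRA linear layer: the pretrained weight is frozen and only the rank-r factors A and B are trained, scaled by alpha / r, so the effective weight is W + (alpha / r) · B A. The hyperparameters and layer size here are illustrative.

```python
# Minimal LoRA wrapper around a frozen nn.Linear.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze pretrained W (and bias)
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # W x + (alpha / r) * B (A x); only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(4, 768)).shape)         # torch.Size([4, 768])
```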