
PyTorch Masterclass: Part 3 – Deep Learning for Natural Language Processing with PyTorch

Training a Transformer

The training process is similar to seq2seq models:

# Hyperparameters
SRC_VOCAB_SIZE = len(de_vocab)
TRG_VOCAB_SIZE = len(en_vocab)
EMBED_SIZE = 512
NUM_LAYERS = 6
FORWARD_EXPANSION = 4
HEADS = 8
DROPOUT = 0.1
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
MAX_LENGTH = 100
LEARNING_RATE = 0.0005
BATCH_SIZE = 32

# Initialize model
model = Transformer(
    SRC_VOCAB_SIZE,
    TRG_VOCAB_SIZE,
    de_vocab['<pad>'],
    en_vocab['<pad>'],
    EMBED_SIZE,
    NUM_LAYERS,
    FORWARD_EXPANSION,
    HEADS,
    DROPOUT,
    DEVICE,
    MAX_LENGTH
).to(DEVICE)

# Optimizer and loss
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss(ignore_index=en_vocab['<pad>'])

# Training loop (similar to seq2seq)
def train(model, dataloader, optimizer, criterion, device):
    model.train()
    epoch_loss = 0
    
    for src, trg in dataloader:
        src, trg = src.to(device), trg.to(device)
        
        # Shift target for teacher forcing
        output = model(src, trg[:, :-1])
        output = output.reshape(-1, output.shape[2])
        trg = trg[:, 1:].reshape(-1)
        
        optimizer.zero_grad()
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
        optimizer.step()
        
        epoch_loss += loss.item()
    
    return epoch_loss / len(dataloader)
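
A minimal driver loop for this train function, assuming a train_dataloader built as in the earlier seq2seq sections (the name here is illustrative), could look like this:

N_EPOCHS = 10

for epoch in range(N_EPOCHS):
    # One full pass over the training data using the train() function above
    train_loss = train(model, train_dataloader, optimizer, criterion, DEVICE)
    print(f"Epoch {epoch+1:02d} | Train Loss: {train_loss:.3f}")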

Pre-trained Transformer Models

Instead of training from scratch, we can use pre-trained models:

1. BERT (Bidirectional Encoder Representations from Transformers)

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
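
The first position of last_hidden_state corresponds to the [CLS] token, which is commonly used as a sentence-level representation; BERT also exposes a pooled version of it:

cls_embedding = last_hidden_states[:, 0, :]   # shape: [batch_size, hidden_size]
pooled_output = outputs.pooler_output         # tanh-projected [CLS] representation
print(cls_embedding.shape, pooled_output.shape)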

2. GPT (Generative Pre-trained Transformer)

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss
logits = outputs.logits
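
Beyond computing the language-modeling loss, GPT-2 can generate continuations with generate (a minimal sketch; the sampling settings below are illustrative):

generated = model.generate(
    inputs["input_ids"],
    max_new_tokens=20,                  # length of the continuation
    do_sample=True,                     # sample instead of greedy decoding
    top_k=50,
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))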

3. T5 (Text-to-Text Transfer Transformer)

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

input_text = "translate English to German: Hello, my dog is cute"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))
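
Because T5 casts every task as text-to-text, switching tasks only changes the input prefix. A minimal sketch (the input text is illustrative):

inputs = tokenizer(
    "summarize: PyTorch is an open source deep learning framework that provides "
    "tensor computation with GPU acceleration and automatic differentiation.",
    return_tensors="pt"
)
summary_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))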

Fine-tuning BERT for Text Classification

Let's fine-tune BERT for sentiment analysis:

from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
).to(device)

# Custom dataset
class BertDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        encoding = self.tokenizer.encode_plus(
            self.texts[idx],
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

# Prepare data
train_texts = [text for _, text in train_data]
train_labels = [1 if label == 'pos' else 0 for label, _ in train_data]

train_dataset = BertDataset(train_texts, train_labels, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# Training setup
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss().to(device)

# Training loop
for epoch in range(3):
    model.train()
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = model(
            input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        
        loss = outputs.loss
        loss.backward()
        
        optimizer.step()
        optimizer.zero_grad()
    
    print(f"Epoch {epoch+1} completed")

# Save model
model.save_pretrained('./sentiment-bert')
tokenizer.save_pretrained('./sentiment-bert')
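
To use the saved classifier later, reload it from the same directory and run it on new text (a minimal sketch; the example sentence is illustrative):

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('./sentiment-bert')
model = BertForSequenceClassification.from_pretrained('./sentiment-bert').to(device)
model.eval()

inputs = tokenizer("A surprisingly moving film.", return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits
print("Positive" if logits.argmax(dim=1).item() == 1 else "Negative")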

Transformer Variants and Improvements

1. Transformer-XL

Handles longer sequences with segment-level recurrence:

from transformers import TransfoXLModel, TransfoXLTokenizer

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')

2. Longformer

Uses a sparse sliding-window attention pattern (plus a few globally attending tokens) to handle long documents:

from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained('allenai/longformer-base-4096')
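
Longformer restricts most tokens to local sliding-window attention and lets a few selected positions (such as the first token) attend globally; those positions are marked with a global_attention_mask. A minimal sketch:

import torch

inputs = tokenizer("A very long document ...", return_tensors="pt")

# 1 = global attention, 0 = local sliding-window attention
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # give the first (classification) token global attention

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)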

3. Reformer

Uses locality-sensitive hashing for efficient attention:

from transformers import ReformerModel, ReformerTokenizer

tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')
model = ReformerModel.from_pretrained('google/reformer-crime-and-punishment')

4. BigBird

Combines random, window and global attention:

from transformers import BigBirdModel, BigBirdTokenizer

tokenizer = BigBirdTokenizer.from_pretrained('google/bigbird-roberta-base')
model = BigBirdModel.from_pretrained('google/bigbird-roberta-base')

Building a Sentiment Analysis Model from Scratch

Let's build a complete sentiment analysis pipeline using the techniques we've learned.

Step 1: Data Preparation

We'll use the IMDB dataset with enhanced preprocessing:

import pandas as pd
from sklearn.model_selection import train_test_split
import re
import string
import torch
from torch.utils.data import Dataset, DataLoader

# Load data
train_data = pd.read_csv('imdb_train.csv')
test_data = pd.read_csv('imdb_test.csv')

# Enhanced text cleaning
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    
    # Remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply cleaning
train_data['review'] = train_data['review'].apply(clean_text)
test_data['review'] = test_data['review'].apply(clean_text)

# Split training data into train and validation
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_data['review'].values,
    train_data['sentiment'].values,
    test_size=0.1,
    random_state=42
)

Step 2: Advanced Tokenization with Subwords

Let's use BERT's WordPiece tokenizer for better handling of rare words:

from tokenizers import BertWordPieceTokenizer

# Initialize tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=True,
    lowercase=True
)

# Write the cleaned training reviews to a plain-text file (one review per line)
with open("imdb_train.txt", "w") as f:
    f.write("\n".join(train_data['review'].tolist()))

# Train tokenizer
tokenizer.train(
    files=["imdb_train.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=[
        "[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"
    ],
    limit_alphabet=1000,
    wordpieces_prefix="##"
)

# Save tokenizer (the output directory must already exist)
import os
os.makedirs("imdb_tokenizer", exist_ok=True)
tokenizer.save_model("imdb_tokenizer")

# Tokenization function
def encode_texts(texts, max_length=256):
    # Truncation and padding are configured on the tokenizer itself;
    # encode_batch does not accept max_length/padding arguments
    tokenizer.enable_truncation(max_length=max_length)
    tokenizer.enable_padding(
        length=max_length,
        pad_id=tokenizer.token_to_id("[PAD]"),
        pad_token="[PAD]"
    )
    
    encodings = tokenizer.encode_batch(list(texts), add_special_tokens=True)
    
    input_ids = torch.tensor([e.ids for e in encodings])
    attention_mask = torch.tensor([e.attention_mask for e in encodings])
    
    return input_ids, attention_mask

# Encode data
train_ids, train_mask = encode_texts(train_texts)
val_ids, val_mask = encode_texts(val_texts)
test_ids, test_mask = encode_texts(test_data['review'].values)

# Convert labels (assumes the sentiment column is already encoded as 0/1;
# map string labels such as 'positive'/'negative' to integers first if needed)
train_labels = torch.tensor(train_labels)
val_labels = torch.tensor(val_labels)
test_labels = torch.tensor(test_data['sentiment'].values)

Step 3: Custom Dataset and DataLoader

class SentimentDataset(Dataset):
    def __init__(self, input_ids, attention_mask, labels):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.labels = labels
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels': self.labels[idx]
        }

# Create datasets
train_dataset = SentimentDataset(train_ids, train_mask, train_labels)
val_dataset = SentimentDataset(val_ids, val_mask, val_labels)
test_dataset = SentimentDataset(test_ids, test_mask, test_labels)

# Create data loaders
train_loader = DataLoader(
    train_dataset,
    batch_size=16,
    shuffle=True,
    num_workers=4
)

val_loader = DataLoader(
    val_dataset,
    batch_size=16,
    shuffle=False,
    num_workers=4
)

test_loader = DataLoader(
    test_dataset,
    batch_size=16,
    shuffle=False,
    num_workers=4
)

Step 4: Building the Model with BERT

Let's fine-tune BERT for sentiment analysis:

from transformers import BertModel, BertConfig
import torch.nn as nn

class BERTSentimentClassifier(nn.Module):
    def __init__(self, num_classes=2, dropout=0.1):
        super(BERTSentimentClassifier, self).__init__()
        
        # Load pre-trained BERT encoder
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        
        # Freeze BERT layers (optional; unfreeze them for full fine-tuning and higher accuracy)
        for param in self.bert.parameters():
            param.requires_grad = False
        
        # Classification head
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)
    
    def forward(self, input_ids, attention_mask):
        # Get BERT outputs
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        
        # Use the [CLS] token representation
        pooled_output = outputs.pooler_output
        
        # Classification
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        
        return logits

# Initialize model
model = BERTSentimentClassifier(num_classes=2, dropout=0.1)
model = model.to(device)

# Optimizer with weight decay
# (transformers.AdamW is deprecated in recent versions; torch.optim.AdamW is the
# recommended replacement and does not take a correct_bias argument)
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = AdamW(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01
)

# Learning rate scheduler
scheduler = ReduceLROnPlateau(
    optimizer,
    mode='max',
    factor=0.5,
    patience=2,
    verbose=True
)

# Loss function
criterion = nn.CrossEntropyLoss()

Step 5: Advanced Training Techniques

Let's implement advanced training techniques for better performance:

from tqdm import tqdm
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report

def train_epoch(model, dataloader, optimizer, criterion, device, scheduler=None):
    model.train()
    total_loss = 0
    all_preds = []
    all_labels = []
    
    progress_bar = tqdm(dataloader, desc="Training")
    
    for batch in progress_bar:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        # Forward pass
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask)
        loss = criterion(outputs, labels)
        
        # Backward pass
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        # Update parameters
        optimizer.step()
        
        # Record metrics
        total_loss += loss.item()
        preds = torch.argmax(outputs, dim=1).cpu().numpy()
        all_preds.extend(preds)
        all_labels.extend(labels.cpu().numpy())
        
        # Update progress bar
        progress_bar.set_postfix(loss=loss.item())
    
    # Calculate epoch metrics
    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds)
    
    return avg_loss, accuracy, f1

def evaluate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0
    all_preds = []
    all_labels = []
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            # Forward pass
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)
            
            # Record metrics
            total_loss += loss.item()
            preds = torch.argmax(outputs, dim=1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(labels.cpu().numpy())
    
    # Calculate metrics
    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds)
    
    return avg_loss, accuracy, f1, all_preds, all_labels

# Training loop with early stopping
best_val_acc = 0
patience = 3
trigger_times = 0

for epoch in range(10):
    print(f"\nEpoch {epoch+1}/10")
    
    # Train
    train_loss, train_acc, train_f1 = train_epoch(
        model, train_loader, optimizer, criterion, device
    )
    
    # Evaluate
    val_loss, val_acc, val_f1, _, _ = evaluate(
        model, val_loader, criterion, device
    )
    
    # Update learning rate
    scheduler.step(val_acc)
    
    # Print metrics
    print(f"Train Loss: {train_loss:.4f} | Acc: {train_acc:.4f} | F1: {train_f1:.4f}")
    print(f"Val Loss: {val_loss:.4f} | Acc: {val_acc:.4f} | F1: {val_f1:.4f}")
    
    # Early stopping
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        trigger_times = 0
        # Save best model
        torch.save(model.state_dict(), 'best_bert_sentiment.pth')
    else:
        trigger_times += 1
        if trigger_times >= patience:
            print(f"Early stopping at epoch {epoch+1}")
            break

Step 6: Model Evaluation and Analysis

# Load best model
model.load_state_dict(torch.load('best_bert_sentiment.pth'))

# Final evaluation on test set
test_loss, test_acc, test_f1, test_preds, test_labels = evaluate(
    model, test_loader, criterion, device
)

print(f"\nTest Loss: {test_loss:.4f} | Accuracy: {test_acc:.4f} | F1: {test_f1:.4f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(test_labels, test_preds, target_names=['Negative', 'Positive']))

# Confusion matrix
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

cm = confusion_matrix(test_labels, test_preds)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

Step 7: Error Analysis

Let's analyze where the model makes mistakes:

def analyze_errors(model, dataloader, texts, labels, device):
    model.eval()
    error_indices = []
    
    with torch.no_grad():
        for i, batch in enumerate(dataloader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            batch_labels = batch['labels'].to(device)
            
            outputs = model(input_ids, attention_mask)
            preds = torch.argmax(outputs, dim=1)
            
            # Find indices of incorrect predictions in this batch
            incorrect = (preds != batch_labels).nonzero(as_tuple=True)[0].tolist()
                
            # Store indices of errors
            for idx in incorrect:
                global_idx = i * dataloader.batch_size + idx
                if global_idx < len(texts):
                    error_indices.append(global_idx)
    
    # Print some error examples
    print("\nError Analysis (first 5 examples):")
    for i in error_indices[:5]:
        print(f"\nText: {texts[i][:200]}...")
        # test_preds comes from the earlier evaluate() call (test_loader is not shuffled)
        print(f"True label: {labels[i].item()} | Predicted: {test_preds[i]}")
        print("-" * 50)
    
    return error_indices

# Perform error analysis
error_indices = analyze_errors(
    model, test_loader, test_data['review'].values, test_labels, device
)

Step 8: Model Interpretation with Attention Visualization

Let's visualize attention to understand what the model focuses on:

from IPython.display import display, HTML

# Note: this helper expects a Hugging Face tokenizer (e.g. transformers.BertTokenizer),
# which provides the __call__ and convert_ids_to_tokens methods used below.
def visualize_attention(model, text, tokenizer, device):
    model.eval()
    
    # Tokenize input
    inputs = tokenizer(
        text, 
        return_tensors="pt",
        max_length=128,
        padding="max_length",
        truncation=True
    ).to(device)
    
    # Get model outputs with attention
    with torch.no_grad():
        outputs = model.bert(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            output_attentions=True
        )
    
    # Get attention from last layer
    attentions = outputs.attentions[-1][0]  # [heads, seq_len, seq_len]
    
    # Average over heads
    avg_attention = torch.mean(attentions, dim=0).cpu().numpy()
    
    # Get tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].cpu().numpy())
    
    # Remove special tokens and compute a per-token attention score
    # (mean attention each token receives, averaged over all query positions)
    token_scores = avg_attention.mean(axis=0)  # [seq_len]
    
    display_tokens = []
    display_scores = []
    
    for i, token in enumerate(tokens):
        if token not in ["[CLS]", "[SEP]", "[PAD]"]:
            display_tokens.append(token)
            display_scores.append(token_scores[i])
    
    # Normalize scores to [0, 1] for coloring
    display_scores = np.array(display_scores)
    score_range = display_scores.max() - display_scores.min() + 1e-9
    display_scores = (display_scores - display_scores.min()) / score_range
    
    # Create HTML visualization
    html = """
    <p><strong>Input text:</strong> %s</p>
    <p><strong>Model attention visualization:</strong></p>
    """ % text
    
    for token, strength in zip(display_tokens, display_scores):
        color = f"rgba(100, 150, 255, {strength:.2f})"
        html += f'<span style="background-color: {color}; padding: 2px 4px; margin: 0 1px; border-radius: 3px;">{token}</span> '
    
    display(HTML(html))

# Example usage
sample_text = "This movie was absolutely fantastic! The acting was superb and the plot was engaging."
visualize_attention(model, sample_text, tokenizer, device)

This will display the text with each token highlighted according to how much attention the model paid to it, helping us understand which words influenced the prediction.

Step 9: Model Deployment

Let's create a function to make predictions on new text:

def predict_sentiment(text, model, tokenizer, device, max_length=128):
    # Expects a Hugging Face tokenizer (e.g. transformers.BertTokenizer)
    model.eval()
    
    # Clean and tokenize text
    cleaned_text = clean_text(text)
    inputs = tokenizer(
        cleaned_text,
        return_tensors="pt",
        max_length=max_length,
        padding="max_length",
        truncation=True
    ).to(device)
    
    # Get prediction
    with torch.no_grad():
        outputs = model(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask']
        )
        probs = torch.softmax(outputs, dim=1)
        pred = torch.argmax(probs, dim=1).item()
        confidence = probs[0][pred].item()
    
    # Map prediction to label
    label = "Positive" if pred == 1 else "Negative"
    
    return {
        "text": text,
        "cleaned_text": cleaned_text,
        "sentiment": label,
        "confidence": confidence,
        "probabilities": {
            "negative": probs[0][0].item(),
            "positive": probs[0][1].item()
        }
    }

# Example usage
result = predict_sentiment(
    "I really hated this movie. The plot was boring and the acting was terrible.",
    model,
    tokenizer,
    device
)

print(f"Text: {result['text']}")
print(f"Sentiment: {result['sentiment']} (confidence: {result['confidence']:.2f})")
print(f"Probabilities: Negative: {result['probabilities']['negative']:.2f}, Positive: {result['probabilities']['positive']:.2f}")

Step 10: Model Optimization for Production

For production deployment, we can optimize the model:

# 1. Convert to ONNX format
torch.onnx.export(
    model,
    (
        torch.randint(0, 30000, (1, 128)).to(device),
        torch.ones(1, 128).to(device)
    ),
    "sentiment_bert.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"}
    },
    opset_version=11
)
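
# (Sketch) Verify the export by running the ONNX model with ONNX Runtime.
# Assumes the onnxruntime package is installed; the inputs below are dummy values
# matching the dtypes used for the export above.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("sentiment_bert.onnx")
dummy_ids = np.random.randint(0, 30000, (1, 128)).astype(np.int64)
dummy_mask = np.ones((1, 128), dtype=np.float32)
onnx_logits = session.run(
    ["logits"],
    {"input_ids": dummy_ids, "attention_mask": dummy_mask}
)[0]
print(onnx_logits.shape)  # (1, 2)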

# 2. Quantize the model for faster CPU inference
# (dynamic quantization targets CPU execution, so quantize a CPU copy of the model;
#  LayerNorm has no dynamic-quantized counterpart, so only nn.Linear is listed)
import copy
from torch.quantization import quantize_dynamic

quantized_model = quantize_dynamic(
    copy.deepcopy(model).to('cpu'), {nn.Linear}, dtype=torch.qint8
)

# 3. Save quantized model
torch.save(quantized_model.state_dict(), 'quantized_bert_sentiment.pth')

# 4. Benchmark inference speed
import time

def benchmark(model, dataloader, device, num_batches=10):
    model.eval()
    start_time = time.time()
    
    with torch.no_grad():
        for i, batch in enumerate(dataloader):
            if i >= num_batches:
                break
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            _ = model(input_ids, attention_mask)
    
    elapsed = time.time() - start_time
    print(f"Processed {num_batches * dataloader.batch_size} samples in {elapsed:.2f} seconds")
    print(f"Average inference time: {elapsed / (num_batches * dataloader.batch_size) * 1000:.2f} ms/sample")

# Benchmark original and quantized models
print("Original model benchmark:")
benchmark(model, test_loader, device)

print("\nQuantized model benchmark:")
benchmark(quantized_model, test_loader, device)

Expected Results and Improvements

With this pipeline, you should achieve:

  • ~94-95% accuracy on the IMDB test set
  • Clear understanding of model decisions through attention visualization
  • Production-ready model with quantization for faster inference

To further improve performance:

  1. Use a larger pre-trained model:

    model = BertForSequenceClassification.from_pretrained('bert-large-uncased')
    
  2. Apply advanced fine-tuning techniques (see the sketch after this list):

    • Gradual unfreezing
    • Discriminative learning rates
    • Slanted triangular learning rates
  3. Ensemble multiple models:

    # Train multiple models with different seeds
    models = [BERTSentimentClassifier() for _ in range(5)]

    # Average predictions
    def ensemble_predict(text):
        probs = []
        for model in models:
            result = predict_sentiment(text, model, tokenizer, device)
            probs.append(result['probabilities'])
        avg_neg = np.mean([p['negative'] for p in probs])
        avg_pos = np.mean([p['positive'] for p in probs])
        return "Positive" if avg_pos > avg_neg else "Negative"

  4. Add adversarial training:

    # Generate adversarial examples during training
    def add_adversarial_noise(embeddings, epsilon=0.01):
        noise = torch.randn_like(embeddings) * epsilon
        return embeddings + noise
    
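A minimal sketch of point 2, assuming the BERTSentimentClassifier defined earlier (the layer indices follow bert-base-uncased and the unfreezing schedule is illustrative):

# Discriminative learning rates: smaller LR for the pre-trained encoder,
# larger LR for the freshly initialized classification head
optimizer = torch.optim.AdamW([
    {"params": model.bert.parameters(), "lr": 1e-5},
    {"params": model.classifier.parameters(), "lr": 1e-3},
], weight_decay=0.01)

# Gradual unfreezing: re-enable gradients for one more encoder layer each epoch,
# starting from the top (layer 11) and working downwards
def unfreeze_top_layer(model, epoch, num_layers=12):
    layer_idx = num_layers - 1 - epoch
    if layer_idx >= 0:
        for param in model.bert.encoder.layer[layer_idx].parameters():
            param.requires_grad = True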

Quiz 3: Test Your Understanding of NLP with PyTorch

1. What is the primary advantage of using subword tokenization over word-level tokenization?

A) It reduces the vocabulary size significantly
B) It handles out-of-vocabulary words more effectively
C) It preserves the original word order better
D) It requires less computational resources

2. In a standard LSTM, what is the purpose of the cell state (C_t)?

A) To serve as the output of the LSTM at each time step
B) To carry long-term information through the network
C) To determine which information to forget from the previous state
D) To normalize the input data before processing

3. What problem does the attention mechanism primarily solve in sequence-to-sequence models?

A) The vanishing gradient problem
B) The limitation of fixed-length context vectors for long sequences
C) The high computational cost of training RNNs
D) The difficulty of handling variable-length input sequences

4. In the Transformer architecture, what is the purpose of positional encoding?

A) To normalize the input embeddings
B) To provide information about the position of tokens in the sequence
C) To reduce the dimensionality of the input data
D) To create attention masks for the decoder

5. Which of the following is NOT a component of the Transformer's encoder layer?

A) Multi-head self-attention
B) Position-wise feed-forward network
C) Masked multi-head attention
D) Residual connections and layer normalization

6. When fine-tuning BERT for a text classification task, which token's representation is typically used for classification?

A) The [SEP] token
B) The last token of the sequence
C) The [CLS] token
D) The average of all token representations

7. What is the main difference between GRU and LSTM?

A) GRU has more gates than LSTM
B) GRU combines the cell state and hidden state into a single state
C) LSTM does not use gating mechanisms
D) GRU cannot handle long-term dependencies

8. In the context of NLP, what does "teacher forcing" refer to?

A) Using the model's own predictions as input during training
B) Using the ground truth output as input during training
C) Training the model with human-annotated data only
D) Forcing the model to use specific attention weights

9. Which technique would most directly address the problem of exploding gradients in RNN training?

A) Increasing the learning rate
B) Using a larger batch size
C) Gradient clipping
D) Adding more layers to the network

10. What is the key innovation of the Transformer architecture compared to previous sequence models?

A) The use of convolutional layers for text processing
B) Replacing recurrence with self-attention mechanisms
C) Implementing a novel word embedding technique
D) Using reinforcement learning for sequence generation

11. In BERT, what does the "MLM" (Masked Language Modeling) task involve?

A) Predicting the next sentence in a pair of sentences
B) Predicting randomly masked tokens in the input
C) Translating text from one language to another
D) Classifying the sentiment of a given text

12. What is the primary purpose of layer normalization in deep neural networks?

A) To reduce the number of parameters in the model
B) To make the training process more stable and faster
C) To prevent overfitting by adding noise to the inputs
D) To increase the representational capacity of the network

13. Which of the following best describes "beam search" in sequence generation?

A) A greedy approach that selects the highest probability token at each step
B) A technique that keeps multiple hypotheses during decoding
C) A method for training sequence models with reinforcement learning
D) An algorithm for optimizing the learning rate during training

14. What is the main advantage of using pre-trained language models like BERT for downstream NLP tasks?

A) They require less data for fine-tuning compared to training from scratch
B) They eliminate the need for task-specific architectures
C) They guarantee state-of-the-art performance on all NLP tasks
D) They are faster to train than traditional RNNs

15. In the context of attention mechanisms, what does "self-attention" refer to?

A) Attention between the encoder and decoder in a seq2seq model
B) Attention where the queries, keys, and values all come from the same source
C) Attention that focuses on the most important words in a sentence
D) A specialized attention mechanism for handling long sequences


Answers:

  1. B - Subword tokenization handles OOV words by breaking them into known subwords
  2. B - The cell state carries long-term information through the network
  3. B - Attention solves the limitation of fixed-length context vectors for long sequences
  4. B - Positional encoding provides information about token positions
  5. C - Masked attention is only in the decoder, not the encoder
  6. C - BERT uses the [CLS] token representation for classification
  7. B - GRU combines cell state and hidden state into a single state
  8. B - Teacher forcing uses ground truth as input during training
  9. C - Gradient clipping directly addresses exploding gradients
  10. B - Transformer replaces recurrence with self-attention
  11. B - MLM involves predicting randomly masked tokens
  12. B - Layer normalization stabilizes and speeds up training
  13. B - Beam search keeps multiple hypotheses during decoding
  14. A - Pre-trained models require less data for fine-tuning
  15. B - Self-attention uses same source for queries, keys, and values

Summary and What's Next in Part 4

In this comprehensive Part 3 of our PyTorch Masterclass, we've covered:

  • Text data processing: Cleaning, tokenization, and vocabulary building
  • Word embeddings: From one-hot to contextual representations
  • Recurrent Neural Networks: Theory and implementation of RNNs, LSTMs, and GRUs
  • Sequence-to-sequence models: Machine translation with attention
  • Attention mechanisms: The foundation of modern NLP
  • Transformer architecture: Complete implementation and understanding
  • Sentiment analysis: End-to-end pipeline from data to deployment

You now have the skills to:

  • Process and represent text for deep learning
  • Build and train RNNs, LSTMs, and GRUs for sequence tasks
  • Implement attention mechanisms for better context understanding
  • Work with Transformer models like BERT for state-of-the-art NLP
  • Create complete NLP applications from scratch

What's Coming in Part 4?

In Part 4, we'll dive into Generative Models with PyTorch:

  • Autoencoders: Understanding latent space representations
  • Variational Autoencoders (VAEs): Probabilistic generative modeling
  • Generative Adversarial Networks (GANs): Creating realistic synthetic data
  • Diffusion Models: The new frontier in generative AI
  • Text-to-Image Generation: Building models like DALL-E
  • Music and Audio Generation: Creating novel audio content
  • Evaluating Generative Models: Metrics and qualitative assessment

We'll build a complete image generation pipeline and explore the architectures behind cutting-edge generative AI systems.

👉 Stay tuned for Part 4: Generative Models with PyTorch


Hashtags: #PyTorch #NLP #RNN #LSTM #GRU #Transformers #Attention #NaturalLanguageProcessing #TextClassification #SentimentAnalysis #WordEmbeddings #DeepLearning #MachineLearning #AI #SequenceModeling #BERT #GPT #TextProcessing #PyTorchNLP #HuggingFace #TransformerArchitecture #SelfAttention #MachineTranslation #TextGeneration #NamedEntityRecognition #PartOfSpeechTagging #Tokenization #WordPiece #BytePairEncoding #GloVe #Word2Vec #FastText #Seq2Seq #EncoderDecoder #BeamSearch #TeacherForcing #TextSummarization #QuestionAnswering #TextToText #FineTuning #PretrainedModels #PyTorchTutorial #DeepLearningCourse #AIEngineering #NLPEngineering