# PyTorch Masterclass: Part 3 – Deep Learning for Natural Language Processing with PyTorch
## **Training a Transformer**
The training process is similar to seq2seq models:
```python
# Hyperparameters
SRC_VOCAB_SIZE = len(de_vocab)
TRG_VOCAB_SIZE = len(en_vocab)
EMBED_SIZE = 512
NUM_LAYERS = 6
FORWARD_EXPANSION = 4
HEADS = 8
DROPOUT = 0.1
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
MAX_LENGTH = 100
LEARNING_RATE = 0.0005
BATCH_SIZE = 32
# Initialize model
model = Transformer(
    SRC_VOCAB_SIZE,
    TRG_VOCAB_SIZE,
    de_vocab['<pad>'],
    en_vocab['<pad>'],
    EMBED_SIZE,
    NUM_LAYERS,
    FORWARD_EXPANSION,
    HEADS,
    DROPOUT,
    DEVICE,
    MAX_LENGTH
).to(DEVICE)
# Optimizer and loss
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss(ignore_index=en_vocab['<pad>'])
# Training loop (similar to seq2seq)
def train(model, dataloader, optimizer, criterion, device):
    model.train()
    epoch_loss = 0
    for src, trg in dataloader:
        src, trg = src.to(device), trg.to(device)
        # Shift target for teacher forcing
        output = model(src, trg[:, :-1])
        output = output.reshape(-1, output.shape[2])
        trg = trg[:, 1:].reshape(-1)
        optimizer.zero_grad()
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(dataloader)
```
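A matching evaluation loop can be sketched the same way; the version below simply reuses the training loop's target-shifting logic without gradient updates (a minimal sketch, assuming the same `(src, trg)` dataloader format as above):
```python
def evaluate(model, dataloader, criterion, device):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for src, trg in dataloader:
            src, trg = src.to(device), trg.to(device)
            # Same target shifting as in training, but no parameter updates
            output = model(src, trg[:, :-1])
            output = output.reshape(-1, output.shape[2])
            loss = criterion(output, trg[:, 1:].reshape(-1))
            epoch_loss += loss.item()
    return epoch_loss / len(dataloader)
```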
### **Pre-trained Transformer Models**
Instead of training from scratch, we can use pre-trained models:
#### **1. BERT (Bidirectional Encoder Representations from Transformers)**
```python
from transformers import BertModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
```
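The `last_hidden_state` tensor has shape `[batch_size, seq_len, hidden_size]`. For sentence-level tasks, a common choice is the hidden state at the `[CLS]` position (index 0), as in this small sketch:
```python
print(last_hidden_states.shape)              # torch.Size([1, seq_len, 768]) for bert-base-uncased
cls_embedding = last_hidden_states[:, 0, :]  # sentence-level vector from the [CLS] position
print(cls_embedding.shape)                   # torch.Size([1, 768])
```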
#### **2. GPT (Generative Pre-trained Transformer)**
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss
logits = outputs.logits
```
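Because GPT-2 is an autoregressive language model, the same model object can also generate text with `generate` (a minimal sketch; the sampling parameters are illustrative):
```python
generated = model.generate(
    inputs["input_ids"],
    max_length=30,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token; reuse EOS to avoid warnings
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```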
#### **3. T5 (Text-to-Text Transfer Transformer)**
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
input_text = "translate English to German: Hello, my dog is cute"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))
```
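T5 treats every task as text-to-text, so switching tasks only requires changing the prefix. A sketch with a summarization prefix (the input string is just an illustration):
```python
article = ("PyTorch is an open source machine learning framework that accelerates "
           "the path from research prototyping to production deployment.")
inputs = tokenizer("summarize: " + article, return_tensors="pt")
summary_ids = model.generate(**inputs, max_length=20)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```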
### **Fine-tuning BERT for Text Classification**
Let's fine-tune BERT for sentiment analysis:
```python
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
).to(device)
# Custom dataset
class BertDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer.encode_plus(
            self.texts[idx],
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }
# Prepare data
train_texts = [text for _, text in train_data]
train_labels = [1 if label == 'pos' else 0 for label, _ in train_data]
train_dataset = BertDataset(train_texts, train_labels, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
# Training setup
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss().to(device)
# Training loop
for epoch in range(3):
    model.train()
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(
            input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"Epoch {epoch+1} completed")
# Save model
model.save_pretrained('./sentiment-bert')
tokenizer.save_pretrained('./sentiment-bert')
```
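After saving, the fine-tuned checkpoint can be reloaded for inference; one convenient (though optional) route is the `pipeline` API. A minimal sketch:
```python
from transformers import pipeline

# Reload the fine-tuned checkpoint saved above
classifier = pipeline(
    "text-classification",
    model="./sentiment-bert",
    tokenizer="./sentiment-bert"
)
print(classifier("A genuinely moving film with terrific performances."))
# e.g. [{'label': 'LABEL_1', 'score': 0.98}]  (LABEL_1 corresponds to positive in this setup)
```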
### **Transformer Variants and Improvements**
#### **1. Transformer-XL**
Handles longer sequences with segment-level recurrence:
```python
from transformers import TransfoXLModel, TransfoXLTokenizer
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
```
#### **2. Longformer**
Uses a sparse sliding-window attention pattern (plus selected global tokens) to handle long documents:
```python
from transformers import LongformerModel, LongformerTokenizer
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained('allenai/longformer-base-4096')
```
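Because Longformer's sliding-window attention is local by default, task-relevant positions (such as the first token) are marked for global attention explicitly. A minimal sketch:
```python
import torch

inputs = tokenizer("A very long document " * 200, return_tensors="pt")
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # give the first token global attention
outputs = model(**inputs, global_attention_mask=global_attention_mask)
```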
#### **3. Reformer**
Uses locality-sensitive hashing for efficient attention:
```python
from transformers import ReformerModel, ReformerTokenizer
tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')
model = ReformerModel.from_pretrained('google/reformer-crime-and-punishment')
```
#### **4. BigBird**
Combines random, window and global attention:
```python
from transformers import BigBirdModel, BigBirdTokenizer
tokenizer = BigBirdTokenizer.from_pretrained('google/bigbird-roberta-base')
model = BigBirdModel.from_pretrained('google/bigbird-roberta-base')
```
---
## **Building a Sentiment Analysis Model from Scratch**
Let's build a complete sentiment analysis pipeline using the techniques we've learned.
### **Step 1: Data Preparation**
We'll use the IMDB dataset with enhanced preprocessing:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
import re
import string
import torch
from torch.utils.data import Dataset, DataLoader
# Load data
train_data = pd.read_csv('imdb_train.csv')
test_data = pd.read_csv('imdb_test.csv')
# Enhanced text cleaning
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    # Remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
# Apply cleaning
train_data['review'] = train_data['review'].apply(clean_text)
test_data['review'] = test_data['review'].apply(clean_text)
# Split training data into train and validation
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_data['review'].values,
    train_data['sentiment'].values,
    test_size=0.1,
    random_state=42
)
```
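A quick check of what `clean_text` produces:
```python
print(clean_text("<b>GREAT movie!!!</b> Visit https://example.com"))
# great movie visit
```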
### **Step 2: Advanced Tokenization with Subwords**
Let's train a BERT-style WordPiece tokenizer for better handling of rare words:
```python
from tokenizers import BertWordPieceTokenizer
# Initialize tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=True,
    lowercase=True
)
# Train tokenizer on the raw review text (a plain-text file)
tokenizer.train(
    files=["imdb_train.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=[
        "[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"
    ],
    limit_alphabet=1000,
    wordpieces_prefix="##"
)
# Save tokenizer
tokenizer.save_model("imdb_tokenizer")
# Tokenization function
def encode_texts(texts, max_length=256):
    # The tokenizers library configures truncation/padding on the tokenizer itself
    tokenizer.enable_truncation(max_length=max_length)
    tokenizer.enable_padding(
        length=max_length,
        pad_id=tokenizer.token_to_id("[PAD]"),
        pad_token="[PAD]"
    )
    encodings = tokenizer.encode_batch(list(texts))
    input_ids = torch.tensor([e.ids for e in encodings])
    attention_mask = torch.tensor([e.attention_mask for e in encodings])
    return input_ids, attention_mask
# Encode data
train_ids, train_mask = encode_texts(train_texts)
val_ids, val_mask = encode_texts(val_texts)
test_ids, test_mask = encode_texts(test_data['review'].values)
# Convert labels to tensors (assumes the sentiment column is already numeric 0/1;
# map string labels to integers first if necessary)
train_labels = torch.tensor(train_labels)
val_labels = torch.tensor(val_labels)
test_labels = torch.tensor(test_data['sentiment'].values)
```
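It is worth inspecting how the trained tokenizer splits a review into subwords: rare words break into `##`-prefixed pieces while frequent words stay whole (the exact split depends on the learned vocabulary):
```python
enc = tokenizer.encode("an unforgettable masterpiece")
print(enc.tokens[:10])
# e.g. ['[CLS]', 'an', 'un', '##forget', '##table', 'masterpiece', '[SEP]', '[PAD]', ...]
```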
### **Step 3: Custom Dataset and DataLoader**
```python
class SentimentDataset(Dataset):
    def __init__(self, input_ids, attention_mask, labels):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels': self.labels[idx]
        }
# Create datasets
train_dataset = SentimentDataset(train_ids, train_mask, train_labels)
val_dataset = SentimentDataset(val_ids, val_mask, val_labels)
test_dataset = SentimentDataset(test_ids, test_mask, test_labels)
# Create data loaders
train_loader = DataLoader(
    train_dataset,
    batch_size=16,
    shuffle=True,
    num_workers=4
)
val_loader = DataLoader(
    val_dataset,
    batch_size=16,
    shuffle=False,
    num_workers=4
)
test_loader = DataLoader(
    test_dataset,
    batch_size=16,
    shuffle=False,
    num_workers=4
)
```
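A quick sanity check that batching produces the expected shapes:
```python
batch = next(iter(train_loader))
print(batch['input_ids'].shape)       # torch.Size([16, 256])
print(batch['attention_mask'].shape)  # torch.Size([16, 256])
print(batch['labels'].shape)          # torch.Size([16])
```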
### **Step 4: Building the Model with BERT**
Let's fine-tune BERT for sentiment analysis:
```python
from transformers import BertModel
import torch.nn as nn

class BERTSentimentClassifier(nn.Module):
    def __init__(self, num_classes=2, dropout=0.1):
        super(BERTSentimentClassifier, self).__init__()
        # Load pre-trained BERT encoder
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # Freeze BERT layers (optional; unfreeze for full fine-tuning and higher accuracy)
        for param in self.bert.parameters():
            param.requires_grad = False
        # Classification head
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        # Get BERT outputs
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        # Use the pooled [CLS] representation
        pooled_output = outputs.pooler_output
        # Classification
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits
# Initialize model
model = BERTSentimentClassifier(num_classes=2, dropout=0.1)
model = model.to(device)
# Advanced optimizer with weight decay
# (transformers.AdamW is deprecated; torch.optim.AdamW is the drop-in replacement)
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = AdamW(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01
)
# Learning rate scheduler: halve the LR when validation accuracy plateaus
scheduler = ReduceLROnPlateau(
    optimizer,
    mode='max',
    factor=0.5,
    patience=2
)
# Loss function
criterion = nn.CrossEntropyLoss()
```
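With the encoder frozen, only the classification head receives gradients, which keeps the trainable parameter count tiny. This sketch counts what will actually be updated:
```python
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,}")
# With BERT frozen, only the linear head's ~1.5K parameters are trainable
```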
### **Step 5: Advanced Training Techniques**
Let's implement advanced training techniques for better performance:
```python
from tqdm import tqdm
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report
def train_epoch(model, dataloader, optimizer, criterion, device, scheduler=None):
    model.train()
    total_loss = 0
    all_preds = []
    all_labels = []
    progress_bar = tqdm(dataloader, desc="Training")
    for batch in progress_bar:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # Forward pass
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask)
        loss = criterion(outputs, labels)
        # Backward pass
        loss.backward()
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Update parameters
        optimizer.step()
        # Record metrics
        total_loss += loss.item()
        preds = torch.argmax(outputs, dim=1).cpu().numpy()
        all_preds.extend(preds)
        all_labels.extend(labels.cpu().numpy())
        # Update progress bar
        progress_bar.set_postfix(loss=loss.item())
    # Calculate epoch metrics
    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds)
    return avg_loss, accuracy, f1
def evaluate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            # Forward pass
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)
            # Record metrics
            total_loss += loss.item()
            preds = torch.argmax(outputs, dim=1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(labels.cpu().numpy())
    # Calculate metrics
    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds)
    return avg_loss, accuracy, f1, all_preds, all_labels
# Training loop with early stopping
best_val_acc = 0
patience = 3
trigger_times = 0

for epoch in range(10):
    print(f"\nEpoch {epoch+1}/10")
    # Train
    train_loss, train_acc, train_f1 = train_epoch(
        model, train_loader, optimizer, criterion, device
    )
    # Evaluate
    val_loss, val_acc, val_f1, _, _ = evaluate(
        model, val_loader, criterion, device
    )
    # Update learning rate
    scheduler.step(val_acc)
    # Print metrics
    print(f"Train Loss: {train_loss:.4f} | Acc: {train_acc:.4f} | F1: {train_f1:.4f}")
    print(f"Val Loss: {val_loss:.4f} | Acc: {val_acc:.4f} | F1: {val_f1:.4f}")
    # Early stopping
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        trigger_times = 0
        # Save best model
        torch.save(model.state_dict(), 'best_bert_sentiment.pth')
    else:
        trigger_times += 1
        if trigger_times >= patience:
            print(f"Early stopping at epoch {epoch+1}")
            break
```
### **Step 6: Model Evaluation and Analysis**
```python
# Load best model
model.load_state_dict(torch.load('best_bert_sentiment.pth', map_location=device))
# Final evaluation on test set
test_loss, test_acc, test_f1, test_preds, test_labels = evaluate(
model, test_loader, criterion, device
)
print(f"\nTest Loss: {test_loss:.4f} | Accuracy: {test_acc:.4f} | F1: {test_f1:.4f}")
# Detailed classification report
print("\nClassification Report:")
print(classification_report(test_labels, test_preds, target_names=['Negative', 'Positive']))
# Confusion matrix
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
cm = confusion_matrix(test_labels, test_preds)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
```
### **Step 7: Error Analysis**
Let's analyze where the model makes mistakes:
```python
def analyze_errors(model, dataloader, texts, labels, device):
    model.eval()
    error_indices = []
    error_preds = {}
    with torch.no_grad():
        for i, batch in enumerate(dataloader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            batch_labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask)
            preds = torch.argmax(outputs, dim=1)
            # Indices of incorrect predictions within this batch
            incorrect = (preds != batch_labels).nonzero(as_tuple=True)[0].tolist()
            # Map batch-local indices to global dataset indices (requires shuffle=False)
            for idx in incorrect:
                global_idx = i * dataloader.batch_size + idx
                if global_idx < len(texts):
                    error_indices.append(global_idx)
                    error_preds[global_idx] = preds[idx].item()
    # Print some error examples
    print("\nError Analysis (first 5 examples):")
    for i in error_indices[:5]:
        print(f"\nText: {texts[i][:200]}...")
        print(f"True label: {labels[i].item()} | Predicted: {error_preds[i]}")
        print("-" * 50)
    return error_indices
# Perform error analysis
error_indices = analyze_errors(
    model, test_loader, test_data['review'].values, test_labels, device
)
```
### **Step 8: Model Interpretation with Attention Visualization**
Let's visualize attention to understand what the model focuses on:
```python
from IPython.display import display, HTML
def visualize_attention(model, text, tokenizer, device):
    model.eval()
    # Tokenize input (assumes a Hugging Face tokenizer such as BertTokenizer)
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        padding="max_length",
        truncation=True
    ).to(device)
    # Get model outputs with attention weights
    with torch.no_grad():
        outputs = model.bert(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            output_attentions=True
        )
    # Attention from the last layer: [heads, seq_len, seq_len]
    attentions = outputs.attentions[-1][0]
    # Average over heads: [seq_len, seq_len]
    avg_attention = torch.mean(attentions, dim=0).cpu().numpy()
    # Get tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].cpu().tolist())
    # Attention each token receives, averaged over all query positions
    received = avg_attention.mean(axis=0)
    # Drop special tokens for display
    keep = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]", "[PAD]")]
    display_tokens = [tokens[i] for i in keep]
    strengths = received[keep]
    # Normalize strengths to [0, 1] so they map cleanly to colour intensities
    strengths = (strengths - strengths.min()) / (strengths.max() - strengths.min() + 1e-8)
    # Create HTML visualization
    html = f"<p><strong>Input text:</strong> {text}</p>"
    html += "<p><strong>Model attention visualization:</strong></p>"
    for token, strength in zip(display_tokens, strengths):
        color = f"rgba(100, 150, 255, {strength:.2f})"
        html += (f'<span style="background-color: {color}; padding: 2px 4px; '
                 f'margin: 0 1px; border-radius: 3px;">{token}</span> ')
    display(HTML(html))
# Example usage
sample_text = "This movie was absolutely fantastic! The acting was superb and the plot was engaging."
visualize_attention(model, sample_text, tokenizer, device)
```
This will display the text with each token highlighted according to how much attention the model paid to it, helping us understand which words influenced the prediction.
### **Step 9: Model Deployment**
Let's create a function to make predictions on new text:
```python
def predict_sentiment(text, model, tokenizer, device, max_length=128):
    model.eval()
    # Clean and tokenize text
    cleaned_text = clean_text(text)
    inputs = tokenizer(
        cleaned_text,
        return_tensors="pt",
        max_length=max_length,
        padding="max_length",
        truncation=True
    ).to(device)
    # Get prediction
    with torch.no_grad():
        outputs = model(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask']
        )
        probs = torch.softmax(outputs, dim=1)
        pred = torch.argmax(probs, dim=1).item()
        confidence = probs[0][pred].item()
    # Map prediction to label
    label = "Positive" if pred == 1 else "Negative"
    return {
        "text": text,
        "cleaned_text": cleaned_text,
        "sentiment": label,
        "confidence": confidence,
        "probabilities": {
            "negative": probs[0][0].item(),
            "positive": probs[0][1].item()
        }
    }
# Example usage
result = predict_sentiment(
    "I really hated this movie. The plot was boring and the acting was terrible.",
    model,
    tokenizer,
    device
)
print(f"Text: {result['text']}")
print(f"Sentiment: {result['sentiment']} (confidence: {result['confidence']:.2f})")
print(f"Probabilities: Negative: {result['probabilities']['negative']:.2f}, Positive: {result['probabilities']['positive']:.2f}")
```
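The same function can batch-score several reviews at once, which is handy for smoke-testing before deployment:
```python
reviews = [
    "An instant classic. I was hooked from the first scene.",
    "Two hours of my life I will never get back."
]
for review in reviews:
    res = predict_sentiment(review, model, tokenizer, device)
    print(f"{res['sentiment']:>8} ({res['confidence']:.2f}): {review}")
```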
### **Step 10: Model Optimization for Production**
For production deployment, we can optimize the model:
```python
# 1. Convert to ONNX format (dummy inputs trace the graph; dynamic axes keep batch/sequence flexible)
torch.onnx.export(
    model,
    (
        torch.randint(0, 30000, (1, 128)).to(device),
        torch.ones(1, 128, dtype=torch.long).to(device)
    ),
    "sentiment_bert.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"}
    },
    opset_version=11
)
# 2. Quantize the model for faster inference (dynamic quantization runs on CPU)
import copy
from torch.quantization import quantize_dynamic
quantized_model = quantize_dynamic(
    copy.deepcopy(model).to('cpu'), {nn.Linear}, dtype=torch.qint8
)
# 3. Save quantized model
torch.save(quantized_model.state_dict(), 'quantized_bert_sentiment.pth')
# 4. Benchmark inference speed
import time
def benchmark(model, dataloader, device, num_batches=10):
    model.eval()
    start_time = time.time()
    with torch.no_grad():
        for i, batch in enumerate(dataloader):
            if i >= num_batches:
                break
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            _ = model(input_ids, attention_mask)
    elapsed = time.time() - start_time
    print(f"Processed {num_batches * dataloader.batch_size} samples in {elapsed:.2f} seconds")
    print(f"Average inference time: {elapsed / (num_batches * dataloader.batch_size) * 1000:.2f} ms/sample")

# Benchmark original and quantized models (the quantized model must run on CPU)
print("Original model benchmark:")
benchmark(model, test_loader, device)
print("\nQuantized model benchmark (CPU):")
benchmark(quantized_model, test_loader, torch.device('cpu'))
```
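To confirm the exported graph works outside PyTorch, the ONNX file can be run with ONNX Runtime (a minimal sketch, assuming `onnxruntime` is installed):
```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("sentiment_bert.onnx")
dummy_ids = np.random.randint(0, 30000, size=(1, 128), dtype=np.int64)
dummy_mask = np.ones((1, 128), dtype=np.int64)
logits = session.run(
    ["logits"],
    {"input_ids": dummy_ids, "attention_mask": dummy_mask}
)[0]
print(logits.shape)  # (1, 2)
```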
### **Expected Results and Improvements**
With this pipeline, you should achieve:
- **~94-95% accuracy** on the IMDB test set (this assumes the BERT encoder is fine-tuned; with the encoder frozen, expect noticeably lower accuracy)
- A clearer understanding of model decisions through attention visualization
- A production-ready model with quantization for faster inference
To further improve performance:
1. **Use a larger pre-trained model**:
```python
model = BertForSequenceClassification.from_pretrained('bert-large-uncased')
```
2. **Apply advanced fine-tuning techniques** (a minimal sketch of gradual unfreezing with discriminative learning rates follows after this list):
- Gradual unfreezing
- Discriminative learning rates
- Slanted triangular learning rates
3. **Ensemble multiple models**:
```python
# Train multiple models with different seeds
models = [BERTSentimentClassifier() for _ in range(5)]

# Average predictions
def ensemble_predict(text):
    probs = []
    for model in models:
        result = predict_sentiment(text, model, tokenizer, device)
        probs.append(result['probabilities'])
    avg_neg = np.mean([p['negative'] for p in probs])
    avg_pos = np.mean([p['positive'] for p in probs])
    return "Positive" if avg_pos > avg_neg else "Negative"
```
4. **Add adversarial training**:
```python
# Perturb embeddings during training (random noise shown here; true adversarial
# training would use gradient-based perturbations such as FGSM)
def add_adversarial_noise(embeddings, epsilon=0.01):
    noise = torch.randn_like(embeddings) * epsilon
    return embeddings + noise
```
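Below is a minimal sketch of gradual unfreezing with discriminative learning rates for the `BERTSentimentClassifier` defined earlier (the helper names, decay factor, and schedule are illustrative, not part of the pipeline above):
```python
def build_discriminative_optimizer(model, base_lr=2e-5, decay=0.95):
    # Deeper (later) encoder layers get larger learning rates than earlier ones
    param_groups = [{'params': model.classifier.parameters(), 'lr': base_lr}]
    for depth, layer in enumerate(reversed(model.bert.encoder.layer)):
        param_groups.append({
            'params': layer.parameters(),
            'lr': base_lr * (decay ** (depth + 1))
        })
    return torch.optim.AdamW(param_groups)

def unfreeze_top_layers(model, num_layers):
    # Gradual unfreezing: keep BERT frozen except the top `num_layers` encoder blocks
    for param in model.bert.parameters():
        param.requires_grad = False
    for layer in model.bert.encoder.layer[-num_layers:]:
        for param in layer.parameters():
            param.requires_grad = True

# Example schedule: unfreeze one more block each epoch
# for epoch in range(3):
#     unfreeze_top_layers(model, num_layers=epoch + 1)
#     optimizer = build_discriminative_optimizer(model)
```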
---
## **Quiz 3: Test Your Understanding of NLP with PyTorch**
**1. What is the primary advantage of using subword tokenization over word-level tokenization?**
A) It reduces the vocabulary size significantly
B) It handles out-of-vocabulary words more effectively
C) It preserves the original word order better
D) It requires less computational resources
**2. In a standard LSTM, what is the purpose of the cell state (C_t)?**
A) To serve as the output of the LSTM at each time step
B) To carry long-term information through the network
C) To determine which information to forget from the previous state
D) To normalize the input data before processing
**3. What problem does the attention mechanism primarily solve in sequence-to-sequence models?**
A) The vanishing gradient problem
B) The limitation of fixed-length context vectors for long sequences
C) The high computational cost of training RNNs
D) The difficulty of handling variable-length input sequences
**4. In the Transformer architecture, what is the purpose of positional encoding?**
A) To normalize the input embeddings
B) To provide information about the position of tokens in the sequence
C) To reduce the dimensionality of the input data
D) To create attention masks for the decoder
**5. Which of the following is NOT a component of the Transformer's encoder layer?**
A) Multi-head self-attention
B) Position-wise feed-forward network
C) Masked multi-head attention
D) Residual connections and layer normalization
**6. When fine-tuning BERT for a text classification task, which token's representation is typically used for classification?**
A) The [SEP] token
B) The last token of the sequence
C) The [CLS] token
D) The average of all token representations
**7. What is the main difference between GRU and LSTM?**
A) GRU has more gates than LSTM
B) GRU combines the cell state and hidden state into a single state
C) LSTM does not use gating mechanisms
D) GRU cannot handle long-term dependencies
**8. In the context of NLP, what does "teacher forcing" refer to?**
A) Using the model's own predictions as input during training
B) Using the ground truth output as input during training
C) Training the model with human-annotated data only
D) Forcing the model to use specific attention weights
**9. Which technique would most directly address the problem of exploding gradients in RNN training?**
A) Increasing the learning rate
B) Using a larger batch size
C) Gradient clipping
D) Adding more layers to the network
**10. What is the key innovation of the Transformer architecture compared to previous sequence models?**
A) The use of convolutional layers for text processing
B) Replacing recurrence with self-attention mechanisms
C) Implementing a novel word embedding technique
D) Using reinforcement learning for sequence generation
**11. In BERT, what does the "MLM" (Masked Language Modeling) task involve?**
A) Predicting the next sentence in a pair of sentences
B) Predicting randomly masked tokens in the input
C) Translating text from one language to another
D) Classifying the sentiment of a given text
**12. What is the primary purpose of layer normalization in deep neural networks?**
A) To reduce the number of parameters in the model
B) To make the training process more stable and faster
C) To prevent overfitting by adding noise to the inputs
D) To increase the representational capacity of the network
**13. Which of the following best describes "beam search" in sequence generation?**
A) A greedy approach that selects the highest probability token at each step
B) A technique that keeps multiple hypotheses during decoding
C) A method for training sequence models with reinforcement learning
D) An algorithm for optimizing the learning rate during training
**14. What is the main advantage of using pre-trained language models like BERT for downstream NLP tasks?**
A) They require less data for fine-tuning compared to training from scratch
B) They eliminate the need for task-specific architectures
C) They guarantee state-of-the-art performance on all NLP tasks
D) They are faster to train than traditional RNNs
**15. In the context of attention mechanisms, what does "self-attention" refer to?**
A) Attention between the encoder and decoder in a seq2seq model
B) Attention where the queries, keys, and values all come from the same source
C) Attention that focuses on the most important words in a sentence
D) A specialized attention mechanism for handling long sequences
---
**Answers:**
1. B - Subword tokenization handles OOV words by breaking them into known subwords
2. B - The cell state carries long-term information through the network
3. B - Attention solves the limitation of fixed-length context vectors for long sequences
4. B - Positional encoding provides information about token positions
5. C - Masked attention is only in the decoder, not the encoder
6. C - BERT uses the [CLS] token representation for classification
7. B - GRU combines cell state and hidden state into a single state
8. B - Teacher forcing uses ground truth as input during training
9. C - Gradient clipping directly addresses exploding gradients
10. B - Transformer replaces recurrence with self-attention
11. B - MLM involves predicting randomly masked tokens
12. B - Layer normalization stabilizes and speeds up training
13. B - Beam search keeps multiple hypotheses during decoding
14. A - Pre-trained models require less data for fine-tuning
15. B - Self-attention uses same source for queries, keys, and values
---
## **Summary and What's Next in Part 4**
In this **comprehensive Part 3** of our PyTorch Masterclass, we've covered:
- **Text data processing**: Cleaning, tokenization, and vocabulary building
- **Word embeddings**: From one-hot to contextual representations
- **Recurrent Neural Networks**: Theory and implementation of RNNs, LSTMs, and GRUs
- **Sequence-to-sequence models**: Machine translation with attention
- **Attention mechanisms**: The foundation of modern NLP
- **Transformer architecture**: Complete implementation and understanding
- **Sentiment analysis**: End-to-end pipeline from data to deployment
You now have the skills to:
- Process and represent text for deep learning
- Build and train RNNs, LSTMs, and GRUs for sequence tasks
- Implement attention mechanisms for better context understanding
- Work with Transformer models like BERT for state-of-the-art NLP
- Create complete NLP applications from scratch
### **What's Coming in Part 4?**
In **Part 4**, we'll dive into **Generative Models** with PyTorch:
- **Autoencoders**: Understanding latent space representations
- **Variational Autoencoders (VAEs)**: Probabilistic generative modeling
- **Generative Adversarial Networks (GANs)**: Creating realistic synthetic data
- **Diffusion Models**: The new frontier in generative AI
- **Text-to-Image Generation**: Building models like DALL-E
- **Music and Audio Generation**: Creating novel audio content
- **Evaluating Generative Models**: Metrics and qualitative assessment
We'll build a **complete image generation pipeline** and explore the architectures behind cutting-edge generative AI systems.
👉 **Stay tuned for Part 4: Generative Models with PyTorch**
---
**Hashtags:** #PyTorch #NLP #RNN #LSTM #GRU #Transformers #Attention #NaturalLanguageProcessing #TextClassification #SentimentAnalysis #WordEmbeddings #DeepLearning #MachineLearning #AI #SequenceModeling #BERT #GPT #TextProcessing #PyTorchNLP #HuggingFace #TransformerArchitecture #SelfAttention #MachineTranslation #TextGeneration #NamedEntityRecognition #PartOfSpeechTagging #Tokenization #WordPiece #BytePairEncoding #GloVe #Word2Vec #FastText #Seq2Seq #EncoderDecoder #BeamSearch #TeacherForcing #TextSummarization #QuestionAnswering #TextToText #FineTuning #PretrainedModels #PyTorchTutorial #DeepLearningCourse #AIEngineering #NLPEngineering