
PyTorch Masterclass: Part 3 – Deep Learning for Natural Language Processing with PyTorch

Training a Transformer

The training process is similar to seq2seq models:

# Hyperparameters
SRC_VOCAB_SIZE = len(de_vocab)
TRG_VOCAB_SIZE = len(en_vocab)
EMBED_SIZE = 512
NUM_LAYERS = 6
FORWARD_EXPANSION = 4
HEADS = 8
DROPOUT = 0.1
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
MAX_LENGTH = 100
LEARNING_RATE = 0.0005
BATCH_SIZE = 32

# Initialize model
model = Transformer(
    SRC_VOCAB_SIZE,
    TRG_VOCAB_SIZE,
    de_vocab['<pad>'],
    en_vocab['<pad>'],
    EMBED_SIZE,
    NUM_LAYERS,
    FORWARD_EXPANSION,
    HEADS,
    DROPOUT,
    DEVICE,
    MAX_LENGTH
).to(DEVICE)

# Optimizer and loss
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss(ignore_index=en_vocab['<pad>'])

# Training loop (similar to seq2seq)
def train(model, dataloader, optimizer, criterion, device):
    model.train()
    epoch_loss = 0
    
    for src, trg in dataloader:
        src, trg = src.to(device), trg.to(device)
        
        # Shift target for teacher forcing
        output = model(src, trg[:, :-1])
        output = output.reshape(-1, output.shape[2])
        trg = trg[:, 1:].reshape(-1)
        
        optimizer.zero_grad()
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
        optimizer.step()
        
        epoch_loss += loss.item()
    
    return epoch_loss / len(dataloader)
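
A minimal driver loop for this train function, assuming a train_dataloader built as in the earlier seq2seq sections (the name here is illustrative), could look like this:

N_EPOCHS = 10

for epoch in range(N_EPOCHS):
    # One full pass over the training data using the train() function above
    train_loss = train(model, train_dataloader, optimizer, criterion, DEVICE)
    print(f"Epoch {epoch+1:02d} | Train Loss: {train_loss:.3f}")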

Pre-trained Transformer Models

Instead of training from scratch, we can use pre-trained models:

1. BERT (Bidirectional Encoder Representations from Transformers)

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
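
The first position of last_hidden_state corresponds to the [CLS] token, which is commonly used as a sentence-level representation; BERT also exposes a pooled version of it:

cls_embedding = last_hidden_states[:, 0, :]   # shape: [batch_size, hidden_size]
pooled_output = outputs.pooler_output         # tanh-projected [CLS] representation
print(cls_embedding.shape, pooled_output.shape)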

2. GPT (Generative Pre-trained Transformer)

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss
logits = outputs.logits
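
Beyond computing the language-modeling loss, GPT-2 can generate continuations with generate (a minimal sketch; the sampling settings below are illustrative):

generated = model.generate(
    inputs["input_ids"],
    max_new_tokens=20,                  # length of the continuation
    do_sample=True,                     # sample instead of greedy decoding
    top_k=50,
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))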

3. T5 (Text-to-Text Transfer Transformer)

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

input_text = "translate English to German: Hello, my dog is cute"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))
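
Because T5 casts every task as text-to-text, switching tasks only changes the input prefix. A minimal sketch (the input text is illustrative):

inputs = tokenizer(
    "summarize: PyTorch is an open source deep learning framework that provides "
    "tensor computation with GPU acceleration and automatic differentiation.",
    return_tensors="pt"
)
summary_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))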

Fine-tuning BERT for Text Classification

Let's fine-tune BERT for sentiment analysis:

from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
).to(device)

# Custom dataset
class BertDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        encoding = self.tokenizer.encode_plus(
            self.texts[idx],
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

# Prepare data
train_texts = [text for _, text in train_data]
train_labels = [1 if label == 'pos' else 0 for label, _ in train_data]

train_dataset = BertDataset(train_texts, train_labels, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# Training setup
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss().to(device)

# Training loop
for epoch in range(3):
    model.train()
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = model(
            input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        
        loss = outputs.loss
        loss.backward()
        
        optimizer.step()
        optimizer.zero_grad()
    
    print(f"Epoch {epoch+1} completed")

# Save model
model.save_pretrained('./sentiment-bert')
tokenizer.save_pretrained('./sentiment-bert')
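
To use the saved classifier later, reload it from the same directory and run it on new text (a minimal sketch; the example sentence is illustrative):

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('./sentiment-bert')
model = BertForSequenceClassification.from_pretrained('./sentiment-bert').to(device)
model.eval()

inputs = tokenizer("A surprisingly moving film.", return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits
print("Positive" if logits.argmax(dim=1).item() == 1 else "Negative")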

Transformer Variants and Improvements

1. Transformer-XL

Handles longer sequences with segment-level recurrence:

from transformers import TransfoXLModel, TransfoXLTokenizer

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')

2. Longformer

Uses a sparse sliding-window attention pattern (plus a few globally attending tokens) to handle long documents:

from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained('allenai/longformer-base-4096')
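
Longformer restricts most tokens to local sliding-window attention and lets a few selected positions (such as the first token) attend globally; those positions are marked with a global_attention_mask. A minimal sketch:

import torch

inputs = tokenizer("A very long document ...", return_tensors="pt")

# 1 = global attention, 0 = local sliding-window attention
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # give the first (classification) token global attention

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)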

3. Reformer

Uses locality-sensitive hashing for efficient attention:

from transformers import ReformerModel, ReformerTokenizer

tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')
model = ReformerModel.from_pretrained('google/reformer-crime-and-punishment')

4. BigBird

Combines random, window and global attention:

from transformers import BigBirdModel, BigBirdTokenizer

tokenizer = BigBirdTokenizer.from_pretrained('google/bigbird-roberta-base')
model = BigBirdModel.from_pretrained('google/bigbird-roberta-base')

Building a Sentiment Analysis Model from Scratch

Let's build a complete sentiment analysis pipeline using the techniques we've learned.

Step 1: Data Preparation

We'll use the IMDB dataset with enhanced preprocessing:

import pandas as pd
from sklearn.model_selection import train_test_split
import re
import string
import torch
from torch.utils.data import Dataset, DataLoader

# Load data
train_data = pd.read_csv('imdb_train.csv')
test_data = pd.read_csv('imdb_test.csv')

# Enhanced text cleaning
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    
    # Remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply cleaning
train_data['review'] = train_data['review'].apply(clean_text)
test_data['review'] = test_data['review'].apply(clean_text)

# Split training data into train and validation
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_data['review'].values,
    train_data['sentiment'].values,
    test_size=0.1,
    random_state=42
)

Step 2: Advanced Tokenization with Subwords

Let's use BERT's WordPiece tokenizer for better handling of rare words:

from tokenizers import BertWordPieceTokenizer

# Initialize tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=True,
    lowercase=True
)

# Write the cleaned training reviews to a plain-text file (one review per line)
with open("imdb_train.txt", "w") as f:
    f.write("\n".join(train_data['review'].tolist()))

# Train tokenizer
tokenizer.train(
    files=["imdb_train.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=[
        "[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"
    ],
    limit_alphabet=1000,
    wordpieces_prefix="##"
)

# Save tokenizer (the output directory must already exist)
import os
os.makedirs("imdb_tokenizer", exist_ok=True)
tokenizer.save_model("imdb_tokenizer")

# Tokenization function
def encode_texts(texts, max_length=256):
    # Truncation and padding are configured on the tokenizer itself;
    # encode_batch does not accept max_length/padding arguments
    tokenizer.enable_truncation(max_length=max_length)
    tokenizer.enable_padding(
        length=max_length,
        pad_id=tokenizer.token_to_id("[PAD]"),
        pad_token="[PAD]"
    )
    
    encodings = tokenizer.encode_batch(list(texts), add_special_tokens=True)
    
    input_ids = torch.tensor([e.ids for e in encodings])
    attention_mask = torch.tensor([e.attention_mask for e in encodings])
    
    return input_ids, attention_mask

# Encode data
train_ids, train_mask = encode_texts(train_texts)
val_ids, val_mask = encode_texts(val_texts)
test_ids, test_mask = encode_texts(test_data['review'].values)

# Convert labels (assumes the sentiment column is already encoded as 0/1;
# map string labels such as 'positive'/'negative' to integers first if needed)
train_labels = torch.tensor(train_labels)
val_labels = torch.tensor(val_labels)
test_labels = torch.tensor(test_data['sentiment'].values)

Step 3: Custom Dataset and DataLoader

class SentimentDataset(Dataset):
    def __init__(self, input_ids, attention_mask, labels):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.labels = labels
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels': self.labels[idx]
        }

# Create datasets
train_dataset = SentimentDataset(train_ids, train_mask, train_labels)
val_dataset = SentimentDataset(val_ids, val_mask, val_labels)
test_dataset = SentimentDataset(test_ids, test_mask, test_labels)

# Create data loaders
train_loader = DataLoader(
    train_dataset,
    batch_size=16,
    shuffle=True,
    num_workers=4
)

val_loader = DataLoader(
    val_dataset,
    batch_size=16,
    shuffle=False,
    num_workers=4
)

test_loader = DataLoader(
    test_dataset,
    batch_size=16,
    shuffle=False,
    num_workers=4
)

Step 4: Building the Model with BERT

Let's fine-tune BERT for sentiment analysis:

from transformers import BertModel, BertConfig
import torch.nn as nn

class BERTSentimentClassifier(nn.Module):
    def __init__(self, num_classes=2, dropout=0.1):
        super(BERTSentimentClassifier, self).__init__()
        
        # Load pre-trained BERT encoder
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        
        # Freeze BERT layers (optional; unfreeze them for full fine-tuning and higher accuracy)
        for param in self.bert.parameters():
            param.requires_grad = False
        
        # Classification head
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)
    
    def forward(self, input_ids, attention_mask):
        # Get BERT outputs
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        
        # Use the [CLS] token representation
        pooled_output = outputs.pooler_output
        
        # Classification
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        
        return logits

# Initialize model
model = BERTSentimentClassifier(num_classes=2, dropout=0.1)
model = model.to(device)

# Optimizer with weight decay
# (transformers.AdamW is deprecated in recent versions; torch.optim.AdamW is the
# recommended replacement and does not take a correct_bias argument)
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = AdamW(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01
)

# Learning rate scheduler
scheduler = ReduceLROnPlateau(
    optimizer,
    mode='max',
    factor=0.5,
    patience=2,
    verbose=True
)

# Loss function
criterion = nn.CrossEntropyLoss()

Step 5: Advanced Training Techniques

Let's implement advanced training techniques for better performance:

from tqdm import tqdm
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report

def train_epoch(model, dataloader, optimizer, criterion, device, scheduler=None):
    model.train()
    total_loss = 0
    all_preds = []
    all_labels = []
    
    progress_bar = tqdm(dataloader, desc="Training")
    
    for batch in progress_bar:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        # Forward pass
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask)
        loss = criterion(outputs, labels)
        
        # Backward pass
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        # Update parameters
        optimizer.step()
        
        # Record metrics
        total_loss += loss.item()
        preds = torch.argmax(outputs, dim=1).cpu().numpy()
        all_preds.extend(preds)
        all_labels.extend(labels.cpu().numpy())
        
        # Update progress bar
        progress_bar.set_postfix(loss=loss.item())
    
    # Calculate epoch metrics
    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds)
    
    return avg_loss, accuracy, f1

def evaluate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0
    all_preds = []
    all_labels = []
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            # Forward pass
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)
            
            # Record metrics
            total_loss += loss.item()
            preds = torch.argmax(outputs, dim=1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(labels.cpu().numpy())
    
    # Calculate metrics
    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds)
    
    return avg_loss, accuracy, f1, all_preds, all_labels

# Training loop with early stopping
best_val_acc = 0
patience = 3
trigger_times = 0

for epoch in range(10):
    print(f"\nEpoch {epoch+1}/10")
    
    # Train
    train_loss, train_acc, train_f1 = train_epoch(
        model, train_loader, optimizer, criterion, device
    )
    
    # Evaluate
    val_loss, val_acc, val_f1, _, _ = evaluate(
        model, val_loader, criterion, device
    )
    
    # Update learning rate
    scheduler.step(val_acc)
    
    # Print metrics
    print(f"Train Loss: {train_loss:.4f} | Acc: {train_acc:.4f} | F1: {train_f1:.4f}")
    print(f"Val Loss: {val_loss:.4f} | Acc: {val_acc:.4f} | F1: {val_f1:.4f}")
    
    # Early stopping
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        trigger_times = 0
        # Save best model
        torch.save(model.state_dict(), 'best_bert_sentiment.pth')
    else:
        trigger_times += 1
        if trigger_times >= patience:
            print(f"Early stopping at epoch {epoch+1}")
            break

Step 6: Model Evaluation and Analysis

# Load best model
model.load_state_dict(torch.load('best_bert_sentiment.pth'))

# Final evaluation on test set
test_loss, test_acc, test_f1, test_preds, test_labels = evaluate(
    model, test_loader, criterion, device
)

print(f"\nTest Loss: {test_loss:.4f} | Accuracy: {test_acc:.4f} | F1: {test_f1:.4f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(test_labels, test_preds, target_names=['Negative', 'Positive']))

# Confusion matrix
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

cm = confusion_matrix(test_labels, test_preds)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

Step 7: Error Analysis

Let's analyze where the model makes mistakes:

def analyze_errors(model, dataloader, texts, labels, device):
    model.eval()
    error_indices = []
    
    with torch.no_grad():
        for i, batch in enumerate(dataloader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            batch_labels = batch['labels'].to(device)
            
            outputs = model(input_ids, attention_mask)
            preds = torch.argmax(outputs, dim=1)
            
            # Find indices of incorrect predictions in this batch
            incorrect = (preds != batch_labels).nonzero(as_tuple=True)[0].tolist()
                
            # Store indices of errors
            for idx in incorrect:
                global_idx = i * dataloader.batch_size + idx
                if global_idx < len(texts):
                    error_indices.append(global_idx)
    
    # Print some error examples
    print("\nError Analysis (first 5 examples):")
    for i in error_indices[:5]:
        print(f"\nText: {texts[i][:200]}...")
        # test_preds comes from the earlier evaluate() call (test_loader is not shuffled)
        print(f"True label: {labels[i].item()} | Predicted: {test_preds[i]}")
        print("-" * 50)
    
    return error_indices

# Perform error analysis
error_indices = analyze_errors(
    model, test_loader, test_data['review'].values, test_labels, device
)

Step 8: Model Interpretation with Attention Visualization

Let's visualize attention to understand what the model focuses on:

from IPython.display import display, HTML

# Note: this helper expects a Hugging Face tokenizer (e.g. transformers.BertTokenizer),
# which provides the __call__ and convert_ids_to_tokens methods used below.
def visualize_attention(model, text, tokenizer, device):
    model.eval()
    
    # Tokenize input
    inputs = tokenizer(
        text, 
        return_tensors="pt",
        max_length=128,
        padding="max_length",
        truncation=True
    ).to(device)
    
    # Get model outputs with attention
    with torch.no_grad():
        outputs = model.bert(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            output_attentions=True
        )
    
    # Get attention from last layer
    attentions = outputs.attentions[-1][0]  # [heads, seq_len, seq_len]
    
    # Average over heads
    avg_attention = torch.mean(attentions, dim=0).cpu().numpy()
    
    # Get tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].cpu().numpy())
    
    # Remove special tokens and compute a per-token attention score
    # (mean attention each token receives, averaged over all query positions)
    token_scores = avg_attention.mean(axis=0)  # [seq_len]
    
    display_tokens = []
    display_scores = []
    
    for i, token in enumerate(tokens):
        if token not in ["[CLS]", "[SEP]", "[PAD]"]:
            display_tokens.append(token)
            display_scores.append(token_scores[i])
    
    # Normalize scores to [0, 1] for coloring
    display_scores = np.array(display_scores)
    score_range = display_scores.max() - display_scores.min() + 1e-9
    display_scores = (display_scores - display_scores.min()) / score_range
    
    # Create HTML visualization
    html = """
    <p><strong>Input text:</strong> %s</p>
    <p><strong>Model attention visualization:</strong></p>
    """ % text
    
    for token, strength in zip(display_tokens, display_scores):
        color = f"rgba(100, 150, 255, {strength:.2f})"
        html += f'<span style="background-color: {color}; padding: 2px 4px; margin: 0 1px; border-radius: 3px;">{token}</span> '
    
    display(HTML(html))

# Example usage
sample_text = "This movie was absolutely fantastic! The acting was superb and the plot was engaging."
visualize_attention(model, sample_text, tokenizer, device)

This will display the text with each token highlighted according to how much attention the model paid to it, helping us understand which words influenced the prediction.

Step 9: Model Deployment

Let's create a function to make predictions on new text:

def predict_sentiment(text, model, tokenizer, device, max_length=128):
    # Expects a Hugging Face tokenizer (e.g. transformers.BertTokenizer)
    model.eval()
    
    # Clean and tokenize text
    cleaned_text = clean_text(text)
    inputs = tokenizer(
        cleaned_text,
        return_tensors="pt",
        max_length=max_length,
        padding="max_length",
        truncation=True
    ).to(device)
    
    # Get prediction
    with torch.no_grad():
        outputs = model(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask']
        )
        probs = torch.softmax(outputs, dim=1)
        pred = torch.argmax(probs, dim=1).item()
        confidence = probs[0][pred].item()
    
    # Map prediction to label
    label = "Positive" if pred == 1 else "Negative"
    
    return {
        "text": text,
        "cleaned_text": cleaned_text,
        "sentiment": label,
        "confidence": confidence,
        "probabilities": {
            "negative": probs[0][0].item(),
            "positive": probs[0][1].item()
        }
    }

# Example usage
result = predict_sentiment(
    "I really hated this movie. The plot was boring and the acting was terrible.",
    model,
    tokenizer,
    device
)

print(f"Text: {result['text']}")
print(f"Sentiment: {result['sentiment']} (confidence: {result['confidence']:.2f})")
print(f"Probabilities: Negative: {result['probabilities']['negative']:.2f}, Positive: {result['probabilities']['positive']:.2f}")

Step 10: Model Optimization for Production

For production deployment, we can optimize the model:

# 1. Convert to ONNX format
torch.onnx.export(
    model,
    (
        torch.randint(0, 30000, (1, 128)).to(device),
        torch.ones(1, 128).to(device)
    ),
    "sentiment_bert.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"}
    },
    opset_version=11
)
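
# (Sketch) Verify the export by running the ONNX model with ONNX Runtime.
# Assumes the onnxruntime package is installed; the inputs below are dummy values
# matching the dtypes used for the export above.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("sentiment_bert.onnx")
dummy_ids = np.random.randint(0, 30000, (1, 128)).astype(np.int64)
dummy_mask = np.ones((1, 128), dtype=np.float32)
onnx_logits = session.run(
    ["logits"],
    {"input_ids": dummy_ids, "attention_mask": dummy_mask}
)[0]
print(onnx_logits.shape)  # (1, 2)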

# 2. Quantize the model for faster CPU inference
# (dynamic quantization targets CPU execution, so quantize a CPU copy of the model;
#  LayerNorm has no dynamic-quantized counterpart, so only nn.Linear is listed)
import copy
from torch.quantization import quantize_dynamic

quantized_model = quantize_dynamic(
    copy.deepcopy(model).to('cpu'), {nn.Linear}, dtype=torch.qint8
)

# 3. Save quantized model
torch.save(quantized_model.state_dict(), 'quantized_bert_sentiment.pth')

# 4. Benchmark inference speed
import time

def benchmark(model, dataloader, device, num_batches=10):
    model.eval()
    start_time = time.time()
    
    with torch.no_grad():
        for i, batch in enumerate(dataloader):
            if i >= num_batches:
                break
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            _ = model(input_ids, attention_mask)
    
    elapsed = time.time() - start_time
    print(f"Processed {num_batches * dataloader.batch_size} samples in {elapsed:.2f} seconds")
    print(f"Average inference time: {elapsed / (num_batches * dataloader.batch_size) * 1000:.2f} ms/sample")

# Benchmark original and quantized models
print("Original model benchmark:")
benchmark(model, test_loader, device)

print("\nQuantized model benchmark:")
benchmark(quantized_model, test_loader, device)

Expected Results and Improvements

With this pipeline, you should achieve:

  • ~94-95% accuracy on the IMDB test set
  • Clear understanding of model decisions through attention visualization
  • Production-ready model with quantization for faster inference

To further improve performance:

  1. Use a larger pre-trained model:

    model = BertForSequenceClassification.from_pretrained('bert-large-uncased')
    
  2. Apply advanced fine-tuning techniques (see the sketch after this list):

    • Gradual unfreezing
    • Discriminative learning rates
    • Slanted triangular learning rates
  3. Ensemble multiple models:

    # Train multiple models with different seeds
    models = [BERTSentimentClassifier() for _ in range(5)]

    # Average predictions
    def ensemble_predict(text):
        probs = []
        for model in models:
            result = predict_sentiment(text, model, tokenizer, device)
            probs.append(result['probabilities'])
        avg_neg = np.mean([p['negative'] for p in probs])
        avg_pos = np.mean([p['positive'] for p in probs])
        return "Positive" if avg_pos > avg_neg else "Negative"

  4. Add adversarial training:

    # Generate adversarial examples during training
    def add_adversarial_noise(embeddings, epsilon=0.01):
        noise = torch.randn_like(embeddings) * epsilon
        return embeddings + noise
    
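A minimal sketch of point 2, assuming the BERTSentimentClassifier defined earlier (the layer indices follow bert-base-uncased and the unfreezing schedule is illustrative):

# Discriminative learning rates: smaller LR for the pre-trained encoder,
# larger LR for the freshly initialized classification head
optimizer = torch.optim.AdamW([
    {"params": model.bert.parameters(), "lr": 1e-5},
    {"params": model.classifier.parameters(), "lr": 1e-3},
], weight_decay=0.01)

# Gradual unfreezing: re-enable gradients for one more encoder layer each epoch,
# starting from the top (layer 11) and working downwards
def unfreeze_top_layer(model, epoch, num_layers=12):
    layer_idx = num_layers - 1 - epoch
    if layer_idx >= 0:
        for param in model.bert.encoder.layer[layer_idx].parameters():
            param.requires_grad = True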

Quiz 3: Test Your Understanding of NLP with PyTorch

1. What is the primary advantage of using subword tokenization over word-level tokenization?

A) It reduces the vocabulary size significantly
B) It handles out-of-vocabulary words more effectively
C) It preserves the original word order better
D) It requires less computational resources

2. In a standard LSTM, what is the purpose of the cell state (C_t)?

A) To serve as the output of the LSTM at each time step
B) To carry long-term information through the network
C) To determine which information to forget from the previous state
D) To normalize the input data before processing

3. What problem does the attention mechanism primarily solve in sequence-to-sequence models?

A) The vanishing gradient problem
B) The limitation of fixed-length context vectors for long sequences
C) The high computational cost of training RNNs
D) The difficulty of handling variable-length input sequences

4. In the Transformer architecture, what is the purpose of positional encoding?

A) To normalize the input embeddings
B) To provide information about the position of tokens in the sequence
C) To reduce the dimensionality of the input data
D) To create attention masks for the decoder

5. Which of the following is NOT a component of the Transformer's encoder layer?

A) Multi-head self-attention
B) Position-wise feed-forward network
C) Masked multi-head attention
D) Residual connections and layer normalization

6. When fine-tuning BERT for a text classification task, which token's representation is typically used for classification?

A) The [SEP] token
B) The last token of the sequence
C) The [CLS] token
D) The average of all token representations

7. What is the main difference between GRU and LSTM?

A) GRU has more gates than LSTM
B) GRU combines the cell state and hidden state into a single state
C) LSTM does not use gating mechanisms
D) GRU cannot handle long-term dependencies

8. In the context of NLP, what does "teacher forcing" refer to?

A) Using the model's own predictions as input during training
B) Using the ground truth output as input during training
C) Training the model with human-annotated data only
D) Forcing the model to use specific attention weights

9. Which technique would most directly address the problem of exploding gradients in RNN training?

A) Increasing the learning rate
B) Using a larger batch size
C) Gradient clipping
D) Adding more layers to the network

10. What is the key innovation of the Transformer architecture compared to previous sequence models?

A) The use of convolutional layers for text processing
B) Replacing recurrence with self-attention mechanisms
C) Implementing a novel word embedding technique
D) Using reinforcement learning for sequence generation

11. In BERT, what does the "MLM" (Masked Language Modeling) task involve?

A) Predicting the next sentence in a pair of sentences
B) Predicting randomly masked tokens in the input
C) Translating text from one language to another
D) Classifying the sentiment of a given text

12. What is the primary purpose of layer normalization in deep neural networks?

A) To reduce the number of parameters in the model
B) To make the training process more stable and faster
C) To prevent overfitting by adding noise to the inputs
D) To increase the representational capacity of the network

13. Which of the following best describes "beam search" in sequence generation?

A) A greedy approach that selects the highest probability token at each step
B) A technique that keeps multiple hypotheses during decoding
C) A method for training sequence models with reinforcement learning
D) An algorithm for optimizing the learning rate during training

14. What is the main advantage of using pre-trained language models like BERT for downstream NLP tasks?

A) They require less data for fine-tuning compared to training from scratch
B) They eliminate the need for task-specific architectures
C) They guarantee state-of-the-art performance on all NLP tasks
D) They are faster to train than traditional RNNs

15. In the context of attention mechanisms, what does "self-attention" refer to?

A) Attention between the encoder and decoder in a seq2seq model
B) Attention where the queries, keys, and values all come from the same source
C) Attention that focuses on the most important words in a sentence
D) A specialized attention mechanism for handling long sequences


Answers:

  1. B - Subword tokenization handles OOV words by breaking them into known subwords
  2. B - The cell state carries long-term information through the network
  3. B - Attention solves the limitation of fixed-length context vectors for long sequences
  4. B - Positional encoding provides information about token positions
  5. C - Masked attention is only in the decoder, not the encoder
  6. C - BERT uses the [CLS] token representation for classification
  7. B - GRU combines cell state and hidden state into a single state
  8. B - Teacher forcing uses ground truth as input during training
  9. C - Gradient clipping directly addresses exploding gradients
  10. B - Transformer replaces recurrence with self-attention
  11. B - MLM involves predicting randomly masked tokens
  12. B - Layer normalization stabilizes and speeds up training
  13. B - Beam search keeps multiple hypotheses during decoding
  14. A - Pre-trained models require less data for fine-tuning
  15. B - Self-attention uses same source for queries, keys, and values

Summary and What's Next in Part 4

In this comprehensive Part 3 of our PyTorch Masterclass, we've covered:

  • Text data processing: Cleaning, tokenization, and vocabulary building
  • Word embeddings: From one-hot to contextual representations
  • Recurrent Neural Networks: Theory and implementation of RNNs, LSTMs, and GRUs
  • Sequence-to-sequence models: Machine translation with attention
  • Attention mechanisms: The foundation of modern NLP
  • Transformer architecture: Complete implementation and understanding
  • Sentiment analysis: End-to-end pipeline from data to deployment

You now have the skills to:

  • Process and represent text for deep learning
  • Build and train RNNs, LSTMs, and GRUs for sequence tasks
  • Implement attention mechanisms for better context understanding
  • Work with Transformer models like BERT for state-of-the-art NLP
  • Create complete NLP applications from scratch

What's Coming in Part 4?

In Part 4, we'll dive into Generative Models with PyTorch:

  • Autoencoders: Understanding latent space representations
  • Variational Autoencoders (VAEs): Probabilistic generative modeling
  • Generative Adversarial Networks (GANs): Creating realistic synthetic data
  • Diffusion Models: The new frontier in generative AI
  • Text-to-Image Generation: Building models like DALL-E
  • Music and Audio Generation: Creating novel audio content
  • Evaluating Generative Models: Metrics and qualitative assessment

We'll build a complete image generation pipeline and explore the architectures behind cutting-edge generative AI systems.

👉 Stay tuned for Part 4: Generative Models with PyTorch


Hashtags: #PyTorch #NLP #RNN #LSTM #GRU #Transformers #Attention #NaturalLanguageProcessing #TextClassification #SentimentAnalysis #WordEmbeddings #DeepLearning #MachineLearning #AI #SequenceModeling #BERT #GPT #TextProcessing #PyTorchNLP #HuggingFace #TransformerArchitecture #SelfAttention #MachineTranslation #TextGeneration #NamedEntityRecognition #PartOfSpeechTagging #Tokenization #WordPiece #BytePairEncoding #GloVe #Word2Vec #FastText #Seq2Seq #EncoderDecoder #BeamSearch #TeacherForcing #TextSummarization #QuestionAnswering #TextToText #FineTuning #PretrainedModels #PyTorchTutorial #DeepLearningCourse #AIEngineering #NLPEngineering