Introduction to NLP using Hugging Face - HackMD


<style>
.icon{
    position: absolute;
    bottom: -20;
    left: 0;
    width: 400px
}
</style>

# Introduction to NLP with Hugging Face
>[name=醫學一蕭名凱][time=Mar 6, 2025]

<img class="icon" src="https://hackmd.io/_uploads/B1g7Bq_qJe.png">

----

# About me

<img src="https://hackmd.io/_uploads/HyACJlv5Je.jpg" alt="drawing" width="400"/>

----

> [name=臺中人蕭名凱]

- TCFSH TCIRC, vice president (39th term) https://tcirc.tw/
- GDSC Core Team Member 2024-now <img src="https://hackmd.io/_uploads/S1hAmxD91g.png" width=50> https://gdg.community.dev/gdg-on-campus-taipei-medical-university-taipei-taiwan/
- Developer/Maintainer of College TW <img src="https://college-tw.hsiaoeric.org/static/apply/images/college.tw_bg_removed2.png" width=50> https://college-tw.hsiaoeric.org/

---

Slides QR code:
![image](https://hackmd.io/_uploads/HyC-oxPikx.png)

---

# Introduction

[Slido](https://wall.sli.do/event/qEQwebvuSpz2uAXbfKAGeK?section=0566e76a-32d3-4e07-be77-7b20af7ca274)
![image](https://hackmd.io/_uploads/HJqrYBvqJl.png)

----

## TODOs:
- setting up the development environment: Python, VS code <img src="https://code.visualstudio.com/assets/favicon.ico" width=50> , Jupyter notebook<img src="https://hackmd.io/_uploads/S1hBk7D5Jx.png" width=70>
- learning about NLP and the transformer architecture
- hands-on with HF packages
- integration with Telegram (Bonus?)
---

## Setting up
<img src="https://hackmd.io/_uploads/ByHBMQv9Jx.png" width=100>

----

## Download Python
![image](https://hackmd.io/_uploads/SkQpGmP9Jx.png)
https://www.python.org/downloads/

----

## Check the install path on Windows: choose Customize installation
![image](https://hackmd.io/_uploads/Bk17Ymw9kg.png)
(so that `python` can be run from the command line)

----

## Next
![image](https://hackmd.io/_uploads/rJy_F7vckg.png)
(so that `python` can be run from the command line)

----

## Copy the install location first with `Ctrl+c`
![image](https://hackmd.io/_uploads/Hy7ot7Dq1e.png)
then click Install
(so that `python` can be run from the command line)

----

## Press `Windows`, then type "系統變數" (system variables)
![image](https://hackmd.io/_uploads/SkuM9QP9kg.png =60%x)
(so that `python` can be run from the command line)

----

## Edit the system variable `Path`
![image](https://hackmd.io/_uploads/ByKj5XD5ye.png =70%x)
(so that `python` can be run from the command line)

----

## Click New (新增), then `Ctrl+v`
![image](https://hackmd.io/_uploads/r1qZoXwcyl.png)
(so that `python` can be run from the command line)

----

## If it looks like the image below, it worked
![image](https://hackmd.io/_uploads/ByBiiQvcJe.png)
- Linux: `python3`
- Windows: `python`

---

## Required Python packages
`pip install "transformers[sentencepiece]"`
- transformers, datasets, evaluate ![image](https://hackmd.io/_uploads/HJqoZVDcJg.png =70x)
- tensorflow<2.11, tf_keras ![image](https://hackmd.io/_uploads/SklQfGVw5Jx.png =70x)
- torch, torchvision, torchaudio ![image](https://hackmd.io/_uploads/SJ2cG4w51l.png =70x)
- not included (bonus?): python-telegram-bot ![image](https://hackmd.io/_uploads/ryMSG4wqkl.png =70x)

---

## Setting up VS code
<img src="https://code.visualstudio.com/assets/favicon.ico" width=100> , <img src="https://hackmd.io/_uploads/S1hBk7D5Jx.png" width=100>

----

![image](https://hackmd.io/_uploads/BJ2lrxP9yx.png)
https://code.visualstudio.com/

----

## Install and open
![image](https://hackmd.io/_uploads/r1E5FlwqJx.png)
select folder -> create a __himom.ipynb__

----

## Select kernel
![Screenshot 2025-02-22 at 6.04.21 PM](https://hackmd.io/_uploads/ryCqW7D9yx.png)

----

## Install Jupyter kernel
![image](https://hackmd.io/_uploads/rJ_spGw51x.png)
ps. this takes a while

----

## Google Colab
![image](https://hackmd.io/_uploads/rJSkwePcyx.png)
https://colab.research.google.com/

---

## Useful hotkeys
- run focused cell: `⌘/Ctrl+Enter`
- for Colab
    - insert cell: `⌘/Ctrl+m` + `b`
    - delete cell (Colab): `⌘/Ctrl+m` + `d`
    - run CLI: `![command]`
- for VS code
    - good luck, you have to configure your own

---

## What is NLP

----

## Natural language processing (NLP)
- linguistics and machine learning
- everything about human language

----

## Common tasks
- Classifying whole sentences
- Classifying each word in a sentence
- Generating text content
- Extracting an answer from a text
- Generating a new sentence from an input text
- Beyond written text: speech recognition and computer vision

----

## Challenging?
- Computers don’t process information in the same way as humans
- Computers process information as __numbers__

---

## What can Transformers do

----

### Transformer, a deep learning architecture
- It is used to solve all kinds of NLP tasks
- Some of the companies and organizations using it:

![image](https://hackmd.io/_uploads/ry91Y4wcJe.png)

----

## The Transformers library provides the functionality to create and use those shared models

----

The most basic object: `pipeline()`
```
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("Hi, mom.")
```
ps. to run the cell: `Ctrl+Enter`
pps. for multiple inputs: use `[a, b, c, ...]`

----

output:
```
# single input:
[{'label': 'POSITIVE', 'score': 0.9598047137260437}]
# multiple inputs:
[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558095932007}]
```

----

The first run takes a while, though
![image](https://hackmd.io/_uploads/B1Rv-BPqyx.png)

----

- By default, `pipeline` selects a particular __pretrained model__ that has been __fine-tuned__ for __sentiment analysis__ in English.
- When you run `pipeline("sentiment-analysis")` for the first time, the model is downloaded and cached

----

## Other `pipeline("task")` options
- `feature-extraction` (get the vector representation of a text)
- `fill-mask`
- `ner` (named entity recognition)
- `question-answering`
- `sentiment-analysis`
- `summarization`
- `text-generation`
- `translation`
- `zero-shot-classification`

----

## Zero-shot classification
```
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
```

----

output:
```
{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445963859558105, 0.111976258456707, 0.043427448719739914]}
```

----

## Text generation
```
generator = pipeline("text-generation")
# prompt chosen to match the sample output on the next slide
generator("In this course, we will teach you how to")
```

----

output (generated text varies from run to run):
```
[{'generated_text': 'In this course, we will teach you how to understand and use '
                    'data flow and data interchange when handling user data. We '
                    'will be working with one or more of the most commonly used '
                    'data flows — data flows of various types, as seen by the '
                    'HTTP'}]
```
Other parameters: `num_return_sequences`, `max_length`

----

## etc...
https://huggingface.co/learn/nlp-course/chapter1/3

---

## Transformers

----

General architecture:
![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers_blocks-dark.svg)

----

+ Encoder:
    + receives an input and builds a representation of it (its features)
    + optimized to understand the input
+ Decoder:
    + uses the encoder’s representation (features) and other inputs to generate a target sequence
    + optimized for generating outputs

----

### Transformers categories
+ Auto-regressive = decoder-only: [GPT](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)-like
+ Auto-encoding = encoder-only: [BERT](https://arxiv.org/abs/1810.04805)-like
+ sequence-to-sequence: [BART](https://arxiv.org/abs/1910.13461)/[T5](https://arxiv.org/abs/1910.10683)-like

All of the examples above are LLMs.

----

LLM (large language model):
+ self-supervised:
    + objective automatically computed from the inputs
    + no need for manual labeling
+ large amounts of raw text:
    + developing a statistical understanding of the language
    + not very useful for specific __practical tasks__

----

An example of a task: __causal language modelling__
![image](https://hackmd.io/_uploads/S1cScST9kl.png)

----

Another one: __masked language modelling__
![image](https://hackmd.io/_uploads/rki4M6Aq1x.png)

----

__Large__ LM:
![image](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/model_parameters.png)

----

Pretraining: training a model from scratch, so the __weights are randomly initialized__
![image](https://hackmd.io/_uploads/HkOzE609yl.png)

----

Environmental cost:
![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/carbon_footprint-dark.svg)

----

### Transfer learning (fine-tuning)
based on a pretrained model

----

To perform fine-tuning:
1. acquire a pretrained language model
2. train with a dataset specific to your task: possibly __supervised training__

----

Original Transformer architecture:
![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers-dark.svg)

----

Recommended: StatQuest!
{%youtube zxQyTK8quyY %}

----

+ Attention layers:
    + pay specific attention to certain words
    + more or less ignore the others

The paper that introduced the Transformer: [Attention is All You Need](https://arxiv.org/abs/1706.03762) (published in 2017 by researchers at Google)

----

### Bias and limitations
```
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
```

----

output:
```
['lawyer', 'carpenter', 'doctor', 'waiter', 'mechanic']
['nurse', 'waitress', 'teacher', 'maid', 'prostitute']
```
> Beware: these tools can generate sexist, racist, or homophobic content

---

### In terms of a model
+ <ins>Architecture</ins>: definition of layers and operations
+ <ins>Checkpoints</ins>: weights loaded given the architecture
+ <ins>Model</ins>: umbrella term that can mean both

For example:
+ GPT-2 is an architecture from OpenAI
+ `sst-gpt2` is a checkpoint - a set of weights trained by someone with the Stanford Sentiment Treebank dataset

---

## Behind the pipeline
![image](https://hackmd.io/_uploads/rkxHrlyiJg.png)

----

### Preprocessing: Tokenizer
```
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "Hi, mom!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
```

----

output:
```
{
    'input_ids': tensor([
        [ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
        [ 101, 7632, 1010, 3566,  999,  102,    0,    0,     0,     0,    0,    0,    0,    0,    0,   0]
    ]),
    'attention_mask': tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    ])
}
```

----

### Going through the model
```
from transformers import AutoModel

model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```
output:
```
torch.Size([2, 16, 768])
```

----

### A high-dimensional vector?
+ Batch size: The number of sequences processed at a time (2 in our example).
+ Sequence length: The length of the numerical representation of the sequence (16 in our example).
+ Hidden size: The vector dimension of each model input (768 in our example).

----

### Model heads: making sense out of numbers
```
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)
```
ex: AutoModel`ForSequenceClassification`

output:
```
torch.Size([2, 2])
```

----

### logits
```
print(outputs.logits)
```
output:
```
tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)
```

----

### Softmax function
```
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
```
![image](https://hackmd.io/_uploads/SJZ7uIkjyg.png =400x)
```
tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)
```

----

### Softmax function with temperature
![image](https://hackmd.io/_uploads/rkNDOxxs1e.png)

----

```
model.config.id2label
```
output:
```
{0: 'NEGATIVE', 1: 'POSITIVE'}
```

----

![image](https://hackmd.io/_uploads/Sk3jVLyi1x.png)

---

## Tokenizer
1. tokenize
2. convert to ids
3. pad, truncate, produce attention masks

----

1. Tokenize
```
# The outputs on these slides come from the cased BERT tokenizer,
# not the uncased sentiment checkpoint loaded earlier
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)
```
output:
```
['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
```

----

2. From tokens to input IDs
```
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
```
output:
```
[7993, 170, 11303, 1200, 2443, 1110, 3014]
```

----

2. -> 1. Decoding
```
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)
```
output:
```
'Using a Transformer network is simple'
```

----

3. Handling multiple sequences
```
# Back to the tokenizer that matches the sentiment model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)
```
output:
```
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
```

----

### Why?
Because Transformers models expect multiple sentences (a batch) by default
+ the `tokenizer` didn’t just
    + convert the list of input IDs into a tensor
    + but also __added a dimension__ on top of it

----

### Hence
```
input_ids = torch.tensor([ids])  # the extra "[]" adds the batch dimension
print("Logits:", model(input_ids).logits)
```
output:
```
Logits: [[-2.7276, 2.8789]]
```

----

### Batching
+ the act of sending multiple sentences through the model
+ maximizing the utilization of computational resources like GPUs

> but with issues

----

```
batched_ids = [
    [200, 200, 200],
    [200, 200]
]
# This will not work: the rows have different lengths
outputs = model(torch.tensor(batched_ids))
```
They need to be of rectangular shape, like this:
```
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]
```

----

```
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
```
output:
```
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
```

---

## Fine-tuning (preprocessing part)

----

Same as before:
```
import torch
from torch.optim import AdamW  # the AdamW in transformers is deprecated
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "Hi, mom!",
]
```

----

Preprocessing and training:
```
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()
```

----

Loading a dataset from the Hub:
```
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
```

----

```
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
```

----

```
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]
```
output:
```
{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}
```

----

```
raw_train_dataset.features
```
output:
```
{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'idx': Value(dtype='int32', id=None)}
```

----

### Preprocessing a dataset
```python=
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# This will not work: each column is tokenized on its own,
# so the model never sees the two sentences as a pair
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])
```

----

### Handling two sentences as a pair:
```python=
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs
```
```python=
{
  'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102],
  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
```
`token_type_ids`?

----

Decoding to figure it out:
```
tokenizer.convert_ids_to_tokens(inputs["input_ids"])
```
```python=
['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
# align with token type ids
[      0,      0,    0,     0,       0,          0,   0,       0,      1,    1,     1,        1,     1,   1,       1]
```

----

```python=
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)
```
Only works if you have enough RAM to store your whole dataset

----

`Dataset.map()`
```python=
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
```
```python=
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets
```

----

The Datasets library applies this processing by adding new fields:
`input_ids`, `attention_mask`, `token_type_ids`

output:
```
DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 408
    })
    test: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 1725
    })
})
```

----

### Dynamic padding
+ collate function: puts together the samples inside a batch

----

```python=
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]
```
output:
```
[50, 59, 47, 67, 59, 50, 62, 32]
```
The lengths vary from 32 to 67

----

Use `DataCollatorWithPadding`:
```
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}
```
output:
```
{'attention_mask': torch.Size([8, 67]),
 'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'labels': torch.Size([8])}
```

---

## Fine-tuning (training part)

----

### Trainer API
for training and evaluation

----

`TrainingArguments` contains all the hyperparameters for `Trainer`
```
from transformers import TrainingArguments

# For now, just provide the directory path to save the model
training_args = TrainingArguments("test-trainer")
```

----

Prepare the pretrained model:
```
from transformers import AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```
ps. a warning here is expected, because:
+ BERT has not been pretrained on classifying pairs of sentences
+ a new head suitable for sequence classification has been added

----

```python=
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
```

----

### No percentage?
So far the training output includes nothing telling you how well (or badly) your model is performing

----

## Evaluation
```
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
```
```
(408, 2) (408,)
```

----

```
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
```
```
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
```
```
{'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542}
```

----

Wrapping it together: `compute_metrics()`
```python=
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```

----

```python=
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```

----

```
{'eval_loss': 0.5678262114524841, 'eval_accuracy': 0.7107843137254902, 'eval_f1': 0.8254437869822485, 'eval_runtime': 4.4943, 'eval_samples_per_second': 90.782, 'eval_steps_per_second': 11.348, 'epoch': 1.0}
{'loss': 0.6119, 'grad_norm': 2.493729591369629, 'learning_rate': 3.184458968772695e-05, 'epoch': 1.09}
{'eval_loss': 0.4424363076686859, 'eval_accuracy': 0.8357843137254902, 'eval_f1': 0.8818342151675485, 'eval_runtime': 5.1014, 'eval_samples_per_second': 79.977, 'eval_steps_per_second': 9.997, 'epoch': 2.0}
{'loss': 0.5049, 'grad_norm': 5.123687267303467, 'learning_rate': 1.3689179375453886e-05, 'epoch': 2.18}
...
```

---

## Deploying on Telegram (bonus?)
----

## BotFather
![image](https://hackmd.io/_uploads/BkDckIWjJx.png)
![image](https://hackmd.io/_uploads/SyNxgU-okl.png =500x)

----

Import the required packages
```python=
import logging
from telegram import Update
from telegram.ext import ApplicationBuilder, ContextTypes, CommandHandler
```

----

Setting up the logging module:
```python=
logging.basicConfig(
    # log output format
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    # minimum logging level
    level=logging.INFO
)
```

----

### 'TOKEN'
![image](https://hackmd.io/_uploads/r1X7kUWsJl.png =500x)
```
application = ApplicationBuilder().token('TOKEN').build()
```

----

Define what to do when a specific command arrives:
`/start` (typed by the user)
```python=
async def start(update: Update, context: ContextTypes.DEFAULT_TYPE):
    await context.bot.send_message(
        chat_id=update.effective_chat.id,
        text="I'm a bot, please talk to me!")
```

----

Putting it together:
```python=
if __name__ == '__main__':
    application = ApplicationBuilder().token('TOKEN').build()

    # register the start() defined earlier with the application
    start_handler = CommandHandler('start', start)
    application.add_handler(start_handler)

    # runs the bot until you hit CTRL+C
    application.run_polling()
```

----

For regular messages:
```python=
async def echo(update: Update, context: ContextTypes.DEFAULT_TYPE):
    await context.bot.send_message(
        chat_id=update.effective_chat.id,
        text=update.message.text)
```

----

Putting it together:
```python=
# needed in addition to the earlier imports
from telegram.ext import MessageHandler, filters

if __name__ == '__main__':
    ...
    echo_handler = MessageHandler(filters.TEXT & (~filters.COMMAND), echo)

    application.add_handler(start_handler)
    application.add_handler(echo_handler)

    application.run_polling()
```

---

## Reference
- Hugging Face (HF) NLP course: https://huggingface.co/learn/nlp-course/chapter1/1
- HF docs: https://huggingface.co/docs/transformers/en/pad_truncation
- ChatGPT: https://chatgpt.com/

---

Feedback form (please, pretty please):
![image](https://hackmd.io/_uploads/r15y5hHske.png)
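
---

## Bonus sketch: replying with the classifier

The echo bot and the `pipeline()` from the earlier slides can be combined so that the bot answers each message with its sentiment. The following is only a minimal sketch under a few assumptions: the helper names `format_prediction`, `make_reply`, and `build_application` are made up for illustration, the bot uses the async python-telegram-bot API shown in the Telegram slides, and the classifier is the default `sentiment-analysis` pipeline.

```python
def format_prediction(prediction):
    """Turn one pipeline result, e.g. {'label': 'POSITIVE', 'score': 0.96},
    into a short reply string."""
    return f"{prediction['label']} ({prediction['score']:.2%})"


def make_reply(text, classifier):
    """Run any pipeline-like callable on the text and format its first result."""
    return format_prediction(classifier(text)[0])


def build_application(token):
    """Hypothetical helper: build a bot that replies with the sentiment
    of every regular (non-command) text message."""
    # Imported here so the two helpers above can be tried
    # without python-telegram-bot or transformers installed.
    from telegram import Update
    from telegram.ext import ApplicationBuilder, ContextTypes, MessageHandler, filters
    from transformers import pipeline

    # Loaded once at startup; the first run downloads the model
    classifier = pipeline("sentiment-analysis")

    async def classify(update: Update, context: ContextTypes.DEFAULT_TYPE):
        await context.bot.send_message(
            chat_id=update.effective_chat.id,
            text=make_reply(update.message.text, classifier),
        )

    application = ApplicationBuilder().token(token).build()
    application.add_handler(MessageHandler(filters.TEXT & (~filters.COMMAND), classify))
    return application
```

To try it, call `build_application('TOKEN').run_polling()` with the token from BotFather. Note that the model call inside `classify` is synchronous, so the bot cannot handle other messages while a prediction is running; that is acceptable for a demo but worth revisiting for anything busier.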
