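# Huggingface Quick Notes

[TOC]

Introduction
===
The [Huggingface website](https://huggingface.co/) hosts a large collection of NLP models, and the code needed to use them is kept very compact, which makes it easy for beginners to start writing NLP applications. These notes follow the tutorial in [this video](https://www.youtube.com/watch?v=QEaBAZQCtwE), combined with my own rough understanding so far.

---

Installing the packages
===
The core of this NLP toolkit is the transformers package.
```
pip install transformers
```
Installing it on its own did not seem to be enough in my case; you **may** need the extra dependencies pulled in by the command below. Whether you do depends on whether an error shows up at run time OAO.
```
pip install transformers[sentencepiece]
```
Beyond that, you may need a deep learning backend to go with transformers. Which one depends on whether you prefer pytorch or tensorflow (note that the pip package for pytorch is named `torch`).
```
pip install torch
pip install tensorflow
```
If you are new to machine learning and do not know where to find datasets to practice on, the datasets package gives you access to a large number of them.
```
pip install datasets
```

---

Using transformers
===
Simple usage: pipeline
---
If all you want is to write a natural language application as quickly as possible, fewer than 10 lines of code will do!!! Take the sentiment analysis below as an example:
```python=
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
result = classifier('Today is a wonderful day.')
print(result)

# output: [{'label': 'POSITIVE', 'score': 0.9998874664306641}]
```
Yes, just 4 lines of code complete a sentiment analysis task. Isn't that friendly XD. A quick walkthrough:
1. Line 3 specifies what kind of work your pipeline should do. Here I pass **sentiment-analysis**.
2. Line 4 passes the sentence you want to analyze to the classifier, in this case **Today is a wonderful day.** The tokenizer and model built into the pipeline then analyze the sentence and return its sentiment and a score.
3. The result printed by line 5 is shown in the comment on line 7: a list containing a dictionary. The dictionary has two entries:
    * label: whether the sentence expresses positive or negative sentiment
    * score: how confident the model is in that label

You are not limited to a single sentence, of course. Wrap the sentences you want to analyze in a list, pass it to the classifier, and they are all handled in one call.
```python=
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
result = classifier(['It is my pleasure to meet you.', 'Due to my homework, I feel pressure.'])
print(result)

# output: [{'label': 'POSITIVE', 'score': 0.9998533725738525}, {'label': 'NEGATIVE', 'score': 0.9901190996170044}]
```
This gives two outputs, one per sentence.

---

Task categories
---
These are the tasks currently listed on the website:

| Category | Tasks |
| :--------: | :--------: |
| Computer vision | image classification, image segmentation, zero-shot image classification, image-to-image, unconditional image generation, object detection, video classification |
| Natural language processing | translation, fill-mask, sentence similarity, question answering, summarization, zero-shot classification, text classification, text2text generation, text generation, conversational, table question answering |
| Audio | automatic speech recognition, audio classification, text-to-speech, audio-to-audio, voice activity detection |
| Multimodal | feature extraction, text-to-image, visual question answering, image-to-text, document question answering |
| Tabular | tabular classification, tabular regression |
| Reinforcement learning | reinforcement learning, robotics |

---

Changing the model and tokenizer
---
Once you are comfortable with the package, you can start looking through the [website](https://huggingface.co/) for the model and tokenizer you want. Many of the models hosted there are implementations of recent papers. When you give pipeline a task it picks a default model for you (which is good enough in most cases OAO), but you can still choose a model that fits your own needs. Taking sentiment analysis as the example again:
```python=
from transformers import pipeline

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
classifier = pipeline('sentiment-analysis', model=model_name, tokenizer=model_name)
result = classifier('Somehow I got frustrated if I can\'t make my goal in time.')
print(result)

# output: [{'label': 'NEGATIVE', 'score': 0.9994012117385864}]
```
Line 3 is the name of the model you want to use. Line 4 passes that name to pipeline as both `model` and `tokenizer`, which overrides the default and makes the pipeline load the matching model and tokenizer for you. You can also load them yourself with **AutoTokenizer** and **AutoModelForSequenceClassification** and pass the resulting objects instead. Finally, the result is printed.

---

Building your own dataset
---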
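Huggingface datasets have the type Dataset, which comes from the datasets library. Underneath, that library stores its data using types from the pyarrow library. So we can take data in a **familiar Python type**, turn it into a **Table** with pyarrow, and then convert that into a **Dataset**.
```python=
import pyarrow as pa
from datasets import Dataset

# data = [{'feature':data, ..., 'labels': labels}]
table = pa.Table.from_pylist(data)   # list of dicts -> Arrow table
dataset = Dataset(table)             # Arrow table -> Dataset
dataset = dataset.train_test_split(train_size=0.8)   # 80% train, 20% test
```
`data` is a list of dictionaries, each holding the features and the label of one example (for details, take a closer look at [the structure of a dataset](https://huggingface.co/learn/nlp-course/chapter5/2?fw=pt)). In addition, Dataset has a train_test_split method, similar to the sklearn function of the same name, for separating the training set from the test set.

## **Update 2023/5/12**
I recently learned about the `load_dataset` function. Loading data with it means you no longer have to start from the low-level types as above OAO.
```python=
from datasets import load_dataset

# Raw string, so the backslashes in the Windows path are not treated as escapes
PATH = r'C:\path\to\Alzheimer_Dataset'
dataset = load_dataset('imagefolder', data_dir=PATH)
```

## **Update 2024/2/02**
The huggingface `datasets` library offers constructors for several input formats, so you can pick the one you need. See the [api](https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/main_classes#datasets.Dataset.from_pandas) for details.
```python=
from datasets import Dataset

# df is an existing pandas DataFrame
data = Dataset.from_pandas(df)
```

---

Fine-tuning
---
The two basic classes involved are the following (using seq2seq as the example):
```python=
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
```
The basic idea is:
1. TrainingArguments is used to set the training parameters
2. Trainer takes the model, the tokenizer, and the TrainingArguments above
3. Calling trainer.train() runs the training, and that is all

In use it looks like this:
```python=
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
# LEARNING_RATE, TRAIN_BATCH_SIZE, EVAL_BATCH_SIZE and EPOCH are hyperparameters you define yourself
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="./result",
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=EPOCH,
    fp16=True,  # mixed precision; needs a CUDA GPU
)
# tokenized_dataset is your own tokenized dataset with 'train' and 'test' splits
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    data_collator=data_collator
)

trainer.train()  # training is done once this returns
```
Notice that on lines 22 and 23 I pass in my own dataset for training, rather than relying only on what from_pretrained provides. If you have your own dataset and want to train on it, I recommend this approach.

---

Using the fine-tuned model
---
Continuing the example above, to put the trained model and tokenizer to use you only need to hand them to pipeline.

**Note that the trained model sits on cuda, so using it directly raises an error about the model and the input data being on different devices. After training finishes, remember to move the model back to the cpu. See [here](https://stackoverflow.com/questions/74497166/huggingface-expected-all-tensors-to-be-on-the-same-device-but-found-at-least-t/74506560#74506560) for details.**
```python=
from transformers import pipeline

model = model.to('cpu')
# Pass the fine-tuned model and tokenizer objects themselves, not a model name
pipe = pipeline('summarization', model=model, tokenizer=tokenizer)
print(pipe(...))
```

---

Small Project: Nonsense Generator
===
One day I realized that you can use text-generation to produce a passage of English from a single sentence, then use translation to turn it into Chinese, and the result almost looks like the real thing. The catch is that text-generation cannot produce much text before the rest of the passage starts going in circles and stops making sense, and for translation I only found a model for Simplified Chinese. Still, I hope this piece of code gives beginners some ideas and helps you out (°∀°).
```python=
from transformers import pipeline

text = 'How to make pasta?'
textGeneration = pipeline('text-generation')
translation = pipeline('translation', model='Helsinki-NLP/opus-mt-en-zh')
output = textGeneration(text, min_length=100, max_length=300)[0]['generated_text']
translate = translation(output, max_length=500)[0]['translation_text']
print(translate)
```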
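
One way to soften the length problem mentioned above is to keep the generation short and translate it sentence by sentence. The sketch below is only a rough illustration under a few assumptions of mine, not part of the original project: the helper names `split_sentences` and `generate_and_translate` are hypothetical, the sentence splitter is a naive regular expression, and `generator` / `translator` are assumed to be callables that return results in the same shape as the `text-generation` and `translation` pipelines. `max_new_tokens` and `no_repeat_ngram_size` are standard generation arguments that the text-generation pipeline forwards to the model.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]


def generate_and_translate(prompt, generator, translator, max_new_tokens=60):
    """Generate a short continuation of `prompt`, then translate it sentence by sentence.

    `generator` and `translator` are assumed to behave like the
    'text-generation' and 'translation' pipelines from transformers.
    """
    generated = generator(
        prompt,
        max_new_tokens=max_new_tokens,  # cap the length so the text has less room to drift
        no_repeat_ngram_size=3,         # forbid repeating any 3-gram, which reduces loops
    )[0]['generated_text']

    sentences = split_sentences(generated)
    if not sentences:
        return ''

    # Pipelines accept a list of strings and return one result dict per input.
    translated = translator(sentences)
    return ''.join(item['translation_text'] for item in translated)
```

With the real pipelines from the project above, the call would look like `generate_and_translate('How to make pasta?', pipeline('text-generation'), pipeline('translation', model='Helsinki-NLP/opus-mt-en-zh'))`. The translated sentences are joined without spaces because the target language is Chinese.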