# Hugging Face

###### tags: `CodeStyle` `OpenSource`

[Hugging Face](https://en.wikipedia.org/wiki/Hugging_Face) is a model & code sharing platform for machine learning and related fields. Its aim is to provide convenient, ready-to-use pretrained models for downstream use and fine-tuning. Its main library, [Transformers](https://huggingface.co/docs/transformers/index), contains a large number of pretrained models. The tutorial is organized around six topics: pipeline for inference, load pretrained instances with an AutoClass, preprocess, fine-tune a pretrained model, distributed training with 🤗 Accelerate, and share a model. These notes focus on the first four.

## Pipelines for inference ([Link](https://huggingface.co/docs/transformers/pipeline_tutorial))

### Pipeline

[pipeline()](https://huggingface.co/docs/transformers/pipeline_tutorial) provides a very concise inference interface. For example, speech recognition with openai/whisper-large takes only three lines of code:

```python
from transformers import pipeline

generator = pipeline(model="openai/whisper-large")
generator("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}
```

The accepted input types are also extended, e.g. datasets or inputs fetched from URLs.

### Parameters

pipeline() accepts arguments, including the various parameters each model needs; consult the model's documentation when using them. Commonly used parameters include device, batch_size (default 1), and task-specific parameters.

### MultiModal Pipeline

Besides covering many models, the wrapping of different modalities is also consistent.

- Image

```python
from transformers import pipeline

vision_classifier = pipeline(model="google/vit-base-patch16-224")
preds = vision_classifier(
    images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
[{'score': 0.4335, 'label': 'lynx, catamount'},
 {'score': 0.0348, 'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'},
 {'score': 0.0324, 'label': 'snow leopard, ounce, Panthera uncia'},
 {'score': 0.0239, 'label': 'Egyptian cat'},
 {'score': 0.0229, 'label': 'tiger cat'}]
```

- Text

```python
from transformers import pipeline

# This model is a `zero-shot-classification` model.
# It will classify text, except you are free to choose any label you might imagine
classifier = pipeline(model="facebook/bart-large-mnli")
classifier(
    "I have a problem with my iphone that needs to be resolved asap!!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
)
{'sequence': 'I have a problem with my iphone that needs to be resolved asap!!',
 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'],
 'scores': [0.504, 0.479, 0.013, 0.003, 0.002]}
```

- VQA

```python
from transformers import pipeline

vqa = pipeline(model="impira/layoutlm-document-qa")
vqa(
    image="https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png",
    question="What is the invoice number?",
)
[{'score': 0.42514941096305847, 'answer': 'us-001', 'start': 16, 'end': 16}]
```

## Load pretrained instances with an AutoClass ([Link](https://huggingface.co/docs/transformers/autoclass_tutorial))

Part of the Transformers core philosophy of making the library easy, simple, and flexible to use is that an AutoClass **automatically infers and loads the correct architecture** from a given checkpoint. The `from_pretrained()` method lets you quickly load a pretrained model for any architecture so you don't have to devote time and resources to train a model from scratch. Producing this type of checkpoint-agnostic code means that if your code works for one checkpoint, it will work with another checkpoint - as long as it was trained for a similar task - even if the architecture is different.
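A minimal sketch of this checkpoint-agnostic idea: the checkpoint names below are only illustrative examples of sequence-classification models with different underlying architectures, but the loading code is identical for both.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def load_classifier(checkpoint: str):
    # The same code path works regardless of the underlying architecture,
    # as long as the checkpoint was trained for a sequence-classification task.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    return tokenizer, model

# Example checkpoints: a DistilBERT model and a RoBERTa model, loaded with the same function.
tokenizer, model = load_classifier("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer, model = load_classifier("roberta-large-mnli")
```

The commonly used AutoClasses are listed below.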
- AutoTokenizer

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```

- AutoImageProcessor

```python
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
```

- AutoFeatureExtractor

```python
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(
    "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
)
```

- AutoProcessor

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased")
```

- AutoModel

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
```

The same pretrained checkpoint can be reused for a different task:

```python
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased")
```

## Preprocess ([Link](https://huggingface.co/docs/transformers/preprocessing))

The handling of inputs and outputs is wrapped as well.

### Text

Use a *Tokenizer* to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.

- **Tokenization**

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

tokenizer.decode(encoded_input["input_ids"])
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'
```
- **Pad**

```python
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)
print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

- **Truncation**

```python=
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

- **Build Tensors**

Set the `return_tensors` parameter to either `pt` for PyTorch or `tf` for TensorFlow:

```python=
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_input)
{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                      [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                      [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
```

### Speech and audio

Use a *Feature extractor* to extract sequential features from audio waveforms and convert them into tensors.

### Image

Use an *ImageProcessor* to convert image inputs into tensors.

### Multimodal

Use a *Processor* to combine a tokenizer and a feature extractor or image processor for multimodal inputs.
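As a minimal sketch of how these non-text preprocessors are used (reusing the ViT checkpoint and example image URL from the pipeline section above, and assuming Pillow and requests are available for loading the image), an image processor turns a raw image into model-ready `pixel_values`:

```python
import requests
from PIL import Image
from transformers import AutoImageProcessor

# load an example image; any RGB image works here
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
inputs = image_processor(image, return_tensors="pt")

# the processor resizes and normalizes the image and returns a batch of pixel values,
# e.g. a tensor of shape [1, 3, 224, 224] for this checkpoint
print(inputs["pixel_values"].shape)
```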
## Fine-tune a pretrained model ([Link](https://huggingface.co/docs/transformers/training#finetune-a-pretrained-model))

### Prepare a dataset

```python
from datasets import load_dataset
from transformers import AutoTokenizer

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# drop the raw text column, rename "label" to "labels" (the name the model expects),
# and set the format to PyTorch tensors so the data can go straight into a DataLoader
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
```

Generate a smaller subset for debugging:

```python
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
```

### Train in native PyTorch

- Prepare

```python=
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification

# load the dataset and the model
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

# create an optimizer
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

# learning rate scheduler
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

# specify device
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
```

- Training loop

```python=
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
```

- Evaluation

The 🤗 Evaluate library provides a simple accuracy metric that you can load with `evaluate.load()` (see the 🤗 Evaluate quick tour for more information):

```python
import evaluate

metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
```
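A natural last step, sketched here under the assumption that the fine-tuned weights should go to a local directory (the path below is illustrative), is to save the model and tokenizer so they can be reloaded later with the same `from_pretrained()` API:

```python
# save the fine-tuned model and tokenizer to a local directory (path is illustrative)
output_dir = "./bert-yelp-finetuned"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# reload later through the same AutoClass interface
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(output_dir)
```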