# Assignment 3
## Machine Configuration:
* Computer: MacBook Pro
* Memory: 16GB
* Chip: Apple M1 Pro
## Model Download:
* Downloaded the Meta Llama 2 model via the Hugging Face CLI, using the access token obtained after requesting approval.
* Chose the 13B-parameter model, which needs 26GB of space and is saved in the .cache folder under the user's home directory.
```bash
pip install huggingface_hub
```
Then download the model with the CLI. The 70B-parameter model needs 160GB of space, so the 13B-parameter model, which needs 26GB, is downloaded here.
```bash
huggingface-cli download --token hf_*** --resume-download meta-llama/Llama-2-13b
```
By default the model is saved in the .cache folder under the user's home directory.
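Incidentally, the same download can also be done from Python through huggingface_hub's snapshot_download. A minimal sketch mirroring the CLI command above (the token is a placeholder):
```python
# Minimal sketch: the same download via the huggingface_hub Python API.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-2-13b",
    token="hf_***",          # gated repo: needs an approved access token (placeholder)
    resume_download=True,    # equivalent to --resume-download
)
```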
## Trying to Run AirLLM on CPU:
* Used the AirLLM framework for model inference, with the 13B-parameter model selected.
* Running on the CPU produced some warnings and notices, including that the BetterTransformer implementation does not support padding during training, plus a warning from tokenizers.
* Recorded the inference and loading times, and tried different max_new_tokens values.
Take the official inference example and modify it to run on the CPU (set device="cpu" and dtype=torch.float32, and remove the .cuda() call after input_tokens).
```python
import torch
from airllm import AirLLMLlama2

MAX_LENGTH = 128

# Load the model from the local Hugging Face cache; the official example
# targets CUDA, so device and dtype are changed here to run on the CPU.
model = AirLLMLlama2(
    "/Users/xxx/.cache/huggingface/hub/models--garage-bAInd--Platypus2-7B/snapshots/c27aff7201e611f301c0e19f351cbe74b1a9f1f1",
    device="cpu",
    dtype=torch.float32,
    profiling_mode=True)

while True:
    input_text = input("Input: ")  # e.g. 'What is the capital of United States?'
    input_tokens = model.tokenizer(input_text,
                                   return_tensors="pt",
                                   return_attention_mask=False,
                                   truncation=True,
                                   max_length=MAX_LENGTH,
                                   padding=False)
    # No .cuda() here: the input ids stay on the CPU.
    generation_output = model.generate(
        input_tokens['input_ids'],
        max_new_tokens=20,
        use_cache=True,
        return_dict_in_generate=True)
    output = model.tokenizer.decode(generation_output.sequences[0])
    print(output)
```
Running it produced the output below. The BetterTransformer warning was printed again before every decoding step; the repeats are omitted here:
```
Input: What is the capital of United States?
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
cpu: 100%|██████████| 35/35 [00:16<00:00, 2.08it/s]
total disk+gpu loading time: 17.6052
total infer time(including all above plus gpu compute): 24.6655
cpu: 100%|██████████| 35/35 [00:19<00:00, 1.78it/s]
total disk+gpu loading time: 17.7518
total infer time(including all above plus gpu compute): 22.8738
cpu: 100%|██████████| 35/35 [00:16<00:00, 2.19it/s]
total disk+gpu loading time: 16.6516
total infer time(including all above plus gpu compute): 21.6726
cpu: 100%|██████████| 35/35 [00:16<00:00, 2.13it/s]
total disk+gpu loading time: 17.4125
total infer time(including all above plus gpu compute): 22.5351
cpu: 100%|██████████| 35/35 [00:17<00:00, 2.05it/s]
total disk+gpu loading time: 17.9168
total infer time(including all above plus gpu compute): 23.1345
cpu: 100%|██████████| 35/35 [00:16<00:00, 2.12it/s]
total disk+gpu loading time: 17.2870
total infer time(including all above plus gpu compute): 22.4166
cpu: 100%|██████████| 35/35 [00:16<00:00, 2.15it/s]
total disk+gpu loading time: 17.3372
total infer time(including all above plus gpu compute): 22.4147
cpu: 100%|██████████| 35/35 [00:21<00:00, 1.65it/s]
total disk+gpu loading time: 17.9343
total infer time(including all above plus gpu compute): 22.9992
cpu: 100%|██████████| 35/35 [00:17<00:00, 1.99it/s]
total disk+gpu loading time: 17.1098
total infer time(including all above plus gpu compute): 22.2515
cpu: 100%|██████████| 35/35 [00:16<00:00, 2.11it/s]
total disk+gpu loading time: 17.0788
total infer time(including all above plus gpu compute): 22.1737
cpu: 100%|██████████| 35/35 [00:19<00:00, 1.80it/s]
total disk+gpu loading time: 16.7922
total infer time(including all above plus gpu compute): 21.9173
cpu: 100%|██████████| 35/35 [00:16<00:00, 2.13it/s]
total disk+gpu loading time: 16.9438
total infer time(including all above plus gpu compute): 21.9450
cpu: 100%|██████████| 35/35 [00:16<00:00, 2.17it/s]
total disk+gpu loading time: 16.6218
total infer time(including all above plus gpu compute): 21.6643
<s> What is the capital of United States?
The correct answer is Washington, D.C..</s>
```
Trying different parameters to see how they affect the run:
* First, adjusted max_new_tokens, which controls the number of output tokens. When set to 2, inference becomes very fast but only a single word is output. Tuning this value repeatedly, I found that the number of inference passes tracks this token count: each pass generates exactly one token (see the timing sketch after the output below).
* Setting profiling_mode=True prints the time taken by each inference pass:
```
total disk+gpu loading time: 16.6218
total infer time(including all above plus gpu compute): 21.6643
```
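To make the one-token-per-pass behavior visible, here is a minimal timing sketch (my own addition, reusing `model` and `input_tokens` from the script above): total time should grow roughly linearly with max_new_tokens, since every extra token triggers another full pass over the layer shards.
```python
# Minimal timing sketch, reusing `model` and `input_tokens` from the script above.
import time

for n in (2, 5, 20):
    start = time.time()
    model.generate(input_tokens['input_ids'],
                   max_new_tokens=n,
                   use_cache=True,
                   return_dict_in_generate=True)
    print(f"max_new_tokens={n}: {time.time() - start:.1f}s")
```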
## Trying to Run AirLLM on GPU:
* Running on the GPU failed with a "Torch not compiled with CUDA enabled" error. The problem is discussed in a GitHub issue: GPU inference is not currently supported on Mac machines.
```
AssertionError: Torch not compiled with CUDA enabled
```
Searching the GitHub repository's issues shows that someone has already reported this ([Does'n t work on Apple M1/M2. AssertionError: Torch not compiled with CUDA enabled. · Issue #47 · lyogavin/Anima (github.com)](https://github.com/lyogavin/Anima/issues/47)); the developer replied that GPU inference is not supported on Macs for now.
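A quick way to confirm this on Apple silicon (my own aside, not from the issue): PyTorch on an M1/M2 reports no CUDA device, and the Mac GPU is only reachable through the separate MPS backend, which AirLLM's GPU path does not use.
```python
# Check which accelerator backends this PyTorch build exposes.
import torch

print(torch.cuda.is_available())          # False on Apple M1/M2 -> the AssertionError
print(torch.backends.mps.is_available())  # True, but AirLLM's GPU path expects CUDA
```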
## Trying to Preload into Memory:
* Reading the AirLLM source shows that in the forward pass the model processes the input layer by layer, freeing GPU memory after each layer completes.
* It mentions an optimization: keep all intermediate activations in RAM to reduce I/O and speed up the model.
Looking at the AirLLM source code:
> ```
> Sharded version of LlamaForCausalLM : the model is splitted into layer shards to reduce GPU memory usage.
> During the forward pass, the inputs are processed layer by layer, and the GPU memory is freed after each layer.
> To avoid loading the layers multiple times, we could save all the intermediate activations in RAM.
> ```
So during AirLLM's forward pass, the input is processed layer by layer, and GPU memory is freed once each layer finishes. To avoid loading the layers multiple times, all intermediate activations can be kept in RAM, which reduces I/O and makes the model run faster.
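To make the idea concrete, here is a minimal sketch of the layer-by-layer scheme (my own illustration, not AirLLM's actual code; load_layer and the Linear stand-ins are hypothetical): one layer shard at a time is loaded, applied to every input, and freed, while all intermediate activations stay in RAM, so each shard is read from disk only once.
```python
import torch
import torch.nn as nn

# Hypothetical stand-in for loading one layer shard from disk;
# in AirLLM this would be a transformer decoder layer.
def load_layer(i: int) -> nn.Module:
    torch.manual_seed(i)
    return nn.Linear(16, 16)

def sharded_forward(num_layers: int, activations: list) -> list:
    """Push every input through layer i before moving on to layer i+1,
    keeping all intermediate activations in RAM so each layer shard
    only has to be loaded once."""
    for i in range(num_layers):
        layer = load_layer(i)                  # load one shard
        with torch.no_grad():
            activations = [layer(h) for h in activations]
        del layer                              # free the weights, keep activations
    return activations

outputs = sharded_forward(4, [torch.randn(16) for _ in range(3)])
print(len(outputs), outputs[0].shape)
```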