# Data Processing
## :memo: Where do I start?
- Contact: email 2303117@narlabs.org.tw (Ms. Wang)
### Prerequisites
- Download the container image
```
singularity pull c00cjz00/c00cjz00_cuda11.8_pytorch:2.1.2-cuda11.8-cudnn8-devel-llama_factory
```
- Packages required for a native (non-container) installation
```
# Install the required Ubuntu packages
apt install libfontconfig libaio-dev libibverbs-dev jq
# Install the LLaMA-Factory related packages
pip install llmtuner==0.5.3 deepspeed==0.13.1 bitsandbytes==0.42.0 opencc opencc-python-reimplemented
```
## Hardware Resource Estimate
### [Download the michaelwzhu/ChatMed_Consult_Dataset medical data](https://huggingface.co/datasets/michaelwzhu/ChatMed_Consult_Dataset)
```
import json
from datasets import load_dataset
import opencc

# op_cc = opencc.OpenCC('s2t')  # Simplified to Traditional Chinese
op_cc = opencc.OpenCC('s2twp')  # Simplified to Traditional Chinese (Taiwan standard, with Taiwanese phrasing)

# Load the dataset in streaming mode
dataset = load_dataset("michaelwzhu/ChatMed_Consult_Dataset", split="train", streaming=True, encoding='utf-8')

# Extract the required fields into a list of dicts,
# converting Simplified Chinese to Taiwan Traditional Chinese along the way
limit = 0  # 0 never matches the length check below, so every record is processed
extracted_data = []
for example in dataset:
    extracted_example = {
        "instruction": op_cc.convert("现在你是一名专业的中医医生,请用你的专业知识提供详尽而清晰的关于中医问题的回答。"),
        "input": op_cc.convert(example["query"]),
        "output": op_cc.convert(example["response"])
    }
    extracted_data.append(extracted_example)
    if len(extracted_data) == limit:
        break

# Name of the output JSON file
json_filename = "data.json"

# Write the JSON file
with open(json_filename, "w") as json_file:
    json.dump(extracted_data, json_file, indent=4)

print(f"Data extracted and saved as {json_filename}")
```
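In the script above, `limit=0` means the `len(extracted_data) == limit` check never fires, so the entire stream is consumed. If only the first N examples are needed, the same field mapping can be sketched with an explicit limit via `itertools.islice`. This is a minimal sketch, not the original author's method: `to_record`, `extract`, and the `convert` parameter are hypothetical names, and `convert` stands in for `op_cc.convert`.

```python
from itertools import islice
from typing import Callable, Iterable, Optional


def to_record(example: dict, instruction: str, convert: Callable[[str], str]) -> dict:
    """Map one example (query/response) to instruction/input/output fields."""
    return {
        "instruction": convert(instruction),
        "input": convert(example["query"]),
        "output": convert(example["response"]),
    }


def extract(examples: Iterable[dict], instruction: str,
            convert: Callable[[str], str], limit: Optional[int] = None) -> list:
    """Convert at most `limit` examples; None means no limit."""
    # islice stops pulling from the (possibly streaming) iterable after `limit` items.
    selected = islice(examples, limit)
    return [to_record(ex, instruction, convert) for ex in selected]
```

Under these assumptions, a call such as `extract(dataset, prompt, op_cc.convert, limit=10000)` would stop streaming after 10,000 examples instead of downloading the full split.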
### Take the first 10,000 records and save them
```
import pandas as pd
df = pd.read_json('data.json')
dataset_df_10k = df[:10000]
dataset_df_10k.to_json('LLaMA-Factory/data/medicalQA_10k.json', orient='records')
dataset_df_10k
```
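Before the file is registered with LLaMA-Factory, it may be worth confirming that every record carries the `instruction`, `input`, and `output` fields. The following is a minimal sketch using only the standard library; `check_records` and `REQUIRED_FIELDS` are hypothetical names introduced for illustration.

```python
import json

# Field names written by the extraction step (assumed from the script in this document).
REQUIRED_FIELDS = ("instruction", "input", "output")


def check_records(path: str, required=REQUIRED_FIELDS) -> int:
    """Return the number of records, raising ValueError on a malformed file."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            raise ValueError(f"record {i} is not a JSON object")
        missing = [key for key in required if key not in rec]
        if missing:
            raise ValueError(f"record {i} is missing fields: {missing}")
    return len(records)
```

For the file written above, `check_records('LLaMA-Factory/data/medicalQA_10k.json')` should return the record count, at most 10,000.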

### Compute the SHA-1 checksum
```
!sha1sum LLaMA-Factory/data/medicalQA_10k.json
```
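Where the `sha1sum` command is unavailable (for example outside a notebook or on a non-Linux host), the same digest can be computed in Python. This is a minimal sketch with the standard `hashlib` module; `file_sha1` is a hypothetical helper name.

```python
import hashlib


def file_sha1(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read fixed-size chunks until f.read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned 40-character hex string is the value that goes into the `file_sha1` field of `dataset_info.json`.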
### Open LLaMA-Factory/data/dataset_info.json and add the following entry
```
"medical": {
  "file_name": "medicalQA_10k.json",
  "file_sha1": "0db55d43479e101e542f4cd719703ece3a65cc9d",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output"
  }
},
```
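Editing the file by hand makes it easy to leave a stray space in the hash or a misplaced comma. As an alternative, the entry could be added programmatically. This is a minimal sketch that assumes `dataset_info.json` is a single top-level JSON object keyed by dataset name; `register_dataset` is a hypothetical helper, not part of LLaMA-Factory.

```python
import json


def register_dataset(info_path: str, name: str, file_name: str, file_sha1: str) -> dict:
    """Add or overwrite one dataset entry in a dataset_info.json file."""
    with open(info_path, encoding="utf-8") as f:
        info = json.load(f)
    info[name] = {
        "file_name": file_name,
        # Strip whitespace so a copy-pasted hash does not fail the checksum comparison.
        "file_sha1": file_sha1.strip(),
        "columns": {
            "prompt": "instruction",
            "query": "input",
            "response": "output",
        },
    }
    with open(info_path, "w", encoding="utf-8") as f:
        json.dump(info, f, indent=2, ensure_ascii=False)
    return info[name]
```

For this guide the call would be along the lines of `register_dataset('LLaMA-Factory/data/dataset_info.json', 'medical', 'medicalQA_10k.json', sha1)`, where `sha1` is the checksum computed in the previous step.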