# Dataset loading methods

Method 1: the simpler approach, suitable for a single dataset. Specify DATA_PATH / MERGE_FILE / VOCAB_FILE and the train/valid/test split ratio; the actual splitting is done when the main training program runs. The example in the previous section uses this method. In the snippet below, `--split 949,50,1` gives the relative train/valid/test proportions (roughly 94.9% / 5% / 0.1%).

```bash
VOCAB_FILE=$MEGATRON_DEEPSPEED_REPO/data/gpt2-vocab.json
MERGE_FILE=$MEGATRON_DEEPSPEED_REPO/data/gpt2-merges.txt
DATA_PATH=$MEGATRON_DEEPSPEED_REPO/data/meg-gpt2-oscar-en-10k_text_document

export CMD=" \
    ...
    --data-path $DATA_PATH \
    ...
    --split 949,50,1 \
    ...
```

Method 2: suitable for multiple datasets. Prepare a JSON file containing each dataset's ratio (e.g. training_dataset_ratios_merged_nigercongo_v3.json, referenced as CATALOGUE_JSON_PATH), and in the training script convert it into train-splits.txt / valid-splits.txt. Taking train as an example, the generated file mainly contains the weighted list of dataset splits and paths (figure: partial content of train-splits.txt; an illustrative sketch follows the code block below).

```bash
TRAIN_DATA_PATH=$MEGATRON_DEEPSPEED_REPO/data/train-splits.txt
VALID_DATA_PATH=$MEGATRON_DEEPSPEED_REPO/data/valid-splits.txt
CATALOGUE_JSON_PATH=$BIGSCIENCE_REPO/data/catalogue/training_dataset_ratios_merged_nigercongo_v3.json
LOAD_RATIOS_SCRIPT=$BIGSCIENCE_REPO/data/catalogue/load_ratios_meg_ds_format.py

python $LOAD_RATIOS_SCRIPT --dataset-ratios-path $CATALOGUE_JSON_PATH --split train --output-meg-ds-ratio-file $TRAIN_DATA_PATH
python $LOAD_RATIOS_SCRIPT --dataset-ratios-path $CATALOGUE_JSON_PATH --split valid --output-meg-ds-ratio-file $VALID_DATA_PATH

export CMD=" \
    ...
    --train-weighted-split-paths-path $TRAIN_DATA_PATH \
    --valid-weighted-split-paths-path $VALID_DATA_PATH \
    ...
```
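A minimal sketch of what the generated train-splits.txt may look like, standing in for the screenshot that is not reproduced here. It assumes the `"GIVEN_NAME: WEIGHT1 START:END PATH1, WEIGHT2 START:END PATH2, ..."` convention described for `--train-weighted-split-paths` in the bigscience Megatron-DeepSpeed fork; the weights, split ranges, and dataset paths below are placeholders, not real catalogue values.

```bash
# Illustrative only: weights, split ranges, and paths are made-up placeholders.
# Each entry is "WEIGHT START:END PATH"; WEIGHT up/down-samples that dataset and
# START:END selects the portion of it used for this split.
cat $TRAIN_DATA_PATH
# "train: 0.7 0:0.949 /path/to/dataset_a_text_document, 0.3 0:0.949 /path/to/dataset_b_text_document"
```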
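For reference, the catalogue JSON fed to `load_ratios_meg_ds_format.py` pairs each preprocessed dataset with its sampling ratio. The structure below is a hypothetical illustration only: the field names (`dataset_path`, `ratio`) and values are assumptions, so consult the actual file in the bigscience repo for the real schema.

```bash
# Hypothetical structure -- field names and values are assumptions for
# illustration, not the real catalogue contents.
cat $CATALOGUE_JSON_PATH
# [
#   {"dataset_path": "/path/to/dataset_a_text_document", "ratio": 0.7},
#   {"dataset_path": "/path/to/dataset_b_text_document", "ratio": 0.3}
# ]
```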