# arXiv dataset downloading and preprocessing
## Task
1. Download the arXiv data from Amazon S3 to EC2
2. Preprocess the data
3. Filter the data and compute related statistics
## AWS
Connect to the EC2 instance:
```
ssh ubuntu@ec2-3-137-149-71.us-east-2.compute.amazonaws.com -i llm-vm.pem
```
## Download arXiv source data from Amazon S3
1. Use the RedPajama script ✔
Slurm has to be set up before downloading; the next section covers installation and the download steps.
https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv
2. Follow the official arXiv bulk-data documentation (see the S3 download sketch after this list)
Full Text via S3 - arXiv info
https://info.arxiv.org/help/bulk_data_s3.html#src
3. Use arxiv-tools
Following its README does not require Slurm, but the downloaded data ends up one record short (reason unknown).
https://github.com/armancohan/arxiv-tools
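For reference, a minimal sketch (not part of the RedPajama scripts) of pulling source tarballs straight from arXiv's requester-pays S3 bucket with boto3; the example key name is hypothetical, and valid AWS credentials are assumed:
```python=
import boto3

# arXiv's bulk source data lives in the requester-pays bucket "arxiv"
# (us-east-1); the requester pays for the transfer.
s3 = boto3.client("s3", region_name="us-east-1")

# List a few of the source tarballs under the src/ prefix.
resp = s3.list_objects_v2(
    Bucket="arxiv", Prefix="src/", MaxKeys=5, RequestPayer="requester"
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download one tarball (hypothetical key; real names look like arXiv_src_YYMM_NNN.tar).
s3.download_file(
    Bucket="arxiv",
    Key="src/arXiv_src_2003_001.tar",
    Filename="arXiv_src_2003_001.tar",
    ExtraArgs={"RequestPayer": "requester"},
)
```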
## Installing Slurm and downloading arXiv
```
sudo apt update -y
sudo apt install slurm-wlm slurm-wlm-doc -y
```
* Use `slurmd -C` to check the hardware configuration of the current node; its output can be used for the `NodeName` line in the config below

* Set up the Slurm configuration (located under /etc/slurm/)
* Fill out the official configurator form; submitting it generates a slurm.conf
https://slurm.schedmd.com/configurator.html
* The config currently in use is shown below
```
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=ip-172-31-45-84
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
SlurmctldTimeout=3600
SlurmdTimeout=300
BatchStartTimeout=3600
PropagateResourceLimits=NONE
#
#
# SCHEDULING
#FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
#SelectTypeParameters=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageType=accounting_storage/none
ClusterName=mycluster
#JobAcctGatherFrequency=30
#JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
#SlurmctldLogFile=
#SlurmdDebug=3
#SlurmdLogFile=
#
# Acct
AccountingStorageEnforce=1
#AccountingStorageLoc=/opt/slurm/acct
#AccountingStorageType=accounting_storage/filetxt
JobCompLoc=/opt/slurm/jobcomp
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#
# COMPUTE NODES
NodeName=ip-172-31-45-84 CPUs=8 State=UNKNOWN
PartitionName=debug Nodes=ip-172-31-45-84 Default=YES MaxTime=INFINITE State=UP
```
* Create the directories referenced by the config and set up their permissions
```
rm -rf /var/spool/slurm-llnl
mkdir /var/spool/slurm-llnl
chown -R slurm.slurm /var/spool/slurm-llnl
rm -rf /var/run/slurm-llnl/
mkdir /var/run/slurm-llnl/
chown -R slurm.slurm /var/run/slurm-llnl/
sudo mkdir -p /opt/slurm
sudo chmod -Rf 777 /opt/slurm
cd /opt/slurm
touch acct
touch jobcomp
```
* Start the services and enable them at boot
```
systemctl start slurmd
systemctl enable slurmd
systemctl start slurmctld
systemctl enable slurmctld
```
⚠ Make sure both slurmd and slurmctld are actually running


* Check the current Slurm status (e.g. with `sinfo` and `systemctl status slurmd slurmctld`)

* Shut down Slurm (only after the arXiv download has finished)
```
systemctl stop slurmd
systemctl disable slurmd
systemctl stop slurmctld
systemctl disable slurmctld
```
* Once Slurm is installed, the arXiv download can be started with
```
bash scripts/arxiv-kickoff-download.sh
```
* If the submitted job fails or stays pending due to memory limits
* Check the memory reported for the node (`mem=1M` in this case) and change `--mem-per-cpu` in arxiv-download-slurm.sbatch to `1M`

* Use the `squeue` command to monitor download progress

* Download complete
The download yields a set of tar files; unpacking them leaves a pile of gz files (with some pdf files mixed in; the pdfs are filtered out later and not cleaned). A sketch of this unpacking step follows.
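A minimal sketch of that unpacking step, assuming the downloaded tars sit in a hypothetical `./arxiv_src/` directory: the per-paper `.gz` members are kept and everything else (including the pdfs) is skipped.
```python=
import os
import glob
import tarfile

SRC_DIR = "./arxiv_src"   # hypothetical: the downloaded arXiv_src_*.tar files
OUT_DIR = "./arxiv_gz"    # hypothetical: where the per-paper .gz files go
os.makedirs(OUT_DIR, exist_ok=True)

for tar_path in glob.glob(os.path.join(SRC_DIR, "*.tar")):
    with tarfile.open(tar_path) as tf:
        for member in tf.getmembers():
            # Each tar mixes per-paper .gz sources with some .pdf files;
            # only the .gz sources are kept for cleaning.
            if member.isfile() and member.name.endswith(".gz"):
                member.name = os.path.basename(member.name)  # flatten paths
                tf.extract(member, OUT_DIR)
```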
 
## Preprocessing
The downloaded papers are in LaTeX format and need to be cleaned and saved as JSON.
1. Use the RedPajama script
```
bash scripts/arxiv-kickoff-cleaning.sh
```
* Running it did not actually clean anything, so arxiv_cleaner.py (RedPajama's data-cleaning code) was tried directly instead
* Conclusion: it cleans too aggressively, stripping out the title, author, abstract, and similar content
2. Search online for an existing package or code
* Conclusion: nothing suitable was found
3. Write it ourselves ✔
* Idea: after inspecting the paper contents, use regexes to pull out the needed content (references excluded)
* Approach: split each paper into category, title, author, abstract, and section (each section is further split into title, text, and subsection)
* Runtime: roughly 7 hours (for more than 2 million papers)
* ⚠ The downloaded papers are gz files; read them with `ISO-8859-1` encoding, not `utf-8`
* The result looks like this
⚠ Some papers have no category, so it is set to `"category": ""`
⚠ Some papers have no section headings after the abstract, so that body text is kept as one single block
```json=
{
    "id": "cond-mat0001001",
    "category": "cond-mat",
    "title": " Statistical thermodynamics of membrane bending mediated\nprotein-protein attractions",
    "author": "Tom Chou",
    "abstract": "Integral membrane proteins deform the surrounding bilayer\ncreating long-ranged forces that influence distant proteins. \nThese forces can be attractive or repulsive, depending on the\nproteins' shape, height, contact angle with the bilayer, as\nwell as the local membrane curvature.",
    "section": [
        {
            "title": "Introduction",
            "text": "Membrane proteins interact directly via screened\\nelectrostatic, van der Waal's, and hydrophobic forces. \\nThese are short ranged, operating typically over distances\\nof less than a nanometer.",
            "subsection": []
        },
        {
            "title": "Membrane inclusions and height deformation",
            "text": "Small membrane deformations (on the scale of the lipid or protein\\nmolecules) can be accurately modeled using standard plate theory.",
            "subsection": []
        },
        {
            "title": "Rotationally averaged interactions",
            "text": "",
            "subsection": [
                {
                    "title": "Zero background curvature",
                    "text": "\\n\\nFirst consider the case of two isolated proteins embedded in a\\nflat membrane. In the absence of external mechanical forces\\nthat impose background membrane deformations, and with other\\ninclusions sufficiently far away."
                }
            ]
        }
    ]
}
```
* Code
```python=
import re


def extract_subsection(text):
    # Grab every \subsection{...} title together with the text that follows it,
    # up to the next \subsection or the end of the section.
    subsection_pattern = r'\\subsection\*?\s*\{([\s\S]*?)\}([\s\S]+?)(?=(?:\\subsection)|\Z)'
    subsection_matches = re.findall(subsection_pattern, text, re.DOTALL)
    subsections = []
    for subsection_match in subsection_matches:
        title = subsection_match[0].strip()
        if "\\" in title:
            continue
        content = subsection_match[1].strip()
        subsections.append({'title': title, 'text': content})
    return subsections


def extract_latex_content(latex_text, file_name):
    # Strip simple font commands and comment characters.
    latex_text = re.sub(r'\\(bf|em|Large|sc|LARGE|ls|normalsize)', '', latex_text)
    latex_text = re.sub(r'%', '', latex_text)
    # Drop everything from the bibliography onwards (references are not kept).
    index = latex_text.find("\\begin{thebibliography}")
    if index != -1:
        latex_text = latex_text[:index]
    index = latex_text.find("\\begin{references}")
    if index != -1:
        latex_text = latex_text[:index]

    title_pattern = r'\\title\s*\{([\s\S]*?)\\'
    title_pattern2 = r'\\title\s*\{([\s\S]*?)\}'
    author_pattern = r'\\author\s*\{([\s\S]*?)\}'
    author_pattern2 = r'\\author\s*\{([\s\S]*?)\\'
    abstract_pattern = r'\\begin\s*\{abstract\}([\s\S]*?)\\end\{abstract\}'
    abstract_pattern2 = r'\s*\wbstract\}([\s\S]*?)(?=(?:\\)|\Z)'
    abstract_pattern3 = r'\\abstract\s*\{([\s\S]*?)\}'
    abstract_pattern4 = r'\\begin\{center\}\s*\{\s*Abstract\s*\}\s*\\end\{center\}\s*([\s\S]*?)\\'

    extracted_content = {}

    # id: taken from the file name, e.g. cond-mat0001001.gz -> cond-mat0001001
    id_pattern = r'(.+)\.gz'
    id_match = re.search(id_pattern, file_name)
    if id_match:
        extracted_content['id'] = id_match.group(1)
    else:
        extracted_content['id'] = ''

    # category: the alphabetic prefix of the file name, e.g. cond-mat
    category_pattern = r'([A-Za-z-]+)\d+\.gz'
    category_match = re.search(category_pattern, file_name)
    if category_match:
        extracted_content['category'] = category_match.group(1)
    else:
        extracted_content['category'] = ''

    # title: try both patterns and keep the longer match
    latex_text_tmp = re.sub(r'\\\\', '', latex_text)
    title_match = re.search(title_pattern, latex_text_tmp)
    title_match2 = re.search(title_pattern2, latex_text_tmp)
    if title_match and title_match2:
        extracted = re.sub(r'[^A-Za-z\s.,-]', '', title_match.group(1).strip())
        extracted2 = re.sub(r'[^A-Za-z\s.,-]', '', title_match2.group(1).strip())
        extracted_content['title'] = extracted if len(extracted) > len(extracted2) else extracted2
    elif title_match:
        extracted_content['title'] = re.sub(r'[^A-Za-z\s.,-]', '', title_match.group(1).strip())
    elif title_match2:
        extracted_content['title'] = re.sub(r'[^A-Za-z\s.,-]', '', title_match2.group(1).strip())
    else:
        extracted_content['title'] = ''

    # author: same strategy as the title
    author_match = re.search(author_pattern, latex_text)
    author_match2 = re.search(author_pattern2, latex_text)
    if author_match and author_match2:
        extracted = re.sub(r'[^A-Za-z\s.,-]', '', author_match.group(1).strip())
        extracted2 = re.sub(r'[^A-Za-z\s.,-]', '', author_match2.group(1).strip())
        extracted_content['author'] = extracted if len(extracted) > len(extracted2) else extracted2
    elif author_match:
        extracted_content['author'] = re.sub(r'[^A-Za-z\s.,-]', '', author_match.group(1).strip())
    elif author_match2:
        extracted_content['author'] = re.sub(r'[^A-Za-z\s.,-]', '', author_match2.group(1).strip())
    else:
        extracted_content['author'] = ''

    # abstract: try the patterns in order until one matches
    abstract_match = re.search(abstract_pattern, latex_text, re.DOTALL)
    abstract_match2 = re.search(abstract_pattern2, latex_text, re.DOTALL)
    abstract_match3 = re.search(abstract_pattern3, latex_text, re.DOTALL)
    abstract_match4 = re.search(abstract_pattern4, latex_text, re.DOTALL)
    if abstract_match:
        extracted_content['abstract'] = abstract_match.group(1).strip()
    elif abstract_match2:
        extracted_content['abstract'] = abstract_match2.group(1).strip()
    elif abstract_match3:
        extracted_content['abstract'] = abstract_match3.group(1).strip()
    elif abstract_match4:
        extracted_content['abstract'] = abstract_match4.group(1).strip()
    else:
        extracted_content['abstract'] = ''
    if len(extracted_content['abstract']) < 10:
        extracted_content['abstract'] = ''

    # section and subsection
    section_pattern = r'\\section\*?\s*\{([\s\S]*?)\}([\s\S]+?)(?=(?:\\section)|\Z)'
    section_matches = re.findall(section_pattern, latex_text, re.DOTALL)
    sections = []
    for section_match in section_matches:
        title = section_match[0].strip()
        if "\\" in title:
            continue
        content = section_match[1].strip()
        content = content.encode('unicode_escape').decode('utf-8', errors='ignore')
        index = content.find("\\x0")
        if index != -1:
            content = content[:index]
        subsections = extract_subsection(content)
        if "subsection" in content:
            content = ''
        sections.append({'title': title, 'text': content, 'subsection': subsections})
    if sections == []:
        # No \section found: keep everything after the abstract as one block.
        start = latex_text.find("\\end{abstract}")
        if start != -1:
            content = latex_text[start + len("\\end{abstract}"):]
            sections.append({'title': '', 'text': content, 'subsection': []})
    extracted_content['section'] = sections
    return extracted_content
```
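For context, a minimal driver sketch (not the original script) showing how `extract_latex_content` can be applied to the per-paper gz files; the `./arxiv_gz/` and `./arxiv_json/` paths are hypothetical, and the files are read with `ISO-8859-1` as noted above.
```python=
import os
import glob
import gzip
import json

IN_DIR = "./arxiv_gz"     # hypothetical: per-paper .gz sources
OUT_DIR = "./arxiv_json"  # hypothetical: one cleaned .json per paper
os.makedirs(OUT_DIR, exist_ok=True)

for gz_path in glob.glob(os.path.join(IN_DIR, "*.gz")):
    file_name = os.path.basename(gz_path)
    # ISO-8859-1 avoids the decode errors that utf-8 raises on some sources.
    with gzip.open(gz_path, 'rt', encoding='ISO-8859-1') as f:
        latex_text = f.read()
    record = extract_latex_content(latex_text, file_name)
    out_path = os.path.join(OUT_DIR, file_name.replace(".gz", ".json"))
    with open(out_path, "w") as out:
        json.dump(record, out, ensure_ascii=False, indent=4)
```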
## Data Filter
Almost every paper carries one of the categories shown on the arXiv site, and the category appears in the file name (exception: a few papers have no category, reason unknown).
https://arxiv.org/category_taxonomy
1. Filter out the electricity-related papers (see the filtering/statistics sketch after the format example below)
* File names containing `eess`, `cond-mat`, or `physics` are treated as electricity-related categories
* There are `2096748` papers in total, of which `86035` are electricity-related
2. Count papers and tokens
* Summarize the filtered electricity-related papers into the JSON format below
* `each_file_token`: the token count of each field for every individual paper
* `each_file_token_total`: the total token count of each field across all papers
* `total`: the number of papers and the grand total of tokens summed over all fields of all papers
* Token counts are computed as follows
```python=
import tiktoken

# `string` is the text of one field; tokens are counted with cl100k_base.
encoding = tiktoken.get_encoding("cl100k_base")
num_tokens = len(encoding.encode(string))
```
* Total token count: `3133613908` ( "category": `233005`, "title": `1094361`, "author": `1068058`, "abstract": `17482189`, "section": `3113736295` )
* ==Update v2== total token count: `1081346298` ( "category": `233005`, "title": `1095132`, "author": `1068929`, "abstract": `17494974`, "section": `1061454258` )
* ==Update v3== total token count: `1080476015` ( "id": `491110`, "category": `233005`, "title": `1081319`, "author": `1064310`, "abstract": `13022915`, "section": `1064583356` )
* The format is as follows
```json=
{
    "each_file_token": [
        {
            "file": "physics0610057.json",
            "id": 4,
            "category": 1,
            "title": 19,
            "author": 5,
            "abstract": 453,
            "section": 22435
        },
        {
            "file": "cond-mat0001001.json",
            "id": 6,
            "category": 3,
            "title": 12,
            "author": 3,
            "abstract": 240,
            "section": 14532
        },
        {
            "file": "cond-mat0608474.json",
            "id": 6,
            "category": 3,
            "title": 10,
            "author": 5,
            "abstract": 118,
            "section": 5260
        }
    ],
    "each_file_token_total": {
        "id": 16,
        "category": 7,
        "title": 41,
        "author": 13,
        "abstract": 811,
        "section": 42227
    },
    "total": {
        "file": 3,
        "token": 43115
    }
}
```
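A minimal sketch of the filtering and token counting described above (not the original script), assuming the cleaned per-paper JSON files live in a hypothetical `./arxiv_json/` directory; nested fields are serialized to a string before counting, which may differ slightly from how the original script counts them.
```python=
import os
import glob
import json
import tiktoken

IN_DIR = "./arxiv_json"   # hypothetical: one cleaned .json per paper
KEYWORDS = ("eess", "cond-mat", "physics")   # treated as electricity-related
FIELDS = ("id", "category", "title", "author", "abstract", "section")

encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(value):
    # Nested structures (the "section" list) are serialized before counting.
    if not isinstance(value, str):
        value = json.dumps(value, ensure_ascii=False)
    return len(encoding.encode(value))

stats = {"each_file_token": [],
         "each_file_token_total": {f: 0 for f in FIELDS},
         "total": {"file": 0, "token": 0}}

for path in glob.glob(os.path.join(IN_DIR, "*.json")):
    name = os.path.basename(path)
    if not any(k in name for k in KEYWORDS):
        continue   # keep only the electricity-related categories
    with open(path) as f:
        paper = json.load(f)
    entry = {"file": name}
    for field in FIELDS:
        n = count_tokens(paper.get(field, ""))
        entry[field] = n
        stats["each_file_token_total"][field] += n
        stats["total"]["token"] += n
    stats["each_file_token"].append(entry)
    stats["total"]["file"] += 1

with open("token_stats.json", "w") as out:
    json.dump(stats, out, ensure_ascii=False, indent=4)
```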
## Issues encountered
* Problem: the file sizes are unreasonable
>More than 2 million arXiv papers (not preprocessed): 3.2 TB total
More than 2 million arXiv papers (preprocessed): 3.2 TB total
More than 80 thousand filtered electricity-related papers (preprocessed): 7 GB total
* Finding: some papers produce large amounts of garbage characters (such as `\u0000`), which makes those files excessively large

* Fix: use [RedPajama's arxiv_cleaner.py](https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/arxiv/arxiv_cleaner.py) to clean the papers that produce heavy garbage, while all other papers are still cleaned with the original custom code
* Result v2
>More than 2 million arXiv papers (not preprocessed): 3.2 TB total
More than 2 million arXiv papers (preprocessed): 489 GB total
More than 80 thousand filtered electricity-related papers (preprocessed): 3.3 GB total
### ⚠ New finding (an alternative fix that works better)
* Finding: the heavy garbage output (such as `\u0000`) actually comes from the gz archives bundling many files that are not LaTeX sources; reading those files produces the garbage characters
* Fix: modify the original custom code to read only the .tex files; since this alone solves the garbage problem, RedPajama's arxiv_cleaner.py is no longer used
```python=
import gzip
import tarfile

# file_path points at one downloaded per-paper .gz file.
latex_content = ''
with gzip.open(file_path, 'rt', encoding='ISO-8859-1') as f:
    try:
        # Most downloads are gzipped tarballs; read only the .tex members.
        with tarfile.open(file_path) as sub_tf:
            for member in sub_tf.getmembers():
                if member.name.endswith(".tex"):
                    extracted = sub_tf.extractfile(member)
                    if extracted is None:
                        continue
                    latex = extracted.read()
                    try:
                        latex = latex.decode('utf-8')
                    except UnicodeDecodeError:
                        latex = latex.decode('latin-1', errors='replace')
                    latex_content += latex
    except tarfile.ReadError:
        # Not a tarball: the gz holds a single LaTeX source, read it directly.
        latex_content = f.read()
```
* Result v3
>More than 2 million arXiv papers (not preprocessed): 3.2 TB total
More than 2 million arXiv papers (preprocessed): 115 GB total
More than 80 thousand filtered electricity-related papers (preprocessed): 3.3 GB total
## Merge json files
Merge the 80-thousand-plus filtered electricity-related JSON files into a single JSONL file.
* Code
```python=
import os
import glob
import json

if __name__ == "__main__":
    folder_path = os.getcwd()
    target_dir = os.path.join(folder_path, "electricity")
    assert os.path.isdir(target_dir)
    files = glob.glob(os.path.join(target_dir, "*.json"))
    print(files)
    # Write one JSON object per line (JSONL).
    with open("electricity.jsonl", "w") as out:
        for file in files:
            with open(file, "r") as f:
                data = json.load(f)
            out.write(json.dumps(data, ensure_ascii=False) + "\n")
    print("Finish")
```
* The format is as follows
```jsonl=
{"id": "cond-mat0002097", "category": "cond-mat", "title": "Charge localization and phonon spectra in hole doped LaNiO", "author": "R. J. McQueeney, A. R. Bishop, and Ya-Sha Yi", "abstract": "The in-plane oxygen vibrations in La$_{2}$NiO$_{4}$ are investigated for \nseveral hole-doping concentrations both theoretically and experimentally via \ninelastic neutron scattering. Using an inhomogeneous Hartree-Fock plus RPA \nnumerical method in a two-dimensional Peierls-Hubbard model, it is\nfound that the doping induces stripe ordering of localized charges,\nand that the strong electron-lattice coupling causes the in-plane \noxygen modes to split into two subbands. This result\nagrees with the phonon band splitting observed by inelastic neutron \nscattering in La$_{2-x}$Sr$_{x}$NiO$_{4}$.\nPredictions of strong electron-lattice coupling in La$_{2}$NiO$_{4}$,\nthe proximity of both oxygen-centered and nickel-centered charge\nordering, and the relation between charged stripe ordering and the\nsplitting of the in-plane phonon band upon doping are emphasized.", "section": [{"title": "Appendixes", "text": "", "subsection": []}]}
{"id": "cond-mat0004401", "category": "cond-mat", "title": "Particle dynamics in sheared granular matter", "author": "W. Losert, L. Bocquet,, T.C. Lubensky, \nand J.P. Gollub,", "abstract": "The particle dynamics and shear forces of granular\nmatter in a Couette geometry are determined experimentally. \nThe normalized tangential velocity $V(y)$ declines strongly with distance\n$y$ from the moving wall,\nindependent of the shear rate and of the shear dynamics.\nLocal RMS velocity fluctuations \n$\\delta V(y)$\nscale with the local velocity gradient to the power $0.4 \\pm 0.05$. \nThese results agree with a locally Newtonian, \ncontinuum model, where the granular medium is assumed to behave as a \nliquid with a local temperature $\\delta V(y)^2$ and density dependent\nviscosity.", "section": [{"title": "Acknowledgments", "text": "We thank A. Liu, H. Jaeger and C. Bizon for helpful discussions.\\nPart of this work was supported by the National Science Foundation under \\nGrant DMR-9704301", "subsection": []}]}
```
>**⚠ Some fields being blank is expected, for the following reasons:**
>1. A small number of papers use inconsistent formatting, so that information cannot be extracted.
>2. The paper simply never contained that field's information.
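As a quick sanity check, a minimal sketch (assuming the merged file is named electricity.jsonl as above) that reads the JSONL back and counts how many records leave each field blank:
```python=
import json
from collections import Counter

blank_counts = Counter()
total = 0
with open("electricity.jsonl") as f:
    for line in f:
        paper = json.loads(line)
        total += 1
        for field in ("id", "category", "title", "author", "abstract", "section"):
            if paper.get(field, "") in ("", []):
                blank_counts[field] += 1

print(f"{total} papers merged")
print("blank fields:", dict(blank_counts))
```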
## Supplementary notes
ArXiv metadata set
https://github.com/mattbierbaum/arxiv-public-datasets
A community-curated arXiv dataset (JSON) is available, but it only contains the fields below and not the full paper text.
* The format is as follows
```json=
{
    "id": ArXiv ID,
    "submitter": who submitted the paper,
    "authors": the paper's authors,
    "title": the paper's title,
    "comments": additional information, e.g. number of pages and figures,
    "journal-ref": information about the journal the paper was published in,
    "doi": https://www.doi.org,
    "abstract": the paper's abstract,
    "categories": categories / tags in the arXiv system,
    "versions": version history
}
```