# arXiv dataset downloading and preprocessing
## Task
1. Download the arXiv data from Amazon S3 to EC2
2. Preprocess the data
3. Filter the data and compute related statistics
## AWS
Connect to the EC2 instance:
```
ssh ubuntu@ec2-3-137-149-71.us-east-2.compute.amazonaws.com -i llm-vm.pem
```
## Download arXiv source data from Amazon S3
1. Use the RedPajama script ✔
Slurm has to be set up before downloading; the next section covers installation and the download steps.
https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv
2. Follow the official arXiv bulk-data documentation (see the S3 download sketch after this list)
Full Text via S3 - arXiv info
https://info.arxiv.org/help/bulk_data_s3.html#src
3. Use arxiv-tools
Following its README does not require Slurm, but the downloaded data ends up one record short (reason unknown).
https://github.com/armancohan/arxiv-tools
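For reference, a minimal sketch (not part of the RedPajama scripts) of pulling source tarballs straight from arXiv's requester-pays S3 bucket with boto3; the example key name is hypothetical, and valid AWS credentials are assumed:
```python=
import boto3

# arXiv's bulk source data lives in the requester-pays bucket "arxiv"
# (us-east-1); the requester pays for the transfer.
s3 = boto3.client("s3", region_name="us-east-1")

# List a few of the source tarballs under the src/ prefix.
resp = s3.list_objects_v2(
    Bucket="arxiv", Prefix="src/", MaxKeys=5, RequestPayer="requester"
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download one tarball (hypothetical key; real names look like arXiv_src_YYMM_NNN.tar).
s3.download_file(
    Bucket="arxiv",
    Key="src/arXiv_src_2003_001.tar",
    Filename="arXiv_src_2003_001.tar",
    ExtraArgs={"RequestPayer": "requester"},
)
```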
## Installing Slurm and downloading arXiv
```
sudo apt update -y
sudo apt install slurm-wlm slurm-wlm-doc -y
```
* Use `slurmd -C` to check the hardware configuration of the current node; its output can be used for the `NodeName` line in the config below

* Set up the Slurm configuration (located under /etc/slurm/)
* Fill out the official configurator form; submitting it generates a slurm.conf
https://slurm.schedmd.com/configurator.html
* The config currently in use is shown below
```
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=ip-172-31-45-84
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
SlurmctldTimeout=3600
SlurmdTimeout=300
BatchStartTimeout=3600
PropagateResourceLimits=NONE
#
#
# SCHEDULING
#FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
#SelectTypeParameters=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageType=accounting_storage/none
ClusterName=mycluster
#JobAcctGatherFrequency=30
#JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
#SlurmctldLogFile=
#SlurmdDebug=3
#SlurmdLogFile=
#
# Acct
AccountingStorageEnforce=1
#AccountingStorageLoc=/opt/slurm/acct
#AccountingStorageType=accounting_storage/filetxt
JobCompLoc=/opt/slurm/jobcomp
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#
# COMPUTE NODES
NodeName=ip-172-31-45-84 CPUs=8 State=UNKNOWN
PartitionName=debug Nodes=ip-172-31-45-84 Default=YES MaxTime=INFINITE State=UP
```
* Create the directories referenced by the config and set up their permissions
```
rm -rf /var/spool/slurm-llnl
mkdir /var/spool/slurm-llnl
chown -R slurm.slurm /var/spool/slurm-llnl
rm -rf /var/run/slurm-llnl/
mkdir /var/run/slurm-llnl/
chown -R slurm.slurm /var/run/slurm-llnl/
sudo mkdir -p /opt/slurm
sudo chmod -Rf 777 /opt/slurm
cd /opt/slurm
touch acct
touch jobcomp
```
* Start the services and enable them at boot
```
systemctl start slurmd
systemctl enable slurmd
systemctl start slurmctld
systemctl enable slurmctld
```
⚠ Make sure both slurmd and slurmctld are actually running


* Check the current Slurm status (e.g. with `sinfo` and `systemctl status slurmd slurmctld`)

* Shut down Slurm (only after the arXiv download has finished)
```
systemctl stop slurmd
systemctl disable slurmd
systemctl stop slurmctld
systemctl disable slurmctld
```
* Once Slurm is installed, the arXiv download can be started with
```
bash scripts/arxiv-kickoff-download.sh
```
* If the submitted job fails or stays pending due to memory limits
* Check the memory reported for the node (`mem=1M` in this case) and change `--mem-per-cpu` in arxiv-download-slurm.sbatch to `1M`

* Use the `squeue` command to monitor download progress

* Download complete
The download yields a set of tar files; unpacking them leaves a pile of gz files (with some pdf files mixed in; the pdfs are filtered out later and not cleaned). A sketch of this unpacking step follows.
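A minimal sketch of that unpacking step, assuming the downloaded tars sit in a hypothetical `./arxiv_src/` directory: the per-paper `.gz` members are kept and everything else (including the pdfs) is skipped.
```python=
import os
import glob
import tarfile

SRC_DIR = "./arxiv_src"   # hypothetical: the downloaded arXiv_src_*.tar files
OUT_DIR = "./arxiv_gz"    # hypothetical: where the per-paper .gz files go
os.makedirs(OUT_DIR, exist_ok=True)

for tar_path in glob.glob(os.path.join(SRC_DIR, "*.tar")):
    with tarfile.open(tar_path) as tf:
        for member in tf.getmembers():
            # Each tar mixes per-paper .gz sources with some .pdf files;
            # only the .gz sources are kept for cleaning.
            if member.isfile() and member.name.endswith(".gz"):
                member.name = os.path.basename(member.name)  # flatten paths
                tf.extract(member, OUT_DIR)
```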
 
## Preprocessing
The downloaded papers are in LaTeX format and need to be cleaned and saved as JSON.
1. Use the RedPajama script
```
bash scripts/arxiv-kickoff-cleaning.sh
```
* Running it did not actually clean anything, so arxiv_cleaner.py (RedPajama's data-cleaning code) was tried directly instead
* Conclusion: it cleans too aggressively, stripping out the title, author, abstract, and similar content
2. Search online for an existing package or code
* Conclusion: nothing suitable was found
3. Write it ourselves ✔
* Idea: after inspecting the paper contents, use regexes to pull out the needed content (references excluded)
* Approach: split each paper into category, title, author, abstract, and section (each section is further split into title, text, and subsection)
* Runtime: roughly 7 hours (for more than 2 million papers)
* ⚠ The downloaded papers are gz files; read them with `ISO-8859-1` encoding, not `utf-8`
* The result looks like this
⚠ Some papers have no category, so it is set to `"category": ""`
⚠ Some papers have no section headings after the abstract, so that body text is kept as one single block
```json=
{
    "id": "cond-mat0001001",
    "category": "cond-mat",
    "title": " Statistical thermodynamics of membrane bending mediated\nprotein-protein attractions",
    "author": "Tom Chou",
    "abstract": "Integral membrane proteins deform the surrounding bilayer\ncreating long-ranged forces that influence distant proteins. \nThese forces can be attractive or repulsive, depending on the\nproteins' shape, height, contact angle with the bilayer, as\nwell as the local membrane curvature.",
    "section": [
        {
            "title": "Introduction",
            "text": "Membrane proteins interact directly via screened\\nelectrostatic, van der Waal's, and hydrophobic forces. \\nThese are short ranged, operating typically over distances\\nof less than a nanometer.",
            "subsection": []
        },
        {
            "title": "Membrane inclusions and height deformation",
            "text": "Small membrane deformations (on the scale of the lipid or protein\\nmolecules) can be accurately modeled using standard plate theory.",
            "subsection": []
        },
        {
            "title": "Rotationally averaged interactions",
            "text": "",
            "subsection": [
                {
                    "title": "Zero background curvature",
                    "text": "\\n\\nFirst consider the case of two isolated proteins embedded in a\\nflat membrane. In the absence of external mechanical forces\\nthat impose background membrane deformations, and with other\\ninclusions sufficiently far away."
                }
            ]
        }
    ]
}
```
* Code
```python=
import re


def extract_subsection(text):
    # Grab every \subsection{...} title together with the text that follows it,
    # up to the next \subsection or the end of the section.
    subsection_pattern = r'\\subsection\*?\s*\{([\s\S]*?)\}([\s\S]+?)(?=(?:\\subsection)|\Z)'
    subsection_matches = re.findall(subsection_pattern, text, re.DOTALL)
    subsections = []
    for subsection_match in subsection_matches:
        title = subsection_match[0].strip()
        if "\\" in title:
            continue
        content = subsection_match[1].strip()
        subsections.append({'title': title, 'text': content})
    return subsections


def extract_latex_content(latex_text, file_name):
    # Strip simple font commands and comment characters.
    latex_text = re.sub(r'\\(bf|em|Large|sc|LARGE|ls|normalsize)', '', latex_text)
    latex_text = re.sub(r'%', '', latex_text)
    # Drop everything from the bibliography onwards (references are not kept).
    index = latex_text.find("\\begin{thebibliography}")
    if index != -1:
        latex_text = latex_text[:index]
    index = latex_text.find("\\begin{references}")
    if index != -1:
        latex_text = latex_text[:index]

    title_pattern = r'\\title\s*\{([\s\S]*?)\\'
    title_pattern2 = r'\\title\s*\{([\s\S]*?)\}'
    author_pattern = r'\\author\s*\{([\s\S]*?)\}'
    author_pattern2 = r'\\author\s*\{([\s\S]*?)\\'
    abstract_pattern = r'\\begin\s*\{abstract\}([\s\S]*?)\\end\{abstract\}'
    abstract_pattern2 = r'\s*\wbstract\}([\s\S]*?)(?=(?:\\)|\Z)'
    abstract_pattern3 = r'\\abstract\s*\{([\s\S]*?)\}'
    abstract_pattern4 = r'\\begin\{center\}\s*\{\s*Abstract\s*\}\s*\\end\{center\}\s*([\s\S]*?)\\'

    extracted_content = {}

    # id: taken from the file name, e.g. cond-mat0001001.gz -> cond-mat0001001
    id_pattern = r'(.+)\.gz'
    id_match = re.search(id_pattern, file_name)
    if id_match:
        extracted_content['id'] = id_match.group(1)
    else:
        extracted_content['id'] = ''

    # category: the alphabetic prefix of the file name, e.g. cond-mat
    category_pattern = r'([A-Za-z-]+)\d+\.gz'
    category_match = re.search(category_pattern, file_name)
    if category_match:
        extracted_content['category'] = category_match.group(1)
    else:
        extracted_content['category'] = ''

    # title: try both patterns and keep the longer match
    latex_text_tmp = re.sub(r'\\\\', '', latex_text)
    title_match = re.search(title_pattern, latex_text_tmp)
    title_match2 = re.search(title_pattern2, latex_text_tmp)
    if title_match and title_match2:
        extracted = re.sub(r'[^A-Za-z\s.,-]', '', title_match.group(1).strip())
        extracted2 = re.sub(r'[^A-Za-z\s.,-]', '', title_match2.group(1).strip())
        extracted_content['title'] = extracted if len(extracted) > len(extracted2) else extracted2
    elif title_match:
        extracted_content['title'] = re.sub(r'[^A-Za-z\s.,-]', '', title_match.group(1).strip())
    elif title_match2:
        extracted_content['title'] = re.sub(r'[^A-Za-z\s.,-]', '', title_match2.group(1).strip())
    else:
        extracted_content['title'] = ''

    # author: same strategy as the title
    author_match = re.search(author_pattern, latex_text)
    author_match2 = re.search(author_pattern2, latex_text)
    if author_match and author_match2:
        extracted = re.sub(r'[^A-Za-z\s.,-]', '', author_match.group(1).strip())
        extracted2 = re.sub(r'[^A-Za-z\s.,-]', '', author_match2.group(1).strip())
        extracted_content['author'] = extracted if len(extracted) > len(extracted2) else extracted2
    elif author_match:
        extracted_content['author'] = re.sub(r'[^A-Za-z\s.,-]', '', author_match.group(1).strip())
    elif author_match2:
        extracted_content['author'] = re.sub(r'[^A-Za-z\s.,-]', '', author_match2.group(1).strip())
    else:
        extracted_content['author'] = ''

    # abstract: try the patterns in order until one matches
    abstract_match = re.search(abstract_pattern, latex_text, re.DOTALL)
    abstract_match2 = re.search(abstract_pattern2, latex_text, re.DOTALL)
    abstract_match3 = re.search(abstract_pattern3, latex_text, re.DOTALL)
    abstract_match4 = re.search(abstract_pattern4, latex_text, re.DOTALL)
    if abstract_match:
        extracted_content['abstract'] = abstract_match.group(1).strip()
    elif abstract_match2:
        extracted_content['abstract'] = abstract_match2.group(1).strip()
    elif abstract_match3:
        extracted_content['abstract'] = abstract_match3.group(1).strip()
    elif abstract_match4:
        extracted_content['abstract'] = abstract_match4.group(1).strip()
    else:
        extracted_content['abstract'] = ''
    if len(extracted_content['abstract']) < 10:
        extracted_content['abstract'] = ''

    # section and subsection
    section_pattern = r'\\section\*?\s*\{([\s\S]*?)\}([\s\S]+?)(?=(?:\\section)|\Z)'
    section_matches = re.findall(section_pattern, latex_text, re.DOTALL)
    sections = []
    for section_match in section_matches:
        title = section_match[0].strip()
        if "\\" in title:
            continue
        content = section_match[1].strip()
        content = content.encode('unicode_escape').decode('utf-8', errors='ignore')
        index = content.find("\\x0")
        if index != -1:
            content = content[:index]
        subsections = extract_subsection(content)
        if "subsection" in content:
            content = ''
        sections.append({'title': title, 'text': content, 'subsection': subsections})
    if sections == []:
        # No \section found: keep everything after the abstract as one block.
        start = latex_text.find("\\end{abstract}")
        if start != -1:
            content = latex_text[start + len("\\end{abstract}"):]
            sections.append({'title': '', 'text': content, 'subsection': []})
    extracted_content['section'] = sections
    return extracted_content
```
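For context, a minimal driver sketch (not the original script) showing how `extract_latex_content` can be applied to the per-paper gz files; the `./arxiv_gz/` and `./arxiv_json/` paths are hypothetical, and the files are read with `ISO-8859-1` as noted above.
```python=
import os
import glob
import gzip
import json

IN_DIR = "./arxiv_gz"     # hypothetical: per-paper .gz sources
OUT_DIR = "./arxiv_json"  # hypothetical: one cleaned .json per paper
os.makedirs(OUT_DIR, exist_ok=True)

for gz_path in glob.glob(os.path.join(IN_DIR, "*.gz")):
    file_name = os.path.basename(gz_path)
    # ISO-8859-1 avoids the decode errors that utf-8 raises on some sources.
    with gzip.open(gz_path, 'rt', encoding='ISO-8859-1') as f:
        latex_text = f.read()
    record = extract_latex_content(latex_text, file_name)
    out_path = os.path.join(OUT_DIR, file_name.replace(".gz", ".json"))
    with open(out_path, "w") as out:
        json.dump(record, out, ensure_ascii=False, indent=4)
```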
## Data Filter
Almost every paper carries one of the categories shown on the arXiv site, and the category appears in the file name (exception: a few papers have no category, reason unknown).
https://arxiv.org/category_taxonomy
1. Filter out the electricity-related papers (see the filtering/statistics sketch after the format example below)
* File names containing `eess`, `cond-mat`, or `physics` are treated as electricity-related categories
* There are `2096748` papers in total, of which `86035` are electricity-related
2. Count papers and tokens
* Summarize the filtered electricity-related papers into the JSON format below
* `each_file_token`: the token count of each field for every individual paper
* `each_file_token_total`: the total token count of each field across all papers
* `total`: the number of papers and the grand total of tokens summed over all fields of all papers
* Token counts are computed as follows
```python=
import tiktoken

# `string` is the text of one field; tokens are counted with cl100k_base.
encoding = tiktoken.get_encoding("cl100k_base")
num_tokens = len(encoding.encode(string))
```
* Total token count: `3133613908` ( "category": `233005`, "title": `1094361`, "author": `1068058`, "abstract": `17482189`, "section": `3113736295` )
* ==Update v2== total token count: `1081346298` ( "category": `233005`, "title": `1095132`, "author": `1068929`, "abstract": `17494974`, "section": `1061454258` )
* ==Update v3== total token count: `1080476015` ( "id": `491110`, "category": `233005`, "title": `1081319`, "author": `1064310`, "abstract": `13022915`, "section": `1064583356` )
* The format is as follows
```json=
{
    "each_file_token": [
        {
            "file": "physics0610057.json",
            "id": 4,
            "category": 1,
            "title": 19,
            "author": 5,
            "abstract": 453,
            "section": 22435
        },
        {
            "file": "cond-mat0001001.json",
            "id": 6,
            "category": 3,
            "title": 12,
            "author": 3,
            "abstract": 240,
            "section": 14532
        },
        {
            "file": "cond-mat0608474.json",
            "id": 6,
            "category": 3,
            "title": 10,
            "author": 5,
            "abstract": 118,
            "section": 5260
        }
    ],
    "each_file_token_total": {
        "id": 16,
        "category": 7,
        "title": 41,
        "author": 13,
        "abstract": 811,
        "section": 42227
    },
    "total": {
        "file": 3,
        "token": 43115
    }
}
```
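A minimal sketch of the filtering and token counting described above (not the original script), assuming the cleaned per-paper JSON files live in a hypothetical `./arxiv_json/` directory; nested fields are serialized to a string before counting, which may differ slightly from how the original script counts them.
```python=
import os
import glob
import json
import tiktoken

IN_DIR = "./arxiv_json"   # hypothetical: one cleaned .json per paper
KEYWORDS = ("eess", "cond-mat", "physics")   # treated as electricity-related
FIELDS = ("id", "category", "title", "author", "abstract", "section")

encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(value):
    # Nested structures (the "section" list) are serialized before counting.
    if not isinstance(value, str):
        value = json.dumps(value, ensure_ascii=False)
    return len(encoding.encode(value))

stats = {"each_file_token": [],
         "each_file_token_total": {f: 0 for f in FIELDS},
         "total": {"file": 0, "token": 0}}

for path in glob.glob(os.path.join(IN_DIR, "*.json")):
    name = os.path.basename(path)
    if not any(k in name for k in KEYWORDS):
        continue   # keep only the electricity-related categories
    with open(path) as f:
        paper = json.load(f)
    entry = {"file": name}
    for field in FIELDS:
        n = count_tokens(paper.get(field, ""))
        entry[field] = n
        stats["each_file_token_total"][field] += n
        stats["total"]["token"] += n
    stats["each_file_token"].append(entry)
    stats["total"]["file"] += 1

with open("token_stats.json", "w") as out:
    json.dump(stats, out, ensure_ascii=False, indent=4)
```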
## Issues encountered
* Problem: the file sizes are unreasonable
>More than 2 million arXiv papers (not preprocessed): 3.2 TB total
More than 2 million arXiv papers (preprocessed): 3.2 TB total
More than 80 thousand filtered electricity-related papers (preprocessed): 7 GB total
* Finding: some papers produce large amounts of garbage characters (such as `\u0000`), which makes those files excessively large

* Fix: use [RedPajama's arxiv_cleaner.py](https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/arxiv/arxiv_cleaner.py) to clean the papers that produce heavy garbage, while all other papers are still cleaned with the original custom code
* Result v2
>More than 2 million arXiv papers (not preprocessed): 3.2 TB total
More than 2 million arXiv papers (preprocessed): 489 GB total
More than 80 thousand filtered electricity-related papers (preprocessed): 3.3 GB total
### ⚠ New finding (an alternative fix that works better)
* Finding: the heavy garbage output (such as `\u0000`) actually comes from the gz archives bundling many files that are not LaTeX sources; reading those files produces the garbage characters
* Fix: modify the original custom code to read only the .tex files; since this alone solves the garbage problem, RedPajama's arxiv_cleaner.py is no longer used
```python=
import gzip
import tarfile

# file_path points at one downloaded per-paper .gz file.
latex_content = ''
with gzip.open(file_path, 'rt', encoding='ISO-8859-1') as f:
    try:
        # Most downloads are gzipped tarballs; read only the .tex members.
        with tarfile.open(file_path) as sub_tf:
            for member in sub_tf.getmembers():
                if member.name.endswith(".tex"):
                    extracted = sub_tf.extractfile(member)
                    if extracted is None:
                        continue
                    latex = extracted.read()
                    try:
                        latex = latex.decode('utf-8')
                    except UnicodeDecodeError:
                        latex = latex.decode('latin-1', errors='replace')
                    latex_content += latex
    except tarfile.ReadError:
        # Not a tarball: the gz holds a single LaTeX source, read it directly.
        latex_content = f.read()
```
* Result v3
>More than 2 million arXiv papers (not preprocessed): 3.2 TB total
More than 2 million arXiv papers (preprocessed): 115 GB total
More than 80 thousand filtered electricity-related papers (preprocessed): 3.3 GB total
## Merge json files
Merge the 80-thousand-plus filtered electricity-related JSON files into a single JSONL file.
* Code
```python=
import os
import glob
import json

if __name__ == "__main__":
    folder_path = os.getcwd()
    target_dir = os.path.join(folder_path, "electricity")
    assert os.path.isdir(target_dir)
    files = glob.glob(os.path.join(target_dir, "*.json"))
    print(files)
    # Write one JSON object per line (JSONL).
    with open("electricity.jsonl", "w") as out:
        for file in files:
            with open(file, "r") as f:
                data = json.load(f)
            out.write(json.dumps(data, ensure_ascii=False) + "\n")
    print("Finish")
```
* The format is as follows
```jsonl=
{"id": "cond-mat0002097", "category": "cond-mat", "title": "Charge localization and phonon spectra in hole doped LaNiO", "author": "R. J. McQueeney, A. R. Bishop, and Ya-Sha Yi", "abstract": "The in-plane oxygen vibrations in La$_{2}$NiO$_{4}$ are investigated for \nseveral hole-doping concentrations both theoretically and experimentally via \ninelastic neutron scattering. Using an inhomogeneous Hartree-Fock plus RPA \nnumerical method in a two-dimensional Peierls-Hubbard model, it is\nfound that the doping induces stripe ordering of localized charges,\nand that the strong electron-lattice coupling causes the in-plane \noxygen modes to split into two subbands. This result\nagrees with the phonon band splitting observed by inelastic neutron \nscattering in La$_{2-x}$Sr$_{x}$NiO$_{4}$.\nPredictions of strong electron-lattice coupling in La$_{2}$NiO$_{4}$,\nthe proximity of both oxygen-centered and nickel-centered charge\nordering, and the relation between charged stripe ordering and the\nsplitting of the in-plane phonon band upon doping are emphasized.", "section": [{"title": "Appendixes", "text": "", "subsection": []}]}
{"id": "cond-mat0004401", "category": "cond-mat", "title": "Particle dynamics in sheared granular matter", "author": "W. Losert, L. Bocquet,, T.C. Lubensky, \nand J.P. Gollub,", "abstract": "The particle dynamics and shear forces of granular\nmatter in a Couette geometry are determined experimentally. \nThe normalized tangential velocity $V(y)$ declines strongly with distance\n$y$ from the moving wall,\nindependent of the shear rate and of the shear dynamics.\nLocal RMS velocity fluctuations \n$\\delta V(y)$\nscale with the local velocity gradient to the power $0.4 \\pm 0.05$. \nThese results agree with a locally Newtonian, \ncontinuum model, where the granular medium is assumed to behave as a \nliquid with a local temperature $\\delta V(y)^2$ and density dependent\nviscosity.", "section": [{"title": "Acknowledgments", "text": "We thank A. Liu, H. Jaeger and C. Bizon for helpful discussions.\\nPart of this work was supported by the National Science Foundation under \\nGrant DMR-9704301", "subsection": []}]}
```
>**⚠ Some fields being blank is expected, for the following reasons:**
>1. A small number of papers use inconsistent formatting, so that information cannot be extracted.
>2. The paper simply never contained that field's information.
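As a quick sanity check, a minimal sketch (assuming the merged file is named electricity.jsonl as above) that reads the JSONL back and counts how many records leave each field blank:
```python=
import json
from collections import Counter

blank_counts = Counter()
total = 0
with open("electricity.jsonl") as f:
    for line in f:
        paper = json.loads(line)
        total += 1
        for field in ("id", "category", "title", "author", "abstract", "section"):
            if paper.get(field, "") in ("", []):
                blank_counts[field] += 1

print(f"{total} papers merged")
print("blank fields:", dict(blank_counts))
```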
## Supplementary notes
ArXiv metadata set
https://github.com/mattbierbaum/arxiv-public-datasets
A community-curated arXiv dataset (JSON) is available, but it only contains the fields below and not the full paper text.
* The format is as follows
```json=
{
    "id": ArXiv ID,
    "submitter": who submitted the paper,
    "authors": the paper's authors,
    "title": the paper's title,
    "comments": additional information, e.g. number of pages and figures,
    "journal-ref": information about the journal the paper was published in,
    "doi": https://www.doi.org,
    "abstract": the paper's abstract,
    "categories": categories / tags in the arXiv system,
    "versions": version history
}
```