# Building an Inverted Index with Pyserini - Information Retrieval (HW3)

## Assignment requirements

Below is the assignment description given by the instructor:

* Use PubMed 200k (The PubMed 200k RCT dataset is described in Franck Dernoncourt, Ji Young Lee. PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts. International Joint Conference on Natural Language Processing (IJCNLP). 2017.). The dataset can be downloaded from https://www.dropbox.com/s/miyb2awm2esrcpk/pubmed%20220%20train.txt?dl=0.
* Use Pyserini (follow the instructions at https://github.com/castorini/pyserini#how-do-i-index-and-search-my-own-documents) to build an article-level inverted index.
* Example: the input query "cancer" returns: {id, contents}
* Requirement: compare the time taken by brute-force search (i.e., linear scan) with the time taken when retrieving by inverted index.
* Submit: source code

## Installation issues

### No module named 'faiss'

This was the most troublesome part. My machine has an M1 Pro CPU, and for reasons I could not figure out, several installation methods failed. In the end, the following command solved the problem:

```
pip install faiss-cpu --no-cache-dir
```

### Java

Pyserini calls Java under the hood, so Java has to be installed as well.

## Implementation steps

The instructor asked for two things:

1. Brute-force search
2. Inverted index (Pyserini)

### Converting the file to the format Pyserini expects

Pyserini can build the inverted index for you, and it accepts three file layouts. First, each article and its id must be put into the following format:

```
{
  "id": "doc1",
  "contents": "this is the contents."
}
```

For how to arrange the files, see the three options described on its GitHub page:

1. Folder with each JSON in its own file, like [this](https://github.com/castorini/pyserini/tree/master/tests/resources/sample_collection_json).
1. Folder with files, each of which contains an array of JSON documents, like [this](https://github.com/castorini/pyserini#how-do-i-index-and-search-my-own-documents).
1. Folder with files, each of which contains a JSON on an individual line, like [this](https://github.com/castorini/pyserini/tree/master/tests/resources/sample_collection_jsonl) (often called JSONL format).
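For comparison, the third layout (JSONL) could be produced with a few lines of Python. This is only a minimal sketch: the helper name `write_jsonl`, the sample documents, and the output location in the system temp folder are made up for illustration and are not part of Pyserini or of the assignment.

```python
import json
import os
import tempfile


def write_jsonl(docs, path):
    """Write one JSON object per line (the JSONL layout)."""
    with open(path, "w", encoding="utf-8") as out:
        for doc in docs:
            out.write(json.dumps(doc, ensure_ascii=False) + "\n")


# Hypothetical sample documents in the {id, contents} format
sample_docs = [
    {"id": "doc1", "contents": "this is the contents."},
    {"id": "doc2", "contents": "another document."},
]

# Hypothetical output path; any writable location works
jsonl_path = os.path.join(tempfile.gettempdir(), "sample_docs.jsonl")
write_jsonl(sample_docs, jsonl_path)
```

With this layout the whole collection lives in a single file instead of one file per article, which keeps the number of files on disk small.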
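I am not very familiar with text-based file handling, so I chose the first option, which is the simplest. The code is as follows:

``` python
import json
import os

input_path = "./train.txt"  # placeholder: path to the downloaded dataset file
os.makedirs("./data", exist_ok=True)  # the output folder must exist before writing

doc_id = ""
contents = ""
count = 0
with open(input_path, "r", encoding="utf-8") as f:
    for line in f:
        if line.startswith("###"):
            # A line starting with ### marks a new abstract and carries its id
            doc_id = line.strip().replace("#", "")
        elif line.isspace():
            # A blank line marks the end of the abstract: write it out as JSON
            output = {"id": doc_id, "contents": contents.replace("\t", " ").strip()}
            name = "./data/output" + str(count) + ".json"
            with open(name, "w", encoding="utf-8") as t:
                json.dump(output, t)
            count += 1
            # Reset the state for the next abstract
            doc_id = ""
            contents = ""
        else:
            # Sentence line: append it, separated by a space
            contents += line.strip() + " "
```

Explanation: the file follows a regular pattern, with these key points:

* Every abstract starts with `###` followed by its id
* Every abstract ends with a blank line

So the code reads the txt file line by line. A line starting with `###` is treated as the start of an article, and the id is extracted there. The following lines are accumulated as the contents. When a blank line is found, the end of the article has been reached, and two things happen:

1. The id and contents are put into a dictionary and written out as JSON
2. The id and contents are cleared

### Building the inverted index

Once the previous step is done, you can build your own inverted index. For some reason, changing `--input` to my own file path caused an error, so I simply put my files into the path the example specifies (I just created an identical folder).

```
python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input tests/resources/sample_collection_jsonl \
  --index indexes/sample_collection_jsonl \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw
```

If you see output like the following, the indexing succeeded:

```
2023-03-09 12:00:46,852 INFO [main] index.IndexCollection (IndexCollection.java:633) - Indexing Complete! 190,654 documents indexed
2023-03-09 12:00:46,852 INFO [main] index.IndexCollection (IndexCollection.java:634) - ============ Final Counter Values ============
2023-03-09 12:00:46,852 INFO [main] index.IndexCollection (IndexCollection.java:635) - indexed: 190,654
2023-03-09 12:00:46,853 INFO [main] index.IndexCollection (IndexCollection.java:636) - unindexable: 0
2023-03-09 12:00:46,853 INFO [main] index.IndexCollection (IndexCollection.java:637) - empty: 0
2023-03-09 12:00:46,853 INFO [main] index.IndexCollection (IndexCollection.java:638) - skipped: 0
2023-03-09 12:00:46,853 INFO [main] index.IndexCollection (IndexCollection.java:639) - errors: 0
2023-03-09 12:00:46,857 INFO [main] index.IndexCollection (IndexCollection.java:642) - Total 190,654 documents indexed in 00:01:01
```

### The time module

Use the `time` module to measure how long the code takes to run:

``` python
import time

start = time.time()
# code to be timed
end = time.time()
print("Elapsed time: %f seconds" % (end - start))
```

### Searching with the inverted index

Pyserini provides a search function that ranks the matching documents by a relevance score (BM25 by default). By default it returns the top 10 hits; this can be changed with the parameter `k`. I set `k` to a number larger than the whole collection, so every matching document is returned.

``` python
from pyserini.search.lucene import LuceneSearcher
import time

searcher = LuceneSearcher('indexes/sample_collection_jsonl')

start = time.time()
hits = searcher.search('cancer', k=300000)
end = time.time()

# for i in range(len(hits)):
#     print(f'{i+1:2} {hits[i].docid:4} {hits[i].score:.5f}')
print("Elapsed time: %f seconds" % (end - start))
```

### Brute-force implementation

``` python
import os
import json
import time

search_key = 'cancer'
folder_path = './tests/resources/sample_collection_jsonl'
matches = []  # documents whose contents contain the keyword

start = time.time()
for filename in os.listdir(folder_path):
    if filename.endswith('.json'):
        file_path = os.path.join(folder_path, filename)
        with open(file_path, 'r') as f:
            json_data = json.load(f)
        # Linear scan: check every document for the keyword
        if search_key in json_data['contents']:
            matches.append(json_data)
end = time.time()
print("Elapsed time: %f seconds" % (end - start))
```

Explanation: every JSON file in the folder is read in and checked for the keyword. Note that this is a case-sensitive substring match and that the measured time includes reading the files from disk, so the result set is not necessarily identical to what the Lucene index returns after tokenization.

###### tags: `Information Retrieval` `Homework` `Pyserini` `Inverted Index`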
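To make the difference between the two approaches concrete, here is a toy inverted index in plain Python. It is a minimal sketch under simplifying assumptions, not how Pyserini or Lucene is implemented: the `tokenize` function is a naive lowercase splitter, and the three sample documents and all function names are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a naive analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc in docs:
        for term in tokenize(doc["contents"]):
            index[term].add(doc["id"])
    return index


def linear_scan(docs, term):
    """Brute force: look through every document for each query."""
    return {doc["id"] for doc in docs if term in tokenize(doc["contents"])}


# Hypothetical sample documents in the {id, contents} format
docs = [
    {"id": "doc1", "contents": "Breast cancer screening trial."},
    {"id": "doc2", "contents": "Effect of exercise on blood pressure."},
    {"id": "doc3", "contents": "Lung cancer and smoking."},
]

index = build_inverted_index(docs)        # paid once, up front
by_index = index.get("cancer", set())     # a single dictionary lookup per query
by_scan = linear_scan(docs, "cancer")     # touches every document per query
print(sorted(by_index), sorted(by_scan))
```

The index costs time to build once, but afterwards each query is a dictionary lookup, whereas the linear scan has to visit all documents again for every query.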