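How to Scrape Cadastral Data?
===

# Introduction

This program is a batch crawler for the [地號-GeoJSON API](https://twland.ronny.tw/) website. The site offers a simple "one parcel number in, one GeoJSON out" lookup; this program builds on it to retrieve many parcels in a single run. It suits users who need a fairly large, non-real-time set of cadastral data covering a whole land section or township (see Note 2). The resulting GeoJSON file can be loaded into most GIS software and converted to other formats (such as SHP or KML), which makes it a very easy data format to work with.

*Note 1: This program was co-developed by 顧永平, a student in our department, and 陳柏衡, a student in the Department of Computer Science and Information Engineering at National Taiwan University.*

*Note 2: The data source dates from 2016 and may have changed since then because of section re-planning, land consolidation, or subdivision.*

## Environment

Python 3.7

Modules: requests, json, pandas, tqdm

## Preparation

On Windows, VS Code is recommended as the editor. Create a new folder containing:

1. This program file
2. A CSV file of the cadastral data

**Name the five leftmost columns of the header row, in this order: 縣市 (county/city), 行政區 (district), 地段 (land section), 地號 (parcel number), 公告地價 (announced land value)**

The 地號 column uses an eight-digit format: the first four digits are the parent number and the last four are the sub-number. For a parcel numbered a-b, enter a×10000+b (leading zeros may be omitted). The header of the 公告地價 column must be exactly 「公告地價」; its contents do not need to be real values, and any plain integer will do.

As an example, prepare a CSV file that fetches the following four parcels in one run:

1. 高雄市鼓山區鼓南段一小段1地號
2. 高雄市三民區雄中段93地號
3. 臺南市東區東光段527-2地號
4. 高雄市苓雅區林德官段二小段3347地號

* The finished CSV file should look like the table below:

| 縣市   | 行政區 | 地段           | 地號     | 公告地價 |
| ------ | ------ | -------------- | -------- | -------- |
| 高雄市 | 鼓山區 | 鼓南段一小段   | 10000    | 88888    |
| 高雄市 | 三民區 | 雄中段         | 930000   | 66666    |
| 臺南市 | 東區   | 東光段         | 5270002  | 555      |
| 高雄市 | 苓雅區 | 林德官段二小段 | 33470000 | 5555     |

*Note 3: If the crawl returns nothing (fails), first query the same parcel on the source site, [地號-GeoJSON API](https://twland.ronny.tw/), following its input rules, to check whether the parcel exists in its database. If the site cannot display it, that parcel's cadastral data cannot be fetched with this crawler.*

*Note 4: Chiayi City is a special case in how land administration is divided: it has no "East/West District" split in this data, so enter 「嘉義市」 in the second column (行政區).*

# How to Use the Crawler

First open the crawler program, then open a new terminal for entering commands. (Shortcut: Ctrl+Shift+\`)

### 1. Load the required modules

```
import requests
import json
import pandas as pd
from tqdm import tqdm
```

If this is your first time running a Python crawler, install the modules first by typing `pip install <module name>` in the terminal. (`json` is part of the Python standard library and does not need to be installed.)

### 2. Simple interactive prompts: enter the input and output file names

First use the `cd` command to move to the folder that contains the program. For example, if the program and the prepared CSV file are on drive D, enter: `cd D:\`

Next, run the program from the editor's terminal. If the program file is named `crawl.py`, enter `python .\crawl.py`

※HINT: After typing `python`, press TAB to cycle through file names.

Once the program starts, the terminal should show the following prompts:

`enter input file name :` (type the name, then press Enter)

`enter output file name :`

For the input, enter the name of the CSV file you prepared, e.g. **高雄.csv**. For the output, enter the name of the JSON file you want to produce, e.g. **高雄.json**. Press Enter, and once the progress bar finishes the output file will appear in your folder.

```
input_file = input("enter input file name : ")
output_file = input("enter output file name : ")
```

### 3. Output data format: the required parts of the GeoJSON structure

This code predefines a GeoJSON skeleton so that the downloaded features fit its format.

```
output = {
    "type": "FeatureCollection",
    "name": output_file,
    "crs": {
        "type": "name",
        "properties": {"name": "urn:ogc:def:crs:OGC:1.3:CRS84"}
    },
    "features": []
}
```

### 4. Specify the address of the cadastral site to crawl

This line sets the site the crawler targets. Many thanks to ronny for developing this site.

```
url = 'https://twland.ronny.tw/index/search?'
```

### 5. Read the local CSV and build the query sent to the target site

```
# Read every column as text; the input is a CSV file, so use read_csv.
raw_datas = pd.read_csv(input_file, dtype=str)
datas = raw_datas.values.tolist()
count = 0
for data in tqdm(datas):
    # data[:-2] = county, district, section; data[-2] = encoded parcel number
    num = int(data[-2])
    if num % 10000 == 0:
        # No sub-number: "a號"
        url += 'lands[]=' + ''.join(data[:-2]) + str(num // 10000) + '號' + '&'
    else:
        # With sub-number: "a-b號"
        url += 'lands[]=' + ''.join(data[:-2]) + str(num // 10000) + '-' + str(num % 10000) + '號' + '&'
```

### 6. Set the fetch batch size

The default is 30 parcels per request; values between 10 and 100 are recommended. This block continues the body of the `for` loop from step 5.

```
    count += 1
    if count == 30:
        count = 0
        url = url[:-1]  # drop the trailing '&'
        r = requests.get(url)
        output['features'] += r.json()['features']
        url = 'https://twland.ronny.tw/index/search?'
```

### 7. Finish the run

After the loop ends, send any parcels left over from the last incomplete batch, then write the result to the output file.

```
if url != 'https://twland.ronny.tw/index/search?':
    url = url[:-1]
    r = requests.get(url)
    output['features'] += r.json()['features']
```

```
# print(output)
with open(output_file, "w") as f:
    json.dump(output, f)
```

### 8. Full code

```
import requests
import json
import pandas as pd
from tqdm import tqdm

input_file = input("enter input file name : ")
output_file = input("enter output file name : ")

# GeoJSON skeleton that collects every fetched feature
output = {
    "type": "FeatureCollection",
    "name": output_file,
    "crs": {
        "type": "name",
        "properties": {"name": "urn:ogc:def:crs:OGC:1.3:CRS84"}
    },
    "features": []
}

url = 'https://twland.ronny.tw/index/search?'

# Read every column as text; the input is a CSV file, so use read_csv.
raw_datas = pd.read_csv(input_file, dtype=str)
datas = raw_datas.values.tolist()

count = 0
for data in tqdm(datas):
    # print(','.join(data))
    # data[:-2] = county, district, section; data[-2] = encoded parcel number
    num = int(data[-2])
    if num % 10000 == 0:
        url += 'lands[]=' + ''.join(data[:-2]) + str(num // 10000) + '號' + '&'
    else:
        url += 'lands[]=' + ''.join(data[:-2]) + str(num // 10000) + '-' + str(num % 10000) + '號' + '&'
    count += 1
    if count == 30:
        # Send one request per batch of 30 parcels, then reset the URL
        count = 0
        url = url[:-1]
        r = requests.get(url)
        output['features'] += r.json()['features']
        url = 'https://twland.ronny.tw/index/search?'

# Send whatever remains from the last incomplete batch
if url != 'https://twland.ronny.tw/index/search?':
    url = url[:-1]
    r = requests.get(url)
    output['features'] += r.json()['features']

# print(output)
with open(output_file, "w") as f:
    json.dump(output, f)
```

###### tags: `PIC_Gadget`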
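
The parcel-number rule from the preparation section (a×10000+b) is easy to get wrong by hand, so it may help to check values before filling in the CSV. The sketch below is only an illustration: `encode_parcel` and `format_land` are hypothetical helper names that are not part of the original crawler, and `format_land` simply mirrors how the crawler's loop turns one CSV row back into the text it queries.

```python
# Hypothetical helpers (not part of the original crawler) that illustrate
# the a*10000+b parcel-number encoding used in the CSV.

def encode_parcel(parent: int, sub: int = 0) -> int:
    """Encode parcel number 'parent-sub' as parent * 10000 + sub."""
    if not 0 <= sub < 10000:
        raise ValueError("sub-number must be between 0 and 9999")
    return parent * 10000 + sub


def format_land(county: str, district: str, section: str, code: int) -> str:
    """Rebuild the query text for one CSV row, as the crawler's loop does."""
    parent, sub = divmod(code, 10000)
    number = str(parent) if sub == 0 else f"{parent}-{sub}"
    return f"{county}{district}{section}{number}號"


if __name__ == "__main__":
    code = encode_parcel(527, 2)
    print(code)  # 5270002
    print(format_land("臺南市", "東區", "東光段", code))  # 臺南市東區東光段527-2號
```

Running the helpers over the four sample parcels is a quick way to confirm that each 地號 value in the CSV decodes back to the parcel you intended.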