Day 3 Chunghwa High-Income Project (Neo)

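---
title: Day 3 Chunghwa High-Income Project (Neo)
tags: 'Course notes'
description: View the slide with "Slide Mode".
---

# Day 3 Chunghwa High-Income Project (Neo)

[TOC]

## Chunghwa data

Goal: generate points (each queried with a 50 m radius) to hand to Chunghwa, so that Chunghwa can return the Adids of the people at those points for later linking with cookies and personal data.

- Neal: Tagging 2.0
- 呱吉: CTR optimization

```
import geopandas as gpd
import numpy as np
from shapely.geometry import Point
import math
import pyproj
import shapely
import shapely.ops as ops
from shapely.geometry.polygon import Polygon
from functools import partial
import pandas as pd
```

```
def sample_point(meter_coord, origin_coord):
    points = []
    # Bounding box: leftmost, lowest, rightmost, highest
    x_min, y_min, x_max, y_max = origin_coord.bounds
    # Chunghwa queries by circles of radius 50 m, so estimate roughly how many
    # r = 50 m circles cover the area; the *2 is a safety margin to sample a few extra
    n = int(np.ceil(meter_coord.area / (50 * 50 * np.pi)) * 2)
    i = 0
    # Points are drawn from the bounding box, so keep only those
    # that actually fall inside the area's polygon
    while i < n:
        point = Point(np.random.uniform(x_min, x_max), np.random.uniform(y_min, y_max))
        if origin_coord.contains(point):
            points.append(str(point))
            i += 1
    return points
```

```
# Load the data with geopandas
df = gpd.read_file("VILLAGE_MOI_1090915.shp", encoding='utf-8')
```

```
df.head()
```

Convert the polygon units to square metres

```
# Projection in metres, used for the area calculation.
# Note: EPSG:3395 is World Mercator, which inflates areas away from the equator;
# a local projected CRS such as EPSG:3826 (TWD97 / TM2 zone 121) would give areas
# closer to true square metres. Here the overestimate only yields extra sample points.
df["meter_coord"] = df['geometry'].to_crs('epsg:3395')
# Original coordinates (longitude/latitude)
df["origin_coord"] = df['geometry'].to_crs('epsg:4326')
# Keep a copy of the original
gdf = df.copy()
```

```
# Load the government income data
d = pd.read_csv("106_165-9.csv", encoding='utf-8')
```

```
d.head()
```

```
# Keep county/city, township/district, village, and median income
tax = d.copy().iloc[:, [0, 1, 2, 6]]
tax.columns = ['COUNTYNAME', 'TOWNNAME', 'VILLNAME', 'Median']
# Join the two datasets (so that each Chinese village name carries its TOWNNAME)
result = gdf.merge(tax, how='inner', on=['COUNTYNAME', 'TOWNNAME', 'VILLNAME'])
# Sort by median income, highest first
result.sort_values(by='Median', ascending=False, inplace=True, ignore_index=True)
```

```
# Keep areas whose median annual income is above 1 million
# (the threshold of 1000 implies that Median is recorded in thousands)
result = result.loc[result.Median > 1000, :].copy()
result.head()
```

```
# Use apply to generate the sampled points for each row
result['points'] = result.apply(lambda x: sample_point(x.meter_coord, x.origin_coord), axis=1)
result.head()
```

```
result.points[0]
```

```
result = result.loc[:, ["COUNTYNAME", "TOWNNAME", "VILLNAME", "points", "Median"]]
```

```
# Explode the list so that each point gets its own row
result = pd.DataFrame(result).explode('points')
```

```
result.shape
```

```
result.to_csv("tax_map.csv", index=False, encoding='utf-8')
```

### Distributed computing

意藍 (eLand) produces tags too slowly: a full refresh of all tags takes 7 days at the fastest.

- People × 21 days = 40 million records
- 40 million records × 60 tags = 2.4 billion records

Doing this with Spark (distributed computing) can be faster.

### Small projects Neo has done

### Neo QA

- For modelling, it is better to learn the architecture from books.
- For engineering, practice the whole pipeline from the very start to the very end.
- Short-term advice: learn how to set up an environment, build a model, and package it as an API.
- Next step: learn how to automate model updates and how to make the results usable by others.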
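
For reference, the two back-of-the-envelope calculations in these notes can be reproduced with the standard library alone: the number of points `sample_point` draws for a polygon of a given area, and the record volume from the distributed computing section. This is a minimal sketch; the helper names `circles_needed` and `tag_rows` and the 1 km² example area are hypothetical and not part of the original notebook.

```python
import math

RADIUS_M = 50  # query radius per point, taken from the notes


def circles_needed(area_m2, radius_m=RADIUS_M, safety_factor=2):
    """Mirror the count used in sample_point: ceil(area / circle area) * 2."""
    circle_area = math.pi * radius_m ** 2
    return math.ceil(area_m2 / circle_area) * safety_factor


def tag_rows(person_day_rows, n_tags):
    """Total rows when every person-day record is scored for every tag."""
    return person_day_rows * n_tags


# Hypothetical village of exactly 1 km^2 (1,000,000 m^2)
print(circles_needed(1_000_000))  # 256

# Figures from the notes: 40 million records x 60 tags
print(tag_rows(40_000_000, 60))  # 2400000000
```

The first figure shows why even a small village yields a few hundred sampled points, and the second is the 2.4 billion rows that motivate moving the tagging job to Spark.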