# PyCon TW 2016 Collaborative Talk Notes <br> Day 3 - R1

> ### Quick Links
> - [Portal for Collaboration Notes](https://hackfoldr.org/pycontw2016) (hosted by [hackfoldr](https://hackfoldr.org/about) and [HackMD](https://hackmd.io/))
> - [Program Schedule](https://tw.pycon.org/2016/events/talks/)
> - [PyCon TW 2016 Official Site](https://tw.pycon.org/2016/)
>
> ### How to update this note?
> - Everyone can *freely* update this note.
> - Please respect all the participants and follow our [code of conduct](https://tw.pycon.org/2016/about/code-of-conduct/) during discussion and note-taking.

## Talk: How to Build a Keyword Wizard (如何打造關鍵字精靈)

- Info: https://tw.pycon.org/2016/en-us/events/talk/57694625669840919/
- Slides: http://www.slideshare.net/ssuser05afc89/how-to-build-an-keyword-wizard
- Speaker: 施晨揚

#### What is a keyword?

A word or phrase that works as an indicator or identifier, and that also carries some specific meaning.

#### Why do we need them?

- Advertisement
- Tag
- Relation
- Article Summary

#### Word Relation Model

Relevance search

* Model 1 (related words): Okinawa (沖繩) -> hotel (飯店), independent travel (自由行), recommendation (推薦)
* Model 2 (synonyms): Okinawa (沖繩) -> Ryukyu (琉球), Tsuboya Street (壺屋通), ...

#### Word Representation - Vector Space Model

Map words and articles into a multi-dimensional vector space.
This shows which articles, or which words, are related to each other.

#### One Hot vs. Continuous Value

When the dimensionality is very high, it is hard to tell which words are similar.

#### Word Representation - One Hot Representation

The simplest method is One Hot Representation:
first build a `One Hot Index` for every word.
However, this encoding cannot capture the correlation between words, so relations are hard to find.

#### Word Representation - Context Vector

In the example, a table is built with words on both the X and Y axes,
and the probability that two words appear together is used to measure how related they are.
Ex. Okinawa (沖繩) vs. snorkeling (浮潛) = 0.7, Okinawa (沖繩) vs. restaurant (餐廳) = 0.1

#### Word Context Vector

Talking about ramen (拉麵) -> 美味しい ("delicious")
Talking about Ichiran (一蘭) -> 喔依西捏 ("oishii ne", "delicious, isn't it")
The two words can then be linked together.

> But Ichiran and Akasaka really aren't that tasty
> Don't be like that XD

#### Co-occurrence Matrix

If the vocabulary is large (n ~= 500k), then space = n * n and time = n * n.

#### Word2Vec

word2vec = a two-layer neural network.
Given "I want to go to Okinawa ... diving" (我想要去沖繩 ... 潛水), the model should be able to predict `潛水` (diving) after seeing the preceding words.
Candidate words include: playing ball, diving, sleeping, ..., washing one's face (there may be many labels).
A neural network can be used to approximate this model.
[Reference](https://www.tensorflow.org/versions/r0.8/tutorials/word2vec/index.html)

#### Major Process Flow

1. Article Selection
2. Content Extraction
3. Word Cutting

#### Article Raw Data Preparation

Each article is a single line. Segment each article into words and separate the words with spaces.

#### Term Database

Collecting the term database

- Search Log
- Major e-commerce sites (e.g. Alibaba)
- Link1
- Link2
- http://baseterm.com/
- Input method dictionaries
- Term database cracking

#### Term Database - Search Log

`google search sole` → `search history` → `Filter & Counting` → `Term Collection`

#### Search Log

To build the term database from the search log, simple counting is enough:
once a query has accumulated enough occurrences, you know that `太陽的後裔` (Descendants of the Sun) is a new term.
Odd terms may slip in, though, so limit the term length.

#### Term Database

#### Word Cutting

- Word Cut Tool
- Jieba
- Get Bot Token

> +1 for Jieba, it works well

## Talk: First try for CAS, SymPy with codegen

- Info: https://tw.pycon.org/2016/en-us/events/talk/58534680193925150/
- Speaker: Chiu-Hsiang Hsu
- Slides: https://speakerdeck.com/wdv4758h/first-try-for-cas-sympy-with-codegen

### Introduction

+ `Sympy` helps with mathematical computation
+ Create symbols and expressions
+ `symplify` takes an expression directly and returns the result
+ `expand` expands expressions
+ `solve` solves equations
+ `lambdify` generates executable code
+ Can target backends in various languages, such as Fortran, NumPy...

Chebyshev approx...

> Didn't manage to write this part down orz

### SymEngine

+ SymPy written in C++

## Talk: Geo processing with Python: How to convert, clean, aggregate and compress your geo-data for web

- Info: https://tw.pycon.org/2016/en-us/events/talk/69816036404232254/
- Speaker: Juha Suomalainen

### About

- Architect/engineer at Wiredcraft
- Mainly works on geo data visualization

### What drives me

- Visualizations
- UX
- Engineering

Wants to make everything simple and good-looking.

### Projects working on

- CO2 visualization
- data.worldbank.org
- Flood Risk

### Technologies

#### Geo Data Format

- shapefiles (GIS format)
  - dbf: attribute table
  - prj: coordinate system
  - shp: main entry point (shapes)
  - shx: index file

#### Shape formats for WEB

- geojson (simple, standard json) - https://github.com/geojson/draft-geojson
- topojson (more compact, border sharing) - https://github.com/mbostock/topojson

#### Tools

- QGIS (desktop app)
- Geojson.io (web app)
- mapshaper.org (feature simplification)
- mapbox.com (basemap creation)
- js:
  - leaflet.js
  - mapbox.js (proprietary, speaker has had a good user experience)
  - d3.js (customisable, low-level APIs) - see http://www.taiwanstat.com/ for examples

### Simple approach

1. shapefiles and api
2. data processor
3. geojson / json
4. webapp

#### Frontend

- load data
- basemap
- style the features
- create the ranges

#### Common pitfalls

- Data encoding
- Coordinate systems
- Check the mappings

#### Optimizing for web

* file size is critical
* use [topojson](https://github.com/mbostock/topojson) to save space
* simplify the features with [mapshaper](http://www.mapshaper.org/)

#### Optimizing choropleth

* play with border styling
* make it interactive
* try different color schemes

#### How to get data?

County open data OR JSON API

* [Natural Earth](http://www.naturalearthdata.com/downloads/)
* [GADM](http://www.gadm.org)
* [World bank open data api](http://data.worldbank.org/developers?display=)
* [TGOS](http://tgos.nat.gov.tw/tgos/web/tgos_home.aspx)
* [NGIS](http://ngis.nat.gov.tw/)

#### Resources

* Formats: Shapefiles, GeoJSON, TopoJSON
* Python packages: pyshp, geojson, topojson
* Sites: the ones introduced earlier
* Tools: QGIS
* [Wiredcraft blog](https://wiredcraft.com/blog)

#### Content on UX

* interactivity
* colors + styling
* usability with mobile devices
* talk to the users

#### Example

* Continues the simple approach above
* Adapted from a mapbox tutorial (?)
* mapbox reads the geojson generated by Python
* `getColor` picks a color based on a score
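
A minimal sketch of the "data processor" step and of the score-to-color idea behind `getColor`, not the speaker's actual code. It uses only the standard library. The thresholds, hex colors, region names, scores, and coordinates are all made-up assumptions, and `get_color` / `to_feature` are hypothetical helper names (in the talk's demo the coloring presumably happens in the frontend).

```python
import json

# Hypothetical score thresholds and hex colors, checked from highest to lowest.
COLOR_RANGES = [
    (80, "#800026"),
    (60, "#E31A1C"),
    (40, "#FD8D3C"),
    (20, "#FED976"),
]
DEFAULT_COLOR = "#FFEDA0"


def get_color(score):
    """Return the color of the first range whose lower bound the score exceeds."""
    for lower_bound, color in COLOR_RANGES:
        if score > lower_bound:
            return color
    return DEFAULT_COLOR


def to_feature(name, score, ring):
    """Build one GeoJSON Feature with a single-ring Polygon geometry."""
    return {
        "type": "Feature",
        "properties": {"name": name, "score": score, "color": get_color(score)},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


# Made-up sample data: (name, score, closed ring of [lon, lat] pairs).
regions = [
    ("A", 85, [[121.0, 25.0], [121.5, 25.0], [121.5, 25.5], [121.0, 25.0]]),
    ("B", 15, [[120.0, 23.0], [120.5, 23.0], [120.5, 23.5], [120.0, 23.0]]),
]

collection = {
    "type": "FeatureCollection",
    "features": [to_feature(*region) for region in regions],
}

# Serialized text that a web map could load as its data layer.
geojson_text = json.dumps(collection)
```

Storing the color in each feature's `properties` keeps the frontend styling step trivial; computing it in the browser from the raw score instead would make it easier to switch color schemes without regenerating the file.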