20180314 Meeting Notes
====

## Development roadmap

### "List the articles a user has submitted" feature

- Ref: https://beta.hackfoldr.org/cofacts/https%253A%252F%252Fhackmd.io%252Fs%252FH1gj8XOIG
- API: https://github.com/cofacts/rumors-api/pull/68 - MrOrz
- Website: https://github.com/cofacts/rumors-site/issues/85 - Gore

### Require a reason when submitting an article

* Ref: https://hackmd.io/s/SyivqlIrf#%E8%AC%A0%E8%A8%80%E5%9B%9E%E4%B8%8D%E5%AE%8C
* API: https://github.com/cofacts/rumors-api/pull/69 (PR in progress) - MrOrz
* LINE bot: https://github.com/cofacts/rumors-line-bot/issues/58 - ggm
* Website: https://github.com/cofacts/rumors-site/issues/86 -

### Similarity analysis for "Which of the following is the article you are looking for?"

> When the user answers "Which of the following is the article you are looking for?":
>
> - For users who chose 0: what was the highest similarity shown to them at the time?
> - For users who chose an article: what was the similarity of the article they picked?
>
> This also gives a hit rate: (chose an article / (chose an article + chose to submit a new one))

- Ref: https://beta.hackfoldr.org/cofacts/https%253A%252F%252Fhackmd.io%252Fs%252FH1gj8XOIG
- Similar articles: https://github.com/cofacts/rumors-line-bot/issues/53
- Hit-rate: 蝴蝶
- tune threshold: MrOrz

### Criteria for "similar articles"

- Ref: https://beta.hackfoldr.org/cofacts/https%253A%252F%252Fhackmd.io%252Fs%252FH1gj8XOIG - "Proposal 1: Add a similarity level for 'do not submit the article'"
- (ggm)
- Orz tunes the threshold first (see above); afterwards we re-evaluate whether near-duplicate articles still keep being submitted.

### Let editors change their nickname

- Ref: https://beta.hackfoldr.org/cofacts/https%253A%252F%252Fhackmd.io%252Fs%252FH1gj8XOIG
- API: https://github.com/cofacts/rumors-api/issues/64 - 蝴蝶
- Web: TBD - (蝴蝶)

### Landing page

- Ref: None. But it should be ready ASAP; there are conferences in April and May...
- Web: (Orz) Template: https://github.com/g0v/grants-landing-template (Sample: https://g0v.github.io/grants-landing-template/ )
    - No package.json, tech stack not compatible (jade + stylus + bootstrap)
    - Serve static files + use contents in hackfoldr
    - Add Google Translate gadget https://translate.google.com/manager/website/

### Build automation on Travis CI

I cleaned up the build process during the schema revamp. Because the new build process uses `--build-arg`, we can no longer use Docker Cloud's automated builds. However, we can still [use Travis to do the build instead](https://docs.travis-ci.com/user/docker/#Pushing-a-Docker-Image-to-a-Registry).

- API: TBD
- Website: TBD

## Designing URL preview mechanisms

### Reason

Many of the text messages contain URLs. Extracting text from these URLs enriches the data in our database, which increases the query matching rate and provides material for future machine learning work.

### Text summarization performance

https://docs.google.com/spreadsheets/d/1y1GGc04HBhpU76D6LvX5hqt877X_Rfatc1vSQNJmZ6Q/edit#gid=0

We should run a [puppeteer](https://github.com/GoogleChrome/puppeteer) instance in the API server so that JavaScript-rendered pages can be resolved.

### Implementation

No message queues are required. Use shell directly.

#### References

- https://github.com/cofacts/rumors-api/issues/41
- https://beta.hackfoldr.org/cofacts/https%253A%252F%252Fhackmd.io%252Fs%252FBJRJxopTb

### Proposal

#### New elasticsearch index "urls"

Each document holds a fetched record of a given URL. Fields:

- `url`: Exact URL found in the articles
- `canonicalUrl`: The canonical URL fetched from the page
- `title`: Title of the page
- `summary`: Summary text extracted using [Goose3](https://github.com/cofacts/rumors-api/issues/41#issuecomment-343528185)
- `html`: Raw HTML as fetched
- `topImageUrl`: Image URL for preview. Optional.
- `fetchedAt`: Timestamp

#### New fields in index "articles" and "replies"

Add a `hyperlinks` field as a nested object in `articles` and `replies`, with the following fields:

- `url`: Exact URL found in this article
- `canonicalUrl`: Canonical URL fetched from the page
- `title`: Copy of the title from the `urls` index, for query purposes
- `summary`: Copy of the summary text from the `urls` index, for query purposes

#### User flow & data flow

1. When the user queries an article using `ListArticles`'s `moreLikeThis` filter, and the text contains URLs (checked by [url extraction](https://github.com/kevva/url-regex)):
    1. First check whether the URL / canonical URL already exists in the `urls` index. If it exists, add its `summary` to the moreLikeThis text query.
    2. If the URL is not found, fetch & render the page using [rendertron](https://github.com/GoogleChrome/rendertron) and insert a new entry in the `urls` index. Invoke Goose3 to extract the `summary` and add it to the moreLikeThis text query.
    3. Perform the moreLikeThis query on the `articles` index's `text`, `hyperlinks.summary` and `hyperlinks.title` fields. The moreLikeThis query's `like` can refer to a document in a different index, so "adding to the query" can be implemented by providing the ID of the entry in the `urls` index: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
2. When a new article is submitted to the database, and the text contains URLs (checked by [url extraction](https://github.com/kevva/url-regex)):
    1. First check whether the URL / canonical URL already exists in the `urls` index. If it exists, add its `summary` and `title` to `hyperlinks`.
    2. If the URL is not found, fetch & render the page using [rendertron](https://github.com/GoogleChrome/rendertron) and insert a new entry in the `urls` index. Invoke Goose3 to extract the `summary` and `title`, and add them to the new article's `hyperlinks`.
3. When a reply is being submitted, apply 2-1 and 2-2.
4. On the website, URL preview pop-ups containing the URL's title and part of the summary are shown in a box in the article and reply. (Design TBD.
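Ask Luciennnnn)
5. Opt-in/opt-out options: `ListArticles` and `CreateArticle` can choose either to always fetch the latest page (which creates new entries in `urls`) or to use only the cached entries in `urls` (which speeds up the query).

### Discussions

(silence)

## Line

Progress update:

* The Mission Sticker feature cannot be made available for now.
* Once an official account is confirmed, going through the regular process once more will complete it.

Items to fix:

* How to amend the contract with the sticker team
    * The sticker development team publishes the stickers; revenue goes directly to the sticker team
    * Add to the contract: sticker publishing, revenue ownership
* How to give the stickers away
    * Register a new account and send them to the editors directly as gifts
    * [Tutorial: publishing from a personal account](http://rich01.com/linesop1/)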
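
Note on the URL preview proposal ("User flow & data flow", steps 1-1, 1-2, 2-1 and 2-2): these steps share the same cache-first lookup, which can be sketched roughly as below. This is only an illustration under several assumptions, not the planned implementation: a plain dict stands in for the `urls` index, a simplified regex stands in for url-regex, and `fetch_page` is a hypothetical callback representing "render with rendertron, then summarize with Goose3" (neither tool is actually called here). The names `UrlEntry`, `extract_urls`, `resolve_hyperlinks` and `cache_only` are made up for this sketch.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List

# Simplified stand-in for the url-regex package linked in the notes (assumption).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


@dataclass
class UrlEntry:
    """Hypothetical in-memory mirror of one document in the `urls` index."""
    url: str
    canonical_url: str
    title: str
    summary: str
    fetched_at: datetime


def extract_urls(text: str) -> List[str]:
    """Return the distinct URLs in `text`, in order of first appearance."""
    return list(dict.fromkeys(URL_PATTERN.findall(text)))


def resolve_hyperlinks(
    text: str,
    urls_index: Dict[str, UrlEntry],
    fetch_page: Callable[[str], UrlEntry],
    cache_only: bool = False,
) -> List[dict]:
    """Build `hyperlinks` entries for a message, reusing cached pages first."""
    hyperlinks = []
    for url in extract_urls(text):
        # The dict is keyed by both the exact URL and the canonical URL.
        entry = urls_index.get(url)
        if entry is None and not cache_only:
            # Cache miss: fetch the page and store it under both keys.
            entry = fetch_page(url)
            urls_index[url] = entry
            urls_index.setdefault(entry.canonical_url, entry)
        if entry is not None:
            hyperlinks.append({
                "url": url,
                "canonicalUrl": entry.canonical_url,
                "title": entry.title,
                "summary": entry.summary,
            })
    return hyperlinks


def fake_fetch(url: str) -> UrlEntry:
    """Stub fetcher so the sketch runs without network access."""
    return UrlEntry(
        url=url,
        canonical_url="https://example.com/a",
        title="Example title",
        summary="Example summary",
        fetched_at=datetime.now(timezone.utc),
    )


index: Dict[str, UrlEntry] = {}
links = resolve_hyperlinks("See https://example.com/a?ref=line", index, fake_fetch)
```

The `cache_only` flag corresponds to the "only use the cached entries" option in step 5; the "always fetch the latest page" option is not covered by this sketch.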