蔡易霖
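# 2022 Search TODO

~~【20220413】~~ ~~【20220420】~~ ~~【20220427】~~ ~~【20220504】~~ 【20220511】

[TOC]

![](https://hackmd.io/_uploads/Hk50yLv9h.png)
![](https://hackmd.io/_uploads/SJw7bIw93.png)

## ~~紐酷樂 demo: 0 week~~
1. ~~(力揚) Open MySQL for 易霖 to connect: DB is online, tables are still being updated~~
1. ~~(力揚) TN batch API: first version~~
2. ~~(力揚) Start the solr2elastic service: done~~
3. ~~(易霖) elastic TS~~
    - ~~Long-text search logic~~
    - ~~Add a time filter~~
4. ~~(易霖) elastic index~~

## ~~Common: 1 week~~
1. (*力揚) ~~QII-jar~~ + ~~QII-rest~~ dynamic multi-field (multi-model) support
    - ~~FET TS also needs to change.~~
    - Follow-up: multi-value handling
1. ~~(*豪仁) log rotate~~
    - ~~Point the Play logger at the console~~
    - ~~Non-Play Java loggers keep writing to file with internal rotation~~
    - ~~Four services (reports): each service deletes its own outdated logs.~~

## (End of May) FET phase 1 delivery
1. (力揚) Move the ES-related services into COMPOSE
    1. Main flow works; model updating still needs to be confirmed
3. (力揚) FET SIT testing (after docker-compose)
    - ~~(易霖) Load test: query log to jmeter plain text~~
    - (力揚) Enable the pinned shelf and add a few pinned articles
    - (易霖) Test TSQR ranking
5. Installation manual
    - Including index backup and restore
6. Internal deployment procedure (internal CI/CD)
5. ~~(力揚) TSQR-REST logging~~
3. ~~(*力揚) Split the hot keyword module into four tags.~~
    1. PR merged
5. (*力揚) Related words: handle the four tags. (status reply on Thursday)
    - Consider four containers, each training on a different tag field
7. (*紹哲) spider (豪仁's first-version data) writes to MySQL
    - ~~Crawl pages behind login~~
    - ~~content pattern (keep observing)~~
    - 404 issue: waiting for FET's reply
    - ~~Must write the update_time column~~
    - ~~sitemap~~
    - ~~Crawl the APIs~~
    - ~~If a record already exists, update it in the DB; if not, insert it~~
    - ~~HTML to String~~
1. (pending) Analyze what the scan says must change (frontend) and modify the code.
    - error log handling
        - ~~(before May) API 500 responses must not expose code info.~~
            - ~~TSQR/Autocomplete/Hotkeyword~~
        - (after May) Write all logs through the logger.
1. ~~(力揚) error correction (phase 1 trains the LM on DB data)~~
    - Phase 1: train the LM model from the DB (built one model from 2000 records; tests pass)
    - DB reader (switch from postgresql to mysql)
3. (*力揚) Arrange a GA code insertion tutorial with FET.
    - FET decided to share the existing GTM and is looking for the backend unit to configure it
    - Reply due before next week
2. (豪仁) GA --> BigQuery --> BM schedule ---> [KPI log]
    - The BigQuery query syntax needs to change
        - Fields
        - Default date (**力揚 says the fetch frequency must change from once a day to once an hour; this requirement had not been confirmed before (20220530)**)
            - Knock-on issue: log files are currently one file per day, so re-fetching the whole day every hour could cause heavy access
            - Changing only the query frequency:
                - Re-reads duplicate data (which raises GCP data extraction costs)
            - Option 1: every download from BigQuery is appended to the file, but files stay one per day (affects the BigQuery download component)
                - Adjust the table filter syntax
                - Adjust the query condition syntax
                - Adjust the file-writing logic
                - If a scheduled run fails, a rerun cannot use this approach
            - Option 2: all log files become hourly. (Affects every component that reads log files)
    - Get the BQ key
        - 力揚 says to put it on git
1. ~~(豪仁) container --> 1 week (backend part done, except UI) (BM stripped down to only the index POST function + sync containers (model/data) + health check + autoheal container + qii-jar-like done file)~~
    - ~~Still missing wrap-up (POST script) + logic (about 2 days left)~~
1. ~~(力揚) FET MySQL schema design complete~~
    1. https://docs.google.com/spreadsheets/d/1D5WQDSBzT8JIKH48-aARdDXdMYJ9WifmvglXZg5Rj3Q/edit#gid=25672902
3. ~~(力揚) FET oracle -> ITRI MySQL (one day)~~
    - Set up oracle and create the table schema: imported
    - [Not done] **ITRI MySQL table1 (crawler only) --(taking crawler time into account)---> table2 (oracle + crawler)**
    - Must write the ETL_UPDATE_TIME column to determine when the spider updated the data
1. ~~(*力揚) Query (TS+TSQR) java version. (three days)~~
    - ~~Custom ranking (channel boost, ...~~
    - ~~**易霖 and 力揚 finished discussing; conclusion: use multi sort**~~
    - Only the PR remains
1. ~~(*力揚) Indexing (index) python version. (change the config file)~~
1. ~~(*易霖) ES backup/restore~~
1. ~~(力揚) shelf DB, shelf-download-txt-REST~~
    - ~~shelf download ("xxxx/download-json") output format:~~
    ```
    [
    "/heal/",
    "http:"
    ]
    ```
1. ~~(易霖) ElasticIndex reads shelf-download-txt-REST --> determines how to generate the bool for each document --> index~~
1. ~~(易霖) PageTitle field bug~~
1. ~~(易霖) multi-sort~~
1. ~~(易霖) Add red dongle licensing to indexing~~
1. ~~(易霖) Deliver FET new words into the lexicon~~
    - `\\140.96.111.122\Team_M\M300\01_專案清單\SearchEngine\FET\廠商提供資料\新詞擷取\fet.txt`
1. Crawler issues
    1. ~~Each page needs a timeout (pages that redirect to themselves endlessly, download timeouts, ..)~~
    2. ~~Fixed the issue where FET_login took 12 hours~~
    3. (力揚) FET IP-ban issue
        - We can ask, but running once a week from now on should make a ban less likely

## (End of June) 紐酷樂 delivery

Overall progress
```
- [Done] ETL (manual table mapping, coding) (100% complete): 20 person-days
- [Done] Lexicon generation (new-word extraction) (complete): 4 person-days
- [Done] k8s research (100% complete): 7 person-days
- [Done] Workflow adoption (complete): 5 person-days
- Report log collection (80% complete): 5 person-days
  - [Done] Some formats still need changing
  - Rotation still to do (next week)
- [Done] Browser GA log collection (complete): 5 person-days
- [Done] es and kibana adoption (complete): 5 person-days
- [Done] TS query logic improvements, highlight, phrase search, vertical search, boosting (complete): 10 person-days
- [Done] Indexing program (complete): 10 person-days
- Admin backend (0%): 5 person-days (finish at least one item this week, the rest next week)
  - [Done] Containerize PHP
  - ip:port for BM, prefect... etc. must be parameterized
  - [Completion: 3/5] Reports (this week)
  - [Not done] Alerts (this week)
  - [Done] Real-time listing/delisting: flow becomes PHP (or BM) -> update Shelf -> call ESIndex to run an incremental index
  - [Not done] Index lexicon (TN reload)
  - [Not done] Forced conversion (TN reload)
  - [Not done] Related words (merge, reload rest)
  - [Done] Synonyms (es_reload_analyzer)
  - [Not done] TC settings (TC reload)
  - [Not done] Keyword exclusion (TCQR reload)
  - [Not done] Hot keyword exclusion (hotkeyword reload)
  - Training jobs
    - [Doing] QII train
    - [Not done] RW train
    - [Not done] EC train
    - [Not done] hotkeyword train
    - [Not done] autocomplete train
- Related words (next week)
  - [Done] Related-word PMI for demo (100%): 3 person-days
  - Related-word PMI integrated with logs (85%): 8 person-days
  - [Done] Related words train from log (for demo) (100%): 3 person-days
    - Format: each line is one key-values pair, tab-separated; field 0 is the key, fields 1 onward are the values (sorted by relatedness, highest score leftmost)
    - Example: `阿里山\t旗袍\t大同水上樂園\t嘉南大圳` means a query for 阿里山 returns the top 3 results: 旗袍, 大同水上樂園, 嘉南大圳
    - No further relations are generated beyond these
- Deploy on GKE
  - Started dynamically at index time: tn(for index)/es(for index)/fullindex
  - Tentative node labels: index:true / index:false
- Deployment: 6 person-days
- System tuning, finding reasonable resource requests for each pod: 3 person-days
- Cloud load testing
  - VM spec for 300 QPS on GKE (first draft exists)
  - Test Fuzzy mode
  - Add Highlight syntax to TS
  - Clustered ES (based on the FET version, to be added)
  - The load test must add "search all" queries with row set to 0
- Accuracy labeling
  - Explain MAP
  - Together with performance tests and case analysis
- Add keyword filtering for queries
  - [Done] (易霖) TS
  - [Done] (力揚) TSQR
- (力揚) etl
  - Fix: rows deleted in MSSQL are not deleted in MySQL
  - Clear the target before running
  - No training jobs may run during ETL
- TN : Exception :java.io.IOException: Stream closed
  - Suggestion: first log the details & fix the healthcheck (priority)
- RW : handle the log exception messages produced by the config file
- API change request: the search api fields are tentatively planned as: source system (source), source record ID (pk), title (title), description (description), representative image url (image url)
  - (1) Providing the representative image
  - (2) Whether the api field names can be changed to camelCase
- Among the 4 million records, quite a few are malformed documents that cannot be indexed
- Issue of prefect jobs running concurrently

Total effort: 100 days
Not done: 2+1.05+1.5+5+0.5+11+9=30.05
Completed ratio = 1 - (30.05 / 100) = 0.6995 ≈ 0.7
```

(力揚): Can the customer provide SMTP?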
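
The related-word file format described in the progress list above (one key per line, tab-separated, values ordered from most to least related) could be read with a few lines of Python. This is only a sketch: the function name `parse_related_words` and the in-memory dict are assumptions for illustration, not part of the project code.

```python
def parse_related_words(lines):
    """Map each key to its related words, most related first.

    Each line is tab-separated: field 0 is the key, fields 1.. are the
    values in descending order of relatedness.
    """
    table = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, *values = line.split("\t")
        table[key] = values
    return table


# The sample line is the one given in the note.
sample = ["阿里山\t旗袍\t大同水上樂園\t嘉南大圳\n"]
related = parse_related_words(sample)
top3 = related["阿里山"][:3]
```

Keeping the values as an ordered list preserves the ranking, so a top-N lookup is just a slice.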
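Customer questions:

~~1. Is it confirmed that every VM in the original system requirements becomes a container?~~

2. Can the spec be lowered and scaled dynamically through auto scaling?
> (力揚) Ask 力揚 for the minimum spec

3. Is PostgreSQL confirmed as the DB? Can it share MS SQL with other systems?
> (力揚) Internally we use MySQL; decide whether to use a managed service, and check MySQL pricing

~~4. Can cloud armor / cloud IDS be shared with other projects?~~

5. The service uses cloud storage, which was not included in the estimate. How much space is expected per month?
> What is the expected registry storage cost? Log storage? Mainly we want the overall storage price

---

2. ~~(*紹哲) PMI related words~~
    - ~~Ask 雅筠 why the model is not query-pair based?~~
4. (力揚) error correction
    - Logs are available, so training can start
5. ~~(*豪仁) Clarify the BM data update flow and identify which services need a call-script step~~
    - Focus on non-scheduled updates.
6. ~~(*易霖) (frontend services) K8S yml not yet wired into the flow~~
    - Presumably targeting the GCP environment?
7. (易霖) ~~k8s + workflow model update~~
1. ~~(力揚) The log readers for error correction / autocomplete / related word / qii-jar / hot keyword need to change.~~
    - **Important but not urgent; use csv + config for now**
    - Design a good log format (e.g. JSONL, or at least let the field information be defined externally)
        - {"item_id": "xxnfjdg", ...}
    - QII model preparation: ask 易霖 whether it needs retraining
    - Hot keyword statistics
1. ~~**(力揚) Log collection (our own) -> for reports**~~
    - Fluentd container <= on
        - stdout + fluentd driver
    - Test whether logs from short-lived jobs can be collected; it is fine if not, since short-lived jobs are usually triggered by prefect, which also collects logs
    - Collected logs need rotation
    - (易霖) TS word-segmentation records must move to stdout
    - Alert function
1. ~~Admin UI functions~~
    1. ~~System status -> not shown; use the k8s management UI (lens) instead~~
    2. ~~Source table management -> an operations function, not shown in the UI~~
    3. Listing/delisting: flow becomes PHP (or BM) -> update Shelf -> call ESIndex to run an incremental index
    4. Remove unneeded functions from the old BM (1. move cronjobs out 2. all ssh & scp behavior)
        1. Keep only REST to control downstream behavior
        2. ex: UI->BM->allserviceconf->automatically triggers the subsequent sync

## (End of August) FET phase 2 delivery
1. The log readers for autocomplete / related word / qii-jar need to change.
    - Design a good log format (e.g. JSON, or at least let the field information be defined externally)

## Questions for FET
1. ~~(力揚) Can the crawler crawl the APIs: discussed with FET, small volumes are OK~~
    - https://docs.google.com/spreadsheets/d/1s-GvEYPhNwc3VEFEmpuSxJMW104UCI9AJoTKh-7e3HI/edit#gid=2037129788&range=A17
2. (力揚) Can a dongle be plugged in: **NO**
    - Two license options were offered; waiting for FET's reply
    - In principle the first option will be used; FET will confirm the physical host with infra
    - (易霖) If the red dongle is used, ElasticIndex must enable the dongle setting
        - On failure, write a log
4. (力揚) Ask how to handle crawler 404s where the page still has an article: FET confirmed => this is a normal result
    - Follow-up: can a list of URLs that return 404 but are normal be provided?
    - A 404 on the page itself is abnormal; FET will check
5. ~~(力揚) Ask whether the crawler should filter personal information (e.g. bill contents) and, if so, how to decide what to filter: under eService, only fetch meta~~
6. (力揚) Related words: can we start collecting GA? The frontend currently writes no logs, so related words cannot be trained
    - **The logs also need to be converted into click-through rates and added to MySQL**
7. ~~(紹哲/力揚) 紹哲 hands the crawler logic to 力揚; ask FET to confirm the logic and record counts~~
    - FET says they will test after delivery
9. ~~(力揚) Ask which VM product is used?~~
10. ~~(力揚) Login reCAPTCHA issue~~
    - ~~FET reply: it only appears after repeated login failures; normally it should not~~
    - ~~Does the old version also show reCAPTCHA?~~
    - Do not go through the new login interface

## Questions for 紐酷樂
1. Can GKE be opened for us first?
1. Will the click logging method change later? (ask 青憲)

## NOTE
- apache as reverse proxy
    - Probably fine; [see this post for how](https://blog.davidou.org/archives/1334)
- TN speed improvement
- access variety (new-word algorithm)
- (紹哲) After segmenting the 紐酷樂 data, take bigrams and look at the top-frequency table; bigram-bigram PMI

## FET requirement confirmation
- Admin search log: is the impression count recorded per item? (phase 2)
- How long after a search is it reflected in the keyword report? (phase 2)
- Collect the new requirements raised at each meeting and discuss them

## Development improvements
- Add automatic component image publishing and the upload-to-Harbor job to Jenkins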
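
The bigram-bigram PMI idea in the NOTE section can be illustrated with a small sketch. It assumes pre-segmented tokens and estimates probabilities from document-level co-occurrence, which the note does not specify; the function names `bigrams` and `pmi_table` and the sample tokens are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def bigrams(tokens):
    """Return adjacent token pairs, e.g. [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))


def pmi_table(docs):
    """Compute PMI for every pair of items that co-occur in a document.

    docs is a list of item lists (here: lists of bigrams).
    PMI(x, y) = log2(p(x, y) / (p(x) * p(y))), with probabilities
    estimated as document frequencies (an assumption of this sketch).
    """
    n = len(docs)
    single = Counter()
    pair = Counter()
    for items in docs:
        uniq = sorted(set(items))  # count each item once per document
        single.update(uniq)
        pair.update(combinations(uniq, 2))
    return {
        (x, y): math.log2((count / n) / ((single[x] / n) * (single[y] / n)))
        for (x, y), count in pair.items()
    }


# Hypothetical pre-segmented documents.
segmented = [
    ["a", "b", "c"],
    ["a", "b", "c"],
    ["d", "e"],
    ["f", "g"],
]
table = pmi_table([bigrams(tokens) for tokens in segmented])
```

Sorting the table by value gives the most strongly associated bigram pairs; the `single` counter inside `pmi_table` is the same data a top-frequency table would be built from.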
