
A Painful Record of Language Model Training - 李振維

Welcome to the MOPCON 2020 shared notes


Shared notes index: https://hackmd.io/@mopcon/2020
On mobile, tap the button at the top to expand the agenda list.

Start here

nununi

  • Mainly used to solve NLP-related problems.
  • It is an AI product.

auto tagging service

  • Finds the important tags for a product based on an understanding of that product.

  • Embedding

    • Data preprocessing
    • Optimization tips for training embedding model
    • Two embedding models
    • Comparing Google BERT and awoo's embedding model
  • Our base structure

    • attention mechanism
  • Auto daily retraining process

  • Distributed for training for

Embedding

  • Text vectorization: turning text into vectors.

Data Preprocessing

  • Stop words -> word segmentation -> embedding training
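The pipeline order above can be sketched roughly as follows. This is a minimal illustration, not awoo's implementation: the stop-word list, the regex-based `segment` function (a stand-in for a real word segmenter), and the sample corpus are all assumptions made for the example.

```python
import re

# Hypothetical stop-word list; the real list is not given in the talk.
STOP_WORDS = ["the", "a", "of"]
_STOP_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, STOP_WORDS)) + r")\b",
    re.IGNORECASE,
)

def remove_stop_words(text):
    """Step 1: blank out stop words in the raw text."""
    return _STOP_RE.sub(" ", text)

def segment(text):
    """Step 2: stand-in segmenter that lowercases and extracts word runs."""
    return re.findall(r"\w+", text.lower())

def preprocess(text):
    """Stop words -> segmentation; the result feeds embedding training."""
    return segment(remove_stop_words(text))

corpus = ["The scent of a rose", "A bottle of rose perfume"]
sentences = [preprocess(doc) for doc in corpus]
print(sentences)  # [['scent', 'rose'], ['bottle', 'rose', 'perfume']]
```

The token lists in `sentences` are what an embedding trainer would consume in step 3; for Chinese text, a dictionary-based segmenter would replace `segment`.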

Optimization tips (1): Filter useless data

  • For data source:
    • Use regular expressions to filter noise, e.g. product IDs and service text
  • For embedding model:
    • Use a stop-word mechanism to filter
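As an illustration of the regular-expression filter for the data source, here is a small sketch. The product-ID shape it matches (two or more capital letters, an optional hyphen, then three or more digits) is a made-up assumption; the real patterns depend on the data and are not given in the talk.

```python
import re

# Hypothetical product-ID shape, e.g. "SKU-12345" or "AB9981".
PRODUCT_ID = re.compile(r"\b[A-Z]{2,}-?\d{3,}\b")

def strip_product_ids(text):
    """Remove ID-like tokens and collapse the leftover whitespace."""
    cleaned = PRODUCT_ID.sub(" ", text)
    return " ".join(cleaned.split())

cleaned = strip_product_ids("Matte lipstick SKU-12345 in coral AB9981")
print(cleaned)  # Matte lipstick in coral
```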

Optimization tips (2): Stop-word mechanism for training

In our business case, we keep one training sample per stop word.

  • Benefits:
    • We get all the embeddings we need, including stop words.
    • Other words won't be badly affected by the stop words.
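One possible reading of this mechanism, sketched below, is to let each stop word appear only once in the training corpus, so it still gets an embedding but rarely shows up in other words' contexts. This interpretation, the function name `cap_stop_word_samples`, and the sample data are all assumptions; the talk does not describe the actual procedure.

```python
def cap_stop_word_samples(sentences, stop_words):
    """Keep only the first occurrence of each stop word across the corpus.

    Assumed interpretation of "one training sample per stop word":
    later occurrences are dropped from the token lists.
    """
    seen = set()
    result = []
    for tokens in sentences:
        kept = []
        for tok in tokens:
            if tok in stop_words:
                if tok in seen:
                    continue  # this stop word already has its one sample
                seen.add(tok)
            kept.append(tok)
        result.append(kept)
    return result

samples = cap_stop_word_samples(
    [
        ["the", "red", "lipstick"],
        ["the", "matte", "lipstick", "of", "the", "year"],
    ],
    stop_words={"the", "of"},
)
print(samples)
# [['the', 'red', 'lipstick'], ['matte', 'lipstick', 'of', 'year']]
```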

Two embedding models

| | Char-Level | Word-Level |
|---|---|---|
| Pros | New characters and words can be processed | Different words end up with clearly different vectors after training |
| Cons | Similar descriptions have similar vectors | 1. Requires a dictionary 2. New-word problem |
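The trade-off can be seen in a toy tokenizer for each level. This is only a sketch: the dictionary, the `<unk>` placeholder, and the whitespace split (standing in for a real segmenter) are assumptions for illustration.

```python
def char_tokens(text):
    """Char-level: every non-space character is a token, so nothing is unknown."""
    return [ch for ch in text if not ch.isspace()]

def word_tokens(text, dictionary, unk="<unk>"):
    """Word-level: words missing from the dictionary collapse to a placeholder."""
    return [w if w in dictionary else unk for w in text.split()]

dictionary = {"matte", "lipstick"}  # hypothetical dictionary

words = word_tokens("matte lipgloss", dictionary)
chars = char_tokens("lipgloss")
print(words)  # ['matte', '<unk>']
print(chars)  # ['l', 'i', 'p', 'g', 'l', 'o', 's', 's']
```

The new word "lipgloss" is lost at the word level but still representable at the char level, which matches the pros and cons listed above.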

awoo Embedding vs. BERT

  • Google BERT: 768 dimensions
  • awoo Embedding Model: 128 dimensions
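  • LM(structure) refers to our various new language model structures.

What causes the difference in the numbers?

  1. BERT's training set is not quite the same as awoo's training set.
  2. BERT does not really distinguish between industries.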

Normal Language model

  • RNN / LSTM
  • Bi-RNN / Bi-LSTM
  • Self-attention
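For reference, the last item in this list can be sketched as plain scaled dot-product self-attention. This is the textbook form with Q = K = V = x and no learned projections, written in pure Python as a simplifying assumption; it is not awoo's model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x.

    x is a list of token vectors; learned projection matrices are omitted.
    """
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

attended = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Each output vector is a mix of all input vectors, weighted toward the tokens most similar to the query token.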

Attention mechanism(2)

  • Why don't we need to use "Positional Encoding" in our model?
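The note records the question but not the speaker's answer. For context only, this is the sinusoidal positional encoding from the original Transformer design, which is what a model would normally add to its token embeddings so that attention can see word order. The function name and sizes below are illustrative assumptions.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(num_positions=3, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```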

Multi-Industry language model

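  • General -> Makeup -> Furniture
  • The main reason for splitting by industry is that each industry has its own key terms. If vectors and features were fed in without distinguishing industries, the model would lose those industry-specific word characteristics.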

Graph Understanding Model

Automatic daily retraining process

Others

  • A self-built word segmentation dictionary, adapted from an existing dictionary. How it was adapted is a trade secret.

Chat area

Chat OAO

Everyone, this is the last session OvO

Haku is so cute
That's Akira Toya
No!!! QQ It's Haku
The one who keeps shouting Sai
LOL XDD

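Haku: Looking for me? All I can do is play Go.

So, where's Chihiro?

Chihiro: You called?

Chihiro, how did you get so dirty?

Chihiro: Then I'll go take a bath

Is there a grown-up version of Chihiro? XD

You called me?

The Witch of the Waste graces us with her presence

That's not Chihiro, give me back my cute Chihiro QQ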

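Do you hate me? Do you hate me? Do you hate me? Q口Q

I don't think this works. Sob sob, that scared me QQ, the one above was so cute.

You're so picky

Then how about me?

This one, hmmm, am I being too picky? QQ I'm sure it's definitely not my problem RRRRRR

Looks like it's my turn

Who are you!? Nobody cued you, get off the stage (X

Dog?

Dad, how are you this handsome!

See you all next year!!

Bye-bye~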

tags: MOPCON 2020