# Headline Generation
* [Folder Structure](#folder)
* [Usage](#usage)
* [Preprocess Input Data](#preprocess)
* [News Clustering](#cluster)
* [Extract Important Keywords (Nouns, Verbs)](#extract)
* [Generate Headline Candidates by Template](#phrase)
* [Ranking by BERTScore](#ranking)
## <a name="folder"></a>Folder Structure
```
topic_generation
│ README.md
│
└───data
│ │ c2c.csv
│ │ ntk.csv
│ └─── politics_news.csv
│
└───result
│
└───bert_score
│
│ cluster.py
│ ctfidf.py
│ event_generation.py
│ tokenizer.py
└ stop.txt
```
## Environment
* OS: Ubuntu 18.04
* Python: v3.9.13
* GPU: 2080Ti
* CUDA version: 11.4
## <a name="usage"></a>Usage
* Package Installation: `pip install -r requirement.txt`
* Generate Topic: `python event_generation.py`
* Important Global Variable
    * MAX_CNT: when the data source is c2c, the number of news items to use per topic id
    * IS_C2C: whether the news source is c2c
    * IS_NTK: whether the news source is ntk
    * DIM: target dimensionality after UMAP dimensionality reduction
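For reference, these globals might be configured as in the following sketch; the names follow the README, but the default values shown here are illustrative assumptions, not the repo's actual settings.

```python
# Hypothetical sketch of the global flags in event_generation.py.
# Values are illustrative assumptions only.
MAX_CNT = 10    # news items to use per topic id when the source is c2c
IS_C2C = True   # whether the news source is c2c
IS_NTK = False  # whether the news source is ntk
DIM = 5         # target dimensionality after UMAP reduction
```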
## <a name="preprocess"></a>Preprocess Input Data
* Input Format: CSV file
| title | content_summary | content_publish_time | uuid | body | url |
| -------- | -------- | -------- | -------- | -------- | -------- |
| String | String | Timestamp | String | String | String |
* Remove Author-Related Text by Regular Expression
    * File: `event_generation.py`
    * Function: `preprocess(data, add_label)`
        * data: dataframe read from the CSV file
        * add_label: for c2c data, add a `label` column based on `topic_id`
    * Note: `preprocess` segments sentences with pysbd and adds a `first_sents` column to the dataframe, using each article's first sentence as its value
* Keep `MAX_CNT` News Items in Each Topic (for ntk)
    * File: `event_generation.py`
    * Function: `remain_topk_in_day(data, max_cnt)`
        * data: dataframe after preprocessing
        * max_cnt: maximum number of news items per topic
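The two preprocessing steps can be sketched as follows. This is a simplified, assumed implementation: the byline regex is illustrative, and the real `preprocess` uses pysbd for sentence segmentation, which is stood in for here by splitting on the Chinese full stop.

```python
import pandas as pd

def preprocess(data: pd.DataFrame) -> pd.DataFrame:
    """Sketch: strip author bylines and derive a first_sents column.
    The byline pattern and the naive sentence split are assumptions;
    the real code uses pysbd for segmentation."""
    data = data.copy()
    data["body"] = data["body"].str.replace(r"（?記者\S+報導）?", "", regex=True)
    data["first_sents"] = data["body"].str.split("。").str[0]
    return data

def remain_topk_in_day(data: pd.DataFrame, max_cnt: int) -> pd.DataFrame:
    """Keep at most max_cnt news items per topic label."""
    return data.groupby("label").head(max_cnt)
```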
## <a name="cluster"></a>News Clustering
* Clustering by BERTopic
    * File: `event_generation.py`
    * Function: `clustering(data, feature_col)`
        * data: dataframe after preprocessing
        * feature_col: column to cluster on, default='first_sents'
    * Output: the original dataframe columns plus a `label` column
    * Note:
        * c2c data is already labeled, so no clustering is needed
        * see the `NewsCluster` section for clustering details
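The wrapper's contract can be sketched as below. `cluster_fn` is a hypothetical stand-in for the NewsCluster/BERTopic pipeline (any callable mapping a list of texts to integer labels); the real function's internals differ.

```python
import pandas as pd

def clustering(data: pd.DataFrame, feature_col: str = "first_sents",
               cluster_fn=None) -> pd.DataFrame:
    """Sketch of the wrapper's contract: return the dataframe with a
    `label` column attached. cluster_fn is a hypothetical stand-in for
    the NewsCluster/BERTopic pipeline."""
    data = data.copy()
    if "label" in data.columns:  # c2c data arrives pre-labeled; skip clustering
        return data
    data["label"] = cluster_fn(data[feature_col].tolist())
    return data
```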
## <a name="extract"></a>Extract Important Keywords (Nouns, Verbs)
* Tokenize the First Sentence
    * File: `tokenizer.py`
    * Function: `ChineseTokenizer.full_tokenize(docs, titles)`
        * docs: the first sentence of each article
        * titles: the title of each article
    * Output:
        * ret_sent_tok: tokenization of the first sentences
        * ret_noun: tokenization of the <b>titles</b>, keeping only <b>nouns</b>
        * ret_verb: tokenization of the <b>first sentences</b>, keeping only <b>verbs</b>
    * Note:
        * verbs in ret_verb are suffixed with `_VA` (intransitive) or `_VC` (transitive)
        * e.g. 抄襲_VC
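To make the `_VA`/`_VC` convention concrete, here is a sketch of the tagging step (a hypothetical helper, not the real `ChineseTokenizer` internals):

```python
def tag_verbs(tokens_with_pos):
    """Sketch: keep only verb tokens and suffix each with its CKIP POS
    tag, e.g. ('抄襲', 'VC') -> '抄襲_VC'.
    VC marks a transitive verb, VA an intransitive one."""
    return [f"{tok}_{pos}" for tok, pos in tokens_with_pos
            if pos in ("VA", "VC")]
```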
* Extract Keywords by c-TF-IDF
    * File: `ctfidf.py`
    * Function 1: `c_tf_idf(documents, datasize, tokenize)`
        * documents: the column of documents, grouped by label, on which TF-IDF scores are computed
        * datasize: number of records in the original data
        * tokenize: tokenization function
        * Output:
            * tf_idf: the computed TF-IDF matrix
            * count: the fitted CountVectorizer
        * Note:
            * to extract important nouns, pass in the <b>noun-only</b> tokenization returned by `full_tokenize`
            * to extract important verbs, pass in the <b>verb-only</b> tokenization returned by `full_tokenize`
    * Function 2: `extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=4)`
        * tf_idf: the TF-IDF matrix returned by `c_tf_idf`
        * count: the CountVectorizer returned by `c_tf_idf`
        * docs_per_topic: documents grouped by label
        * Output:
            * top_n_words: the keywords under each label and their TF-IDF scores
## <a name="phrase"></a>Generate Headline Candidates by Template
* Generate Headline Candidates by Template
    * File: `event_generation.py`
    * Function: `generate_phrase_template(clustered_data, top_n_nouns, top_n_verbs)`
        * clustered_data: dataframe after clustering
        * top_n_nouns: key nouns from `extract_top_n_words_per_topic`
        * top_n_verbs: key verbs from `extract_top_n_words_per_topic`
    * Output:
        * final_table: headline candidates for each cluster
    * Note:
        * there are four templates: nn, nnv, nv, nvn
        * each candidate is validated with `check_gram_order`, which checks that the candidate's words appear in the source text in the same order
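A minimal sketch of this step, simplified to a single text (the real function works over the clustered dataframe, and the actual template expansion may differ):

```python
def check_gram_order(words, text):
    """Sketch: True if the words occur in `text` in the given order."""
    pos = 0
    for w in words:
        pos = text.find(w, pos)
        if pos < 0:
            return False
        pos += len(w)
    return True

def generate_phrase_template(text, nouns, verbs):
    """Sketch: build nn / nv / nvn / nnv candidates from key nouns and
    verbs, keeping only those that pass check_gram_order."""
    cands = []
    for n1 in nouns:
        for v in verbs:
            cands.append([n1, v])                # nv
        for n2 in nouns:
            if n1 == n2:
                continue
            cands.append([n1, n2])               # nn
            for v in verbs:
                cands.append([n1, v, n2])        # nvn
                cands.append([n1, n2, v])        # nnv
    return ["".join(c) for c in cands if check_gram_order(c, text)]
```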
## <a name="ranking"></a>Ranking by BERTScore
* Generate Similarity Scores for (headline, first sentence) and (headline, title)
    * File: `event_generation.py`
    * Function: `ranking(clustered_data, candidate_table, top_n_nouns, top_n_verbs)`
        * clustered_data: dataframe after clustering
        * candidate_table: table generated by `generate_phrase_template`
        * top_n_nouns: key nouns from `extract_top_n_words_per_topic`
        * top_n_verbs: key verbs from `extract_top_n_words_per_topic`
    * Note:
        * each headline candidate gets one score against the first sentence and one against the title
        * the final score of a candidate is (first-sentence score) * 0.3 + (title score) * 0.7
        * if an intransitive verb in the candidate is followed by a noun, the score is multiplied by 0.98
        * if a transitive verb in the candidate is not followed by a noun, the score is multiplied by 0.98
        * if the candidate contains a verb and neither case above applies, the score is multiplied by 1.02
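The scoring rule described above can be sketched as a single function; the grammar flags are assumed to be precomputed from the candidate's POS tags:

```python
def final_score(sent_score, title_score, has_verb=False,
                vc_without_obj=False, va_with_obj=False):
    """Sketch of the ranking rule: a weighted mix of the two BERTScore
    similarities, adjusted by small grammar-based multipliers."""
    score = 0.3 * sent_score + 0.7 * title_score
    if va_with_obj or vc_without_obj:
        score *= 0.98   # intransitive verb followed by a noun, or
                        # transitive verb missing its object
    elif has_verb:
        score *= 1.02   # candidate contains a well-formed verb
    return score
```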
## NewsCluster
A class for news clustering
* Parameters initialization
* model: <a href="https://huggingface.co/ckiplab/albert-base-chinese">sbert pretrained model</a>
* tokenizer: <a href="https://github.com/ckiplab/ckip-transformers">chinese tokenizer based on ckip</a>
* umap_model: UMAP
* cluster_mode: HDBSCAN
* top_n_words: extract the top `n` keywords for each topic
* title_black_list: keywords marking news reports that should be filtered out
* embeded_topic_modeling(data, feature='first_sents', filter_rec=True, return_outlier=False)
    * Usage: topic modeling for the input data
    * Parameters
        * data(DataFrame): news data after preprocessing
        * feature(String): the column that clustering is based on
        * filter_rec(Bool): remove records whose title contains a keyword in `title_black_list`
        * return_outlier(Bool): treat each outlier as its own topic (for ntk)
    * Output:
        * DataFrame with label and keyword columns
* _record_filter(data)
    * Usage: remove records whose title contains a keyword in `title_black_list`
    * Parameters:
        * data(DataFrame): news data after preprocessing
    * Output: dataframe after dropping those records
* _clean(docs, rm_noisy)
    * Usage: remove useless text from sentences, such as brackets and the text inside them
    * Parameters
        * docs(list of String): sentences to be cleaned
        * rm_noisy(Bool): whether to remove stop words
    * Output: list of sentences with the useless text removed
* _check_keyword(text, keywords, consider_rank)
    * Usage: check how many keywords appear in the text
    * Parameters:
        * text(String): target sentence
        * keywords(list of String): list of keywords
        * consider_rank(Bool): if `consider_rank == True`, `keywords` is assumed to be sorted by importance in descending order, and matching a more important keyword yields a higher score
    * Output: matching score for `text`
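One plausible reading of this scoring scheme is sketched below; the exact weighting used by `_check_keyword` is an assumption.

```python
def check_keyword(text, keywords, consider_rank=False):
    """Sketch: count keyword hits in `text`. With consider_rank, an
    earlier (more important) keyword contributes a larger weight;
    the linear rank weighting here is an assumption."""
    score = 0.0
    n = len(keywords)
    for i, kw in enumerate(keywords):
        if kw in text:
            score += (n - i) / n if consider_rank else 1.0
    return score
```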
* _generate_feature(data, feature_col)
    * Usage: generate the feature that clustering is based on
    * Parameters
        * data(DataFrame): news data after preprocessing
        * feature_col: the column whose content is merged with the `title` column
    * Output: list of sentences produced by merging the content of `title` and `feature_col`
* _cluster(docs, rm_noisy)
    * Usage: first-stage clustering by BERTopic
    * Parameters:
        * docs(list of String): sentences to be clustered
        * rm_noisy(Bool): whether to remove stop words from the sentences
    * Output: DataFrame with label and keyword columns
* _generate_keyword_with_pos(data, cluster_col)
    * Usage: generate keywords with specific named entities or POS tags via c-TF-IDF
    * Parameters
        * data(DataFrame): data after `_cluster`
        * cluster_col: column generated by `_generate_feature`
    * Output: dictionary whose keys are labels and whose values are keywords
* _same_topic_detection(topic_keywords)
    * Usage: find clusters with high keyword overlap
    * Parameters:
        * topic_keywords: keywords generated by `_generate_keyword_with_pos` for each topic
    * Output:
        * Dictionary whose keys are the overlapping keywords and whose values are the cluster labels that contain those keywords
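A sketch of what such an overlap check might look like, assuming `topic_keywords` maps each label to its keyword list and using an assumed overlap threshold:

```python
def same_topic_detection(topic_keywords, min_overlap=2):
    """Sketch: for every pair of clusters sharing at least min_overlap
    keywords, record the shared keywords and the labels involved.
    The threshold and pairwise strategy are assumptions."""
    labels = list(topic_keywords)
    merged = {}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            overlap = set(topic_keywords[a]) & set(topic_keywords[b])
            if len(overlap) >= min_overlap:
                key = tuple(sorted(overlap))
                merged.setdefault(key, set()).update({a, b})
    return merged
```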