# Headline Generation
* [Folder Structure](#folder)
* [Usage](#usage)
* [Preprocess Input Data](#preprocess)
* [News Clustering](#cluster)
* [Extract Important Keywords (Nouns, Verbs)](#extract)
* [Generate Headline Candidates by Template](#phrase)
* [Ranking by BERTScore](#ranking)
## <a name="folder"></a>Folder Structure
```
topic_generation
│ README.md
│
└───data
│ │ c2c.csv
│ │ ntk.csv
│ └─── politics_news.csv
│
└───result
│
└───bert_score
│
│ cluster.py
│ ctfidf.py
│ event_generation.py
│ tokenizer.py
└ stop.txt
```
## Environment
* OS: Ubuntu 18.04
* Python: v3.9.13
* GPU: 2080Ti
* CUDA version: 11.4
## <a name="usage"></a>Usage
* Package Installation: `pip install -r requirement.txt`
* Generate Topic: `python event_generation.py`
* Important Global Variable
    * MAX_CNT: when the data source is c2c, the number of news items to use per topic id
    * IS_C2C: whether the news source is c2c
    * IS_NTK: whether the news source is ntk
    * DIM: target dimensionality after UMAP dimensionality reduction
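For reference, these globals might be configured as in the following sketch; the names follow the README, but the default values shown here are illustrative assumptions, not the repo's actual settings.

```python
# Hypothetical sketch of the global flags in event_generation.py.
# Values are illustrative assumptions only.
MAX_CNT = 10    # news items to use per topic id when the source is c2c
IS_C2C = True   # whether the news source is c2c
IS_NTK = False  # whether the news source is ntk
DIM = 5         # target dimensionality after UMAP reduction
```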
## <a name="preprocess"></a>Preprocess Input Data
* Input Format: CSV file
| title | content_summary | content_publish_time | uuid | body | url |
| -------- | -------- | -------- | -------- | -------- | -------- |
| String | String | Timestamp | String | String | String |
* Remove Author-Related Text by Regular Expression
    * File: `event_generation.py`
    * Function: `preprocess(data, add_label)`
        * data: dataframe read from the CSV file
        * add_label: for c2c data, add a `label` column based on `topic_id`
    * Note: `preprocess` segments sentences with pysbd and adds a `first_sents` column to the dataframe, using each article's first sentence as its value
* Keep `MAX_CNT` News Items in Each Topic (for ntk)
    * File: `event_generation.py`
    * Function: `remain_topk_in_day(data, max_cnt)`
        * data: dataframe after preprocessing
        * max_cnt: maximum number of news items per topic
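The two preprocessing steps can be sketched as follows. This is a simplified, assumed implementation: the byline regex is illustrative, and the real `preprocess` uses pysbd for sentence segmentation, which is stood in for here by splitting on the Chinese full stop.

```python
import pandas as pd

def preprocess(data: pd.DataFrame) -> pd.DataFrame:
    """Sketch: strip author bylines and derive a first_sents column.
    The byline pattern and the naive sentence split are assumptions;
    the real code uses pysbd for segmentation."""
    data = data.copy()
    data["body"] = data["body"].str.replace(r"（?記者\S+報導）?", "", regex=True)
    data["first_sents"] = data["body"].str.split("。").str[0]
    return data

def remain_topk_in_day(data: pd.DataFrame, max_cnt: int) -> pd.DataFrame:
    """Keep at most max_cnt news items per topic label."""
    return data.groupby("label").head(max_cnt)
```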
## <a name="cluster"></a>News Clustering
* Clustering by BERTopic
    * File: `event_generation.py`
    * Function: `clustering(data, feature_col)`
        * data: dataframe after preprocessing
        * feature_col: column to cluster on, default='first_sents'
    * Output: the original dataframe columns plus a `label` column
    * Note:
        * c2c data is already labeled, so no clustering is needed
        * see the `NewsCluster` section for clustering details
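The wrapper's contract can be sketched as below. `cluster_fn` is a hypothetical stand-in for the NewsCluster/BERTopic pipeline (any callable mapping a list of texts to integer labels); the real function's internals differ.

```python
import pandas as pd

def clustering(data: pd.DataFrame, feature_col: str = "first_sents",
               cluster_fn=None) -> pd.DataFrame:
    """Sketch of the wrapper's contract: return the dataframe with a
    `label` column attached. cluster_fn is a hypothetical stand-in for
    the NewsCluster/BERTopic pipeline."""
    data = data.copy()
    if "label" in data.columns:  # c2c data arrives pre-labeled; skip clustering
        return data
    data["label"] = cluster_fn(data[feature_col].tolist())
    return data
```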
## <a name="extract"></a>Extract Important Keywords (Nouns, Verbs)
* Tokenize the First Sentence
    * File: `tokenizer.py`
    * Function: `ChineseTokenizer.full_tokenize(docs, titles)`
        * docs: the first sentence of each article
        * titles: the title of each article
    * Output:
        * ret_sent_tok: tokenization of the first sentences
        * ret_noun: tokenization of the <b>titles</b>, keeping only <b>nouns</b>
        * ret_verb: tokenization of the <b>first sentences</b>, keeping only <b>verbs</b>
    * Note:
        * verbs in ret_verb are suffixed with `_VA` (intransitive) or `_VC` (transitive)
        * e.g. 抄襲_VC
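To make the `_VA`/`_VC` convention concrete, here is a sketch of the tagging step (a hypothetical helper, not the real `ChineseTokenizer` internals):

```python
def tag_verbs(tokens_with_pos):
    """Sketch: keep only verb tokens and suffix each with its CKIP POS
    tag, e.g. ('抄襲', 'VC') -> '抄襲_VC'.
    VC marks a transitive verb, VA an intransitive one."""
    return [f"{tok}_{pos}" for tok, pos in tokens_with_pos
            if pos in ("VA", "VC")]
```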
* Extract Keywords by c-TF-IDF
    * File: `ctfidf.py`
    * Function 1: `c_tf_idf(documents, datasize, tokenize)`
        * documents: the column of documents, grouped by label, on which TF-IDF scores are computed
        * datasize: number of records in the original data
        * tokenize: tokenization function
        * Output:
            * tf_idf: the computed TF-IDF matrix
            * count: the fitted CountVectorizer
        * Note:
            * to extract important nouns, pass in the <b>noun-only</b> tokenization returned by `full_tokenize`
            * to extract important verbs, pass in the <b>verb-only</b> tokenization returned by `full_tokenize`
    * Function 2: `extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=4)`
        * tf_idf: the TF-IDF matrix returned by `c_tf_idf`
        * count: the CountVectorizer returned by `c_tf_idf`
        * docs_per_topic: documents grouped by label
        * Output:
            * top_n_words: the keywords under each label and their TF-IDF scores
## <a name="phrase"></a>Generate Headline Candidates by Template
* Generate Headline Candidates by Template
    * File: `event_generation.py`
    * Function: `generate_phrase_template(clustered_data, top_n_nouns, top_n_verbs)`
        * clustered_data: dataframe after clustering
        * top_n_nouns: key nouns from `extract_top_n_words_per_topic`
        * top_n_verbs: key verbs from `extract_top_n_words_per_topic`
    * Output:
        * final_table: headline candidates for each cluster
    * Note:
        * there are four templates: nn, nnv, nv, nvn
        * each candidate is validated with `check_gram_order`, which checks that the candidate's words appear in the source text in the same order
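A minimal sketch of this step, simplified to a single text (the real function works over the clustered dataframe, and the actual template expansion may differ):

```python
def check_gram_order(words, text):
    """Sketch: True if the words occur in `text` in the given order."""
    pos = 0
    for w in words:
        pos = text.find(w, pos)
        if pos < 0:
            return False
        pos += len(w)
    return True

def generate_phrase_template(text, nouns, verbs):
    """Sketch: build nn / nv / nvn / nnv candidates from key nouns and
    verbs, keeping only those that pass check_gram_order."""
    cands = []
    for n1 in nouns:
        for v in verbs:
            cands.append([n1, v])                # nv
        for n2 in nouns:
            if n1 == n2:
                continue
            cands.append([n1, n2])               # nn
            for v in verbs:
                cands.append([n1, v, n2])        # nvn
                cands.append([n1, n2, v])        # nnv
    return ["".join(c) for c in cands if check_gram_order(c, text)]
```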
## <a name="ranking"></a>Ranking by BERTScore
* Generate Similarity Scores for (headline, first sentence) and (headline, title)
    * File: `event_generation.py`
    * Function: `ranking(clustered_data, candidate_table, top_n_nouns, top_n_verbs)`
        * clustered_data: dataframe after clustering
        * candidate_table: table generated by `generate_phrase_template`
        * top_n_nouns: key nouns from `extract_top_n_words_per_topic`
        * top_n_verbs: key verbs from `extract_top_n_words_per_topic`
    * Note:
        * each headline candidate gets one score against the first sentence and one against the title
        * the final score of a candidate is (first-sentence score) * 0.3 + (title score) * 0.7
        * if an intransitive verb in the candidate is followed by a noun, the score is multiplied by 0.98
        * if a transitive verb in the candidate is not followed by a noun, the score is multiplied by 0.98
        * if the candidate contains a verb and neither case above applies, the score is multiplied by 1.02
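The scoring rule described above can be sketched as a single function; the grammar flags are assumed to be precomputed from the candidate's POS tags:

```python
def final_score(sent_score, title_score, has_verb=False,
                vc_without_obj=False, va_with_obj=False):
    """Sketch of the ranking rule: a weighted mix of the two BERTScore
    similarities, adjusted by small grammar-based multipliers."""
    score = 0.3 * sent_score + 0.7 * title_score
    if va_with_obj or vc_without_obj:
        score *= 0.98   # intransitive verb followed by a noun, or
                        # transitive verb missing its object
    elif has_verb:
        score *= 1.02   # candidate contains a well-formed verb
    return score
```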
## NewsCluster
A class for news clustering
* Parameters initialization
* model: <a href="https://huggingface.co/ckiplab/albert-base-chinese">sbert pretrained model</a>
* tokenizer: <a href="https://github.com/ckiplab/ckip-transformers">chinese tokenizer based on ckip</a>
* umap_model: UMAP
* cluster_mode: HDBSCAN
* top_n_words: extract the top `n` keywords for each topic
* title_black_list: keywords marking news reports that should be filtered out
* embeded_topic_modeling(data, feature='first_sents', filter_rec=True, return_outlier=False)
    * Usage: topic modeling for the input data
    * Parameters
        * data(DataFrame): news data after preprocessing
        * feature(String): the column that clustering is based on
        * filter_rec(Bool): remove records whose title contains a keyword in `title_black_list`
        * return_outlier(Bool): treat each outlier as its own topic (for ntk)
    * Output:
        * DataFrame with label and keyword columns
* _record_filter(data)
    * Usage: remove records whose title contains a keyword in `title_black_list`
    * Parameters:
        * data(DataFrame): news data after preprocessing
    * Output: dataframe after dropping those records
* _clean(docs, rm_noisy)
    * Usage: remove useless text from sentences, such as brackets and the text inside them
    * Parameters
        * docs(list of String): sentences to be cleaned
        * rm_noisy(Bool): whether to remove stop words
    * Output: list of sentences with the useless text removed
* _check_keyword(text, keywords, consider_rank)
    * Usage: check how many keywords appear in the text
    * Parameters:
        * text(String): target sentence
        * keywords(list of String): list of keywords
        * consider_rank(Bool): if `consider_rank == True`, `keywords` is assumed to be sorted by importance in descending order, and matching a more important keyword yields a higher score
    * Output: matching score for `text`
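One plausible reading of this scoring scheme is sketched below; the exact weighting used by `_check_keyword` is an assumption.

```python
def check_keyword(text, keywords, consider_rank=False):
    """Sketch: count keyword hits in `text`. With consider_rank, an
    earlier (more important) keyword contributes a larger weight;
    the linear rank weighting here is an assumption."""
    score = 0.0
    n = len(keywords)
    for i, kw in enumerate(keywords):
        if kw in text:
            score += (n - i) / n if consider_rank else 1.0
    return score
```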
* _generate_feature(data, feature_col)
    * Usage: generate the feature that clustering is based on
    * Parameters
        * data(DataFrame): news data after preprocessing
        * feature_col: the column whose content is merged with the `title` column
    * Output: list of sentences produced by merging the content of `title` and `feature_col`
* _cluster(docs, rm_noisy)
    * Usage: first-stage clustering by BERTopic
    * Parameters:
        * docs(list of String): sentences to be clustered
        * rm_noisy(Bool): whether to remove stop words from the sentences
    * Output: DataFrame with label and keyword columns
* _generate_keyword_with_pos(data, cluster_col)
    * Usage: generate keywords with specific named entities or POS tags via c-TF-IDF
    * Parameters
        * data(DataFrame): data after `_cluster`
        * cluster_col: column generated by `_generate_feature`
    * Output: dictionary whose keys are labels and whose values are keywords
* _same_topic_detection(topic_keywords)
    * Usage: find clusters with high keyword overlap
    * Parameters:
        * topic_keywords: keywords generated by `_generate_keyword_with_pos` for each topic
    * Output:
        * Dictionary whose keys are the overlapping keywords and whose values are the cluster labels that contain those keywords
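A sketch of what such an overlap check might look like, assuming `topic_keywords` maps each label to its keyword list and using an assumed overlap threshold:

```python
def same_topic_detection(topic_keywords, min_overlap=2):
    """Sketch: for every pair of clusters sharing at least min_overlap
    keywords, record the shared keywords and the labels involved.
    The threshold and pairwise strategy are assumptions."""
    labels = list(topic_keywords)
    merged = {}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            overlap = set(topic_keywords[a]) & set(topic_keywords[b])
            if len(overlap) >= min_overlap:
                key = tuple(sorted(overlap))
                merged.setdefault(key, set()).update({a, b})
    return merged
```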