# CodeZero-ML ReadMe

## Notations:
- Raw Data:
    - Orange
- Internal Data:
    - Yellow
- Result/Output:
    - Orange
- Operation:
    - Red
- Pseudo data:
    - White

## Preprocessing:
### Pipeline:
![](https://i.imgur.com/zOimNma.png)

### Input Data:
- post text:
    ```
    <post-id>\t<post-level>\t<text>
    ```
- post key text:
    ```
    <post-id>\t<text>
    ```
- new words:
    ```
    <word>\space<frequency>
    ```
- dictionary:
    ```
    <word>\space<frequency>
    ```
- hand-crafted words:
    ```
    <word>\space<defined-frequency>
    ```
- blacklist words:
    ```
    <word-to-remove>
    ```

### Output data:
- segmented post key text:
    ```
    <post-id>\t<segmented-text>
    ```

### Program:
- Parsing:
    - Merge topic and subtopic into the post key text.
- New Word detection:
    - input:
        - a text file in which the **characters** are separated by spaces.
    - usage:
        - set the "corpus_file" argument in "word_discovery.py".
- Word Segmentation:
    - usage:
        - input:
            - 1. post key text
            - 2. dictionary
        - the text is segmented in batches (to avoid OOM).

## Keyword Extraction:
### Pipeline:
![](https://i.imgur.com/bt7BpNv.png)

### Input Data:
- segmented post key text:
    ```
    <post-id>\t<text>
    ```
- segmented post key text (jsonl):
    ```
    <metadata to index>
    ```
- segmented post key text index:
    ```
    <binary index file>
    ```
- candidate keywords:
    ```
    <keyword>
    ```
- hand-crafted keywords:
    ```
    <keyword>
    ```

### Output data:
- keywords for each post:
    ```
    {
        <docid>: {<keywords>: <score>}
    }
    ```

### Program:
- Convert to jsonl:
    - Convert the segmented post key text to jsonl so that the indexing program can index it.
- Indexing:
    - Index the jsonl into a binary file that is used to calculate the BM25 score.
- Candidate Word Retrieval:
    - Generate candidate keywords by POS-tagging and BM25 score.
- Keyword Extraction and score calculation:
    - Extract keywords from each post using the candidate keywords and the BM25 score.
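
The keyword extraction step above can be illustrated with a small in-memory sketch. It is not the project's implementation: the function names `parse_segmented_posts` and `bm25_keywords` are hypothetical, the segmented text is assumed to be whitespace-separated, the `k1` and `b` values are common BM25 defaults rather than settings taken from this repository, and the scores are computed directly from the text instead of from the binary index. POS-tagging is left out; the candidate keywords are simply passed in as a set.

```python
import math
from collections import Counter


def parse_segmented_posts(lines):
    """Parse '<post-id>\\t<segmented-text>' lines into {post_id: [tokens]}.

    Assumption: tokens in the segmented text are separated by whitespace.
    """
    posts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        post_id, _, text = line.partition("\t")
        posts[post_id] = text.split()
    return posts


def bm25_keywords(posts, candidates, k1=1.2, b=0.75, top_k=5):
    """Score candidate keywords in each post with Okapi BM25.

    Returns {post_id: {keyword: score}} with at most top_k keywords per post.
    k1 and b are commonly used defaults, not values from this project.
    """
    if not posts:
        return {}
    n_docs = len(posts)
    avg_len = sum(len(tokens) for tokens in posts.values()) / n_docs

    # Document frequency: in how many posts each token appears.
    doc_freq = Counter()
    for tokens in posts.values():
        doc_freq.update(set(tokens))

    result = {}
    for post_id, tokens in posts.items():
        scores = {}
        for word, tf in Counter(tokens).items():
            if word not in candidates:
                continue
            df = doc_freq[word]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            scores[word] = idf * tf * (k1 + 1) / norm
        best = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        result[post_id] = dict(best[:top_k])
    return result
```

Given lines such as `p1\tdeep learning model training`, calling `bm25_keywords(parse_segmented_posts(lines), candidates={"learning", "model"})` yields a mapping in the `{<docid>: {<keywords>: <score>}}` shape listed under Output data. Words that occur in fewer posts receive a higher score than equally frequent words that occur in many posts.
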
### TODO: Word-to-word relation:
![](https://i.imgur.com/P7JHrkU.png)

### TODO: Post-to-Post Recommendation (Inverted Index):
![](https://i.imgur.com/HqicrrM.png)

### Input data:
- keywords for each post:
    ```
    {
        <docid>: [
            {<keyword_1>: <score>},
            {<keyword_2>: <score>},
            ...
        ]
    }
    ```
- keyword-to-post:
    ```
    {
        <keyword>: [docid_1, docid_2, ...]
    }
    ```
- keyword-to-keyword:
    ```
    {
        <keyword>: [
            {<keyword_1>: <score>},
            {<keyword_2>: <score>},
            ...
        ]
    }
    ```

## TODO: run-script sample:
Ex:
- python segment.py --arguments
- sample file
    - 3 ~ 4 lines
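
Since Post-to-Post Recommendation is still a TODO, the following is only one possible sketch of how the keyword-to-post inverted index could be derived from the "keywords for each post" format and then used to rank related posts. The function names `build_keyword_to_post` and `recommend_posts` are hypothetical, and the ranking rule (the sum of products of shared keyword scores) is an assumption made for illustration, not a method described in this README.

```python
from collections import defaultdict


def build_keyword_to_post(post_keywords):
    """Invert {docid: [{keyword: score}, ...]} into {keyword: [docid, ...]}."""
    index = defaultdict(list)
    for docid, entries in post_keywords.items():
        for entry in entries:
            for keyword in entry:
                if docid not in index[keyword]:
                    index[keyword].append(docid)
    return dict(index)


def recommend_posts(docid, post_keywords, keyword_to_post, top_k=3):
    """Rank other posts that share keywords with `docid`.

    Assumed scoring: for every shared keyword, multiply the two posts'
    scores for that keyword and sum the products.
    """
    # Flatten each post's list of single-entry dicts into one dict.
    flat = {
        d: {k: s for entry in entries for k, s in entry.items()}
        for d, entries in post_keywords.items()
    }
    scores = defaultdict(float)
    for keyword, score in flat[docid].items():
        for other in keyword_to_post.get(keyword, []):
            if other != docid:
                scores[other] += score * flat[other][keyword]
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [other for other, _ in ranked[:top_k]]
```

The inverted index means only posts that share at least one keyword with the query post are visited, so the lookup does not need to scan every post.
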