# CodeZero-ML ReadMe

## Notations:
- Raw Data:
    - Orange
- Internal Data:
    - Yellow
- Result/Output:
    - Orange
- Operation:
    - Red
- Pseudo data:
    - White

## Preprocessing:

### Pipeline:

### Input Data:
- post text:
    ```
    <post-id>\t<post-level>\t<text>
    ```
- post key text:
    ```
    <post-id>\t<text>
    ```
- new words:
    ```
    <word>\space<frequency>
    ```
- dictionary:
    ```
    <word>\space<frequency>
    ```
- hand-crafted words:
    ```
    <word>\space<defined-frequency>
    ```
- blacklist words:
    ```
    <word-to-remove>
    ```

### Output data:
- segmented post key text:
    ```
    <post-id>\t<segmented-text>
    ```

### Program:
- Parsing:
    - Merge topic and subtopic into post-key-text
- New Word detection:
    - input:
        - a text file with **characters** separated by spaces.
    - usage:
        - edit the "corpus_file" argument in "word_discovery.py".
- Word Segmentation:
    - usage:
        - input:
            - 1. post key text
            - 2. dictionary
        - segmented in batches (to avoid OOM)

## Keyword Extraction:

### Pipeline:

### Input Data:
- segmented post key text:
    ```
    <post-id>\t<text>
    ```
- segmented post key text (jsonl):
    ```
    <metadata to index>
    ```
- segmented post key text index:
    ```
    <binary index file>
    ```
- candidate keywords:
    ```
    <keyword>
    ```
- hand-crafted keywords:
    ```
    <keyword>
    ```

### Output data:
- keywords for each post:
    ```
    {
        <docid>: {<keywords>: <score>}
    }
    ```

### Program:
- Convert to jsonl:
    - convert the segmented text to jsonl so that the indexing program can index it
- Indexing:
    - build a binary index file used to calculate the BM25 score
- Candidate Word Retrieval:
    - Generate candidate keywords by POS-tagging and BM25 score
- Keyword Extraction and score calculation:
    - Extract keywords from each post using the candidate keywords and the BM25 score.

### TODO: Word-to-word relation:

### TODO: Post-to-Post Recommendation (Inverted Index):

### Input data:
- keywords for each post:
    ```
    {
        <docid>: [
            {<keyword_1>: <score>},
            {<keyword_2>: <score>},
            ...
        ]
    }
    ```
- keyword-to-post:
    ```
    {
        <keyword>: [docid_1, docid_2, ...]
    }
    ```
- keyword-to-keyword:
    ```
    {
        <keyword>: [
            {<keyword_1>: <score>},
            {<keyword_2>: <score>},
            ...
        ]
    }
    ```

## TODO: run-script sample:
Ex:
- python segment.py --arguments
- sample file
    - 3 ~ 4 lines
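
As a rough illustration of the "keyword-to-post" structure listed under Post-to-Post Recommendation (Inverted Index), the sketch below inverts the per-post keyword mapping into a keyword-to-post mapping. The function names `build_keyword_to_post` and `recommend`, the sample data, and the ranking rule (summing the source post's scores over shared keywords) are assumptions made for this example; they are not taken from the CodeZero-ML code, and the recommendation step is still marked TODO above.

```python
from collections import defaultdict


def build_keyword_to_post(post_keywords):
    """Invert {docid: [{keyword: score}, ...]} into {keyword: [docid, ...]}."""
    index = defaultdict(list)
    for docid, entries in post_keywords.items():
        for entry in entries:
            for keyword in entry:
                # Record each post at most once per keyword.
                if docid not in index[keyword]:
                    index[keyword].append(docid)
    return dict(index)


def recommend(docid, post_keywords, keyword_to_post):
    """Rank other posts by the summed score of keywords shared with `docid`.

    Hypothetical ranking rule, used only for illustration.
    """
    scores = defaultdict(float)
    for entry in post_keywords[docid]:
        for keyword, score in entry.items():
            for other in keyword_to_post.get(keyword, []):
                if other != docid:
                    scores[other] += score
    # Highest score first; ties broken by docid for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up sample data in the "keywords for each post" format.
post_keywords = {
    "p1": [{"python": 2.0}, {"bm25": 1.5}],
    "p2": [{"python": 1.0}],
    "p3": [{"bm25": 0.5}, {"index": 0.7}],
}
keyword_to_post = build_keyword_to_post(post_keywords)
related = recommend("p1", post_keywords, keyword_to_post)
```

With this sample data, `keyword_to_post` maps each keyword to the posts that contain it, and `related` lists the posts that share at least one keyword with `p1`.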