CodeZero-ML ReadMe
Notations:
- Raw Data:
- Internal Data:
- Result/Output:
- Operation:
- Pseudo data:
Preprocessing:
Pipeline:
- post text:
- post key text:
- new words:
- dictionary:
- hand-crafted words:
- blacklist words:
Output data:
Program:
- Parsing:
- Merge topic and subtopic into post-key-text
- New Word detection:
- input:
- a text file in which characters are separated by spaces.
- usage:
- set the "corpus_file" argument in "word_discovery.py" to the input text file.
- Word Segmentation:
- usage:
- input:
- segmented in batches (to avoid running out of memory)
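The two input conventions above can be sketched as follows. This is a minimal illustration, not the project's code: `space_separate` writes text in the space-separated character format expected by new word detection, and `segment_in_batches` feeds posts to a segmenter one fixed-size batch at a time so that only one batch is held in memory. The function names, the `batch_size` default, and the `segment_batch` callable are all hypothetical and stand in for whatever segmenter the project actually uses.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def space_separate(text: str) -> str:
    """Return `text` with one space between every non-whitespace character."""
    return " ".join(ch for ch in text if not ch.isspace())


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` items, consuming `items` lazily."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def segment_in_batches(
    posts: Iterable[str],
    segment_batch: Callable[[List[str]], List[List[str]]],  # hypothetical segmenter
    batch_size: int = 1000,  # assumed default; tune to available memory
) -> Iterator[List[str]]:
    """Segment posts batch by batch and yield one token list per post."""
    for batch in batched(posts, batch_size):
        yield from segment_batch(batch)
```

Because `posts` can be any iterable, a file object can be passed directly so the corpus is never loaded in full.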
Pipeline:
- segmented post key text:
- segmented post key text(jsonl):
- segmented post key text index:
- candidate keywords:
- hand-crafted keywords:
Output data:
Program:
- Convert to jsonl:
- convert the segmented post key text to jsonl so that the indexing program can index it
- Indexing:
- index into a binary file used to calculate the BM25 score
- Candidate Word Retrieval:
- Generate candidate keywords using POS tagging and the BM25 score
- Keyword Extraction and score calculation:
- Extract keywords from each post using the candidate keywords and the BM25 score.
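The scoring and extraction steps above can be illustrated with a small in-memory sketch. It uses the standard Okapi BM25 formula (with the `log(1 + ...)` form of IDF) and is an assumption about the scoring, not the project's implementation: the real pipeline indexes into a binary file, and the candidate set would come from POS tagging, which is not shown here. The function names and the `k1`, `b`, and `top_k` defaults are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple


def bm25_scores(
    docs: List[List[str]], k1: float = 1.5, b: float = 0.75
) -> List[Dict[str, float]]:
    """Score every term of every segmented post against its own post with BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of posts that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        doc_scores = {}
        for term, freq in Counter(d).items():
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = freq + k1 * (1 - b + b * len(d) / avg_len)
            doc_scores[term] = idf * freq * (k1 + 1) / norm
        scores.append(doc_scores)
    return scores


def extract_keywords(
    docs: List[List[str]], candidates: Set[str], top_k: int = 3
) -> List[List[Tuple[str, float]]]:
    """Keep the `top_k` highest-scoring candidate keywords of each post."""
    result = []
    for doc_scores in bm25_scores(docs):
        # Sort by descending score, then alphabetically to break ties.
        ranked = sorted(
            (t for t in doc_scores if t in candidates),
            key=lambda t: (-doc_scores[t], t),
        )
        result.append([(t, doc_scores[t]) for t in ranked[:top_k]])
    return result
```

Terms that appear in fewer posts receive a higher IDF, so a rare candidate outranks a common one when both occur equally often in the same post.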
TODO: Word-to-word relation:
TODO: Post-to-Post Recommendation (Inverted Index):
- keywords for each post:
- keyword-to-post:
- keyword-to-keyword:
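Since this part is still a TODO, the following is only one possible shape for it: a minimal sketch of an inverted index that maps each keyword to the posts containing it, plus a recommender that ranks other posts by the number of shared keywords. The function names, the post-id strings, and the overlap-count ranking are assumptions. The keyword-to-keyword data is not covered because its definition is not given here.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set


def build_keyword_to_post(post_keywords: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert a post -> keywords mapping into keyword -> set of post ids."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for post_id, keywords in post_keywords.items():
        for keyword in keywords:
            index[keyword].add(post_id)
    return dict(index)


def recommend(
    post_id: str,
    post_keywords: Dict[str, List[str]],
    keyword_to_post: Dict[str, Set[str]],
    top_k: int = 5,  # assumed default
) -> List[str]:
    """Rank other posts by how many keywords they share with `post_id`."""
    overlap: Counter = Counter()
    for keyword in set(post_keywords[post_id]):
        for other in keyword_to_post.get(keyword, ()):
            if other != post_id:
                overlap[other] += 1
    # Sort by descending overlap, then by post id to keep the order stable.
    ranked = sorted(overlap, key=lambda p: (-overlap[p], p))
    return ranked[:top_k]
```

Looking up candidates through the inverted index means only posts that share at least one keyword are ever compared, rather than every pair of posts.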
TODO:
run-script sample:
Ex: