CodeZero-ML ReadMe

Notations:

  • Raw Data:
    • Orange
  • Internal Data:
    • Yellow
  • Result/Output:
    • Orange
  • Operation:
    • Red
  • Pseudo data:
    • White

Preprocessing:

Pipeline:

[Figure: preprocessing pipeline diagram]

Input Data:

  • post text:
    <post-id>\t<post-level>\t<text>
    
  • post key text:
    <post-id>\t<text>
    
  • new words:
    <word>\space<frequency>
    
  • dictionary:
    <word>\space<frequency>
    
  • hand-crafted words:
    <word>\space<defined-frequency>
    
  • blacklist words:
    <word-to-remove>
    

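The input formats above can be read with a few lines of Python. This is only a sketch: the names `Post`, `parse_post_text`, `parse_word_freq` and `merge_dictionaries` are hypothetical, frequencies are assumed to be integers, and the precedence rule (hand-crafted words override the other lists, blacklist words are dropped last) is an assumption rather than documented behaviour of the pipeline.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Set


class Post(NamedTuple):
    post_id: str
    level: str
    text: str


def parse_post_text(lines: Iterable[str]) -> Iterator[Post]:
    """Yield one Post per '<post-id>\\t<post-level>\\t<text>' line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first two tabs only, so tabs inside the text survive.
        post_id, level, text = line.split("\t", 2)
        yield Post(post_id, level, text)


def parse_word_freq(lines: Iterable[str]) -> Dict[str, int]:
    """Read '<word> <frequency>' lines into a word -> frequency mapping."""
    freq: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The frequency is the last space-separated field (assumed integer).
        word, count = line.rsplit(" ", 1)
        freq[word] = int(count)
    return freq


def merge_dictionaries(
    base: Dict[str, int],
    new_words: Dict[str, int],
    hand_crafted: Dict[str, int],
    blacklist: Set[str],
) -> Dict[str, int]:
    """Combine the word lists; the precedence order here is an assumption."""
    merged = dict(base)
    merged.update(new_words)
    merged.update(hand_crafted)  # assumed: hand-defined frequencies win
    for word in blacklist:
        merged.pop(word, None)  # assumed: blacklist words are removed last
    return merged
```

The blacklist file holds one word per line, so it can be loaded as `{line.strip() for line in f if line.strip()}`.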
Output Data:

  • segmented post key text:
    <post-id>\t<segmented-text>
    

Program:

  • Parsing:
    • Merge the topic and subtopic into the post key text.
  • New Word Detection:
    • input:
      • a text file in which characters are separated by spaces.
    • usage:
      • set the "corpus_file" argument in "word_discovery.py" to this file.
  • Word Segmentation:
    • usage:
      • input:
          1. post key text
          2. dictionary
      • posts are segmented in batches (to avoid OOM)
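The two mechanical steps above, spacing out characters for new word detection and segmenting in batches, might look roughly like this. It is a sketch under stated assumptions: `space_separate`, `forward_max_match` and `segment_in_batches` are hypothetical names, the greedy forward maximum matching is only a stand-in for whatever segmenter the project actually uses, and the segmented text is assumed to be space-joined.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def space_separate(text: str) -> str:
    """Prepare text for new word detection: one space between characters."""
    return " ".join(ch for ch in text if not ch.isspace())


def forward_max_match(text: str, dictionary: Dict[str, int], max_len: int = 4) -> List[str]:
    """Stand-in segmenter: greedy longest dictionary match, left to right."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def segment_in_batches(
    rows: Iterable[Tuple[str, str]],
    segment_batch: Callable[[List[str]], List[List[str]]],
    batch_size: int = 1000,
) -> Iterator[Tuple[str, str]]:
    """Yield (post_id, segmented_text) while holding one batch in memory at a time."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        ids = [post_id for post_id, _ in batch]
        segmented = segment_batch([text for _, text in batch])
        for post_id, tokens in zip(ids, segmented):
            yield post_id, " ".join(tokens)
```

Because `rows` is consumed lazily, the input can be a generator over the post key text file, and each yielded pair can be written out as `<post-id>\t<segmented-text>` immediately.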

Keyword Extraction:

Pipeline:

[Figure: keyword extraction pipeline diagram]

Input Data:

  • segmented post key text:
    <post-id>\t<text>
    
  • segmented post key text (jsonl):
    <metadata to index>
    
  • segmented post key text index:
    <binary index file>
    
  • candidate keywords:
    <keyword>
    
  • hand-crafted keywords:
    <keyword>
    

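As an illustration of the jsonl input, the segmented post key text can be turned into one JSON object per line. The README does not specify the jsonl schema, so the field names `id` and `contents` below are assumptions, as is the function name `to_jsonl`; adjust them to whatever the indexing program expects.

```python
import json
from typing import Iterable, Iterator


def to_jsonl(lines: Iterable[str]) -> Iterator[str]:
    """Convert '<post-id>\\t<text>' lines into JSON lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        post_id, text = line.split("\t", 1)
        # "id" and "contents" are assumed field names, not confirmed by the README.
        record = {"id": post_id, "contents": text}
        # ensure_ascii=False keeps non-ASCII text readable in the output file.
        yield json.dumps(record, ensure_ascii=False)
```

Each yielded string is written as one line of the jsonl file.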
Output Data:

  • keywords for each post:
    {
        <docid>: 
            {<keywords>: <score>}
    }
    

Program:

  • Convert to jsonl:
    • Convert the segmented post key text to jsonl so that the indexing program can index it.
  • Indexing:
    • Build a binary index file, which is used to calculate the BM25 score.
  • Candidate Word Retrieval:
    • Generate candidate keywords using POS tagging and the BM25 score.
  • Keyword Extraction and Score Calculation:
    • Extract keywords for each post using the candidate keywords and the BM25 score.
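To make the scoring step concrete, here is a small in-memory sketch of BM25-based keyword extraction. It is not the project's indexer: it skips the binary index and POS tagging, takes the candidate keywords as given, and uses the common Okapi BM25 form with a smoothed IDF. The function name `bm25_keywords` and the defaults `k1=1.2`, `b=0.75`, `top_k=5` are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def bm25_keywords(
    docs: Dict[str, List[str]],
    candidates: Set[str],
    k1: float = 1.2,
    b: float = 0.75,
    top_k: int = 5,
) -> Dict[str, Dict[str, float]]:
    """Return {docid: {keyword: score}} for candidate keywords found in each post."""
    if not docs:
        return {}
    n_docs = len(docs)
    # Guard against an all-empty corpus to avoid dividing by zero.
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs or 1.0
    # Document frequency: number of posts that contain each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))

    result: Dict[str, Dict[str, float]] = {}
    for doc_id, tokens in docs.items():
        tf = Counter(t for t in tokens if t in candidates)
        # Length normalisation term shared by every keyword in this post.
        norm = k1 * (1 - b + b * len(tokens) / avg_len)
        scores = {}
        for term, freq in tf.items():
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            scores[term] = idf * freq * (k1 + 1) / (freq + norm)
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        result[doc_id] = dict(top)
    return result
```

The return value has the same shape as the "keywords for each post" output, with segmented tokens as input, for example `bm25_keywords({"1": ["apple", "pie"]}, {"apple"})`.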

TODO: Word-to-word relation:

[Figure: word-to-word relation diagram]

TODO: Post-to-Post Recommendation (Inverted Index):

[Figure: post-to-post recommendation diagram]

Input Data:

  • keywords for each post:
    {
        <docid>: 
            [
                {<keyword_1>: <score>},
                {<keyword_2>: <score>}, ...
            ]
    }
    
  • keyword-to-post:
    {
        <keyword>: [docid_1, docid_2, ...]
    }
    
  • keyword-to-keyword:
    {
        <keyword>: 
        [
            {<keyword_1>: <score>}, 
            {<keyword_2>: <score>}, ...
        ]
    }
    

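Since this part is still a TODO, the following is only one possible sketch of how the keyword-to-post inverted index could be built from the per-post keywords and then used for recommendation. The function names and the ranking rule (sum of products of the shared keywords' scores) are assumptions, and the keyword-to-keyword relation is not used here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Shape of "keywords for each post": {docid: [{keyword: score}, ...]}
PostKeywords = Dict[str, List[Dict[str, float]]]


def flatten(entries: List[Dict[str, float]]) -> Dict[str, float]:
    """Collapse [{kw: score}, ...] into a single {kw: score} mapping."""
    return {kw: score for entry in entries for kw, score in entry.items()}


def build_keyword_to_post(post_keywords: PostKeywords) -> Dict[str, List[str]]:
    """Build the inverted index {keyword: [docid, ...]}."""
    index: Dict[str, List[str]] = defaultdict(list)
    for doc_id, entries in post_keywords.items():
        for keyword in flatten(entries):
            index[keyword].append(doc_id)
    return dict(index)


def recommend(
    doc_id: str,
    post_keywords: PostKeywords,
    keyword_to_post: Dict[str, List[str]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank other posts by shared keywords (assumed rule: sum of score products)."""
    flat = {d: flatten(entries) for d, entries in post_keywords.items()}
    totals: Dict[str, float] = defaultdict(float)
    for keyword, score in flat[doc_id].items():
        # Only posts listed under a shared keyword are ever touched.
        for other in keyword_to_post.get(keyword, []):
            if other != doc_id:
                totals[other] += score * flat[other][keyword]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

The inverted index keeps the lookup proportional to the number of posts sharing a keyword with the query post, rather than to the whole collection.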
TODO:

Run-script sample:
Example:

  • python segment.py <arguments>
  • sample file
    • 3 to 4 lines