CodeZero-ML ReadMe

Notations:

Raw Data:
- Orange
Internal Data:
- Yellow
Result/Output:
- Orange
Operation:
- Red
Pseudo data:
- White

Preprocessing:

Pipeline:

Input Data:

post text:

<post-id>\t<post-level>\t<text>

post key text:
```
<post-id>\t<text>
```
new words:
```
<word>\space<frequency>
```
dictionary:
```
<word>\space<frequency>
```

hand-crafted words:

<word>\space<defined-frequency>

blacklist words:
```
<word-to-remove>
```
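
The word-list formats above can be loaded with a few lines of Python. This is a minimal sketch, not the project's actual loader: the function names are hypothetical, frequencies are assumed to be integers, and the rule that hand-crafted words override the other lists is an assumption the README does not state.

```python
def parse_frequency_lines(lines):
    """Parse '<word> <frequency>' lines into a {word: frequency} dict."""
    frequencies = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last space so the frequency is always the final field.
        word, _, freq = line.rpartition(" ")
        frequencies[word] = int(freq)  # assumption: frequencies are integers
    return frequencies


def build_vocabulary(dictionary, new_words, hand_crafted, blacklist):
    """Merge the three word lists, then drop blacklisted words.

    Later sources override earlier ones, so hand-crafted frequencies win.
    This precedence is an assumption, not something the README specifies.
    """
    merged = {**dictionary, **new_words, **hand_crafted}
    return {word: freq for word, freq in merged.items() if word not in blacklist}


dictionary = parse_frequency_lines(["apple 10", "pie 3"])
new_words = parse_frequency_lines(["applepie 2"])
hand_crafted = parse_frequency_lines(["apple 99"])
vocabulary = build_vocabulary(dictionary, new_words, hand_crafted, {"pie"})
```

Here `vocabulary` keeps "apple" with the hand-crafted frequency and drops the blacklisted "pie".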

Output Data:

segmented post key text:

<post-id>\t<segmented-text>

Program:

Parsing:
- Merge topic and subtopic into post-key-text
New Word Detection:
- input:
  - a text file in which characters are separated by spaces.
- usage:
  - set the "corpus_file" argument in "word_discovery.py" to the input text file.
Word Segmentation:
- usage:
  - input:
    - 1. post key text
    - 2. dictionary
  - segmentation runs in batches (to avoid running out of memory)
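
Batch-wise segmentation, as described above, could look roughly like the sketch below. The README does not say which segmenter the project uses, so a greedy forward maximum-matching segmenter stands in as a placeholder. All function names, the `batch_size` default, the `max_word_len` limit, and the choice of a space as the token separator are assumptions.

```python
from itertools import islice


def read_batches(lines, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def max_match_segment(text, vocabulary, max_word_len=4):
    """Greedy forward maximum matching (placeholder for the real segmenter)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += length
                break
    return tokens


def segment_lines(lines, vocabulary, batch_size=1000):
    """Segment tab-separated '<post-id> <text>' lines one batch at a time."""
    for batch in read_batches(lines, batch_size):
        for line in batch:
            post_id, _, text = line.rstrip("\n").partition("\t")
            segmented = " ".join(max_match_segment(text, vocabulary))
            yield f"{post_id}\t{segmented}"
```

Passing an open file handle as `lines` keeps only one batch of posts in memory at a time, which is the point of batching.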

Keyword Extraction:

Pipeline:

Input Data:

segmented post key text:
```
<post-id>\t<text>
```
segmented post key text (jsonl):
```
<metadata to index>
```
segmented post key text index:
```
<binary index file>
```
candidate keywords:
```
<keyword>
```
hand-crafted keywords:
```
<keyword>
```
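
The jsonl form of the segmented post key text can be produced from the tab-separated form with a short script. This is a sketch only: the README does not specify the metadata fields, so the field names `id` and `contents` are placeholders, and the function names are hypothetical.

```python
import json


def tsv_line_to_json(line):
    """Turn one tab-separated '<post-id> <text>' line into a JSON string."""
    post_id, _, text = line.rstrip("\n").partition("\t")
    # Field names are assumptions; use whatever the indexing program expects.
    return json.dumps({"id": post_id, "contents": text}, ensure_ascii=False)


def convert_to_jsonl(tsv_path, jsonl_path):
    """Write one JSON object per non-empty input line."""
    with open(tsv_path, encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(tsv_line_to_json(line) + "\n")
```

`ensure_ascii=False` keeps non-ASCII text readable in the output file instead of escaping it.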

Output Data:

keywords for each post:

{
    <docid>: 
        {<keywords>: <score>}
}

Program:

Convert to jsonl:
- Convert the segmented post key text to jsonl so that the indexing program can index it.
Indexing:
- Build a binary index file, which is used to calculate the BM25 score.
Candidate Word Retrieval:
- Generate candidate keywords using POS tagging and the BM25 score.
Keyword Extraction and Score Calculation:
- Extract keywords from each post using the candidate keywords and the BM25 score.
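
To make the scoring step concrete, the sketch below computes per-term BM25 weights in plain Python and keeps the best candidate keywords per post. It is an illustration under assumptions, not the project's implementation: the README reads scores from a binary index and does not state the BM25 variant, so the common Okapi form with `k1=1.2` and `b=0.75` is used here, and the function names and `top_k` parameter are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(docs, k1=1.2, b=0.75):
    """Return {docid: {term: score}} using Okapi BM25 term weights.

    docs maps docid -> list of tokens (an already segmented, non-empty post).
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of posts that contain each term.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

    scores = {}
    for docid, tokens in docs.items():
        term_freq = Counter(tokens)
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        scores[docid] = {}
        for term, tf in term_freq.items():
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            scores[docid][term] = idf * tf * (k1 + 1) / (tf + length_norm)
    return scores


def extract_keywords(scores, candidates, top_k=5):
    """Keep only candidate keywords and the top_k highest scores per post."""
    result = {}
    for docid, term_scores in scores.items():
        kept = {t: s for t, s in term_scores.items() if t in candidates}
        top = sorted(kept.items(), key=lambda item: item[1], reverse=True)[:top_k]
        result[docid] = dict(top)
    return result
```

The return value of `extract_keywords` has the same shape as the "keywords for each post" output: a mapping from docid to a mapping from keyword to score.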

TODO: Word-to-word relation:

TODO: Post-to-Post Recommendation (Inverted Index):

Input Data:

keywords for each post:

{
    <docid>: 
        [
            {<keyword_1>: <score>},
            {<keyword_2>: <score>}, ...
        ]
}

keyword-to-post:

{
    <keyword>: [docid_1, docid_2, ...]
}

keyword-to-keyword:

{
    <keyword>: 
    [
        {<keyword_1>: <score>}, 
        {<keyword_2>: <score>}, ...
    ]
}
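
Since this part is still marked TODO, the following is only one possible way to use the structures above. It inverts "keywords for each post" into the keyword-to-post index and ranks related posts by the summed score of shared keywords. The ranking rule, the function names, and the `top_k` parameter are all assumptions.

```python
from collections import defaultdict


def build_keyword_to_post(post_keywords):
    """Invert {docid: [{keyword: score}, ...]} into {keyword: [docid, ...]}."""
    index = defaultdict(list)
    for docid, entries in post_keywords.items():
        for entry in entries:
            for keyword in entry:
                index[keyword].append(docid)
    return dict(index)


def recommend_posts(docid, post_keywords, keyword_to_post, top_k=3):
    """Rank other posts by the summed score of keywords shared with docid."""
    own = {k: s for entry in post_keywords[docid] for k, s in entry.items()}
    totals = defaultdict(float)
    for keyword, score in own.items():
        # The inverted index gives every post that mentions this keyword.
        for other in keyword_to_post.get(keyword, []):
            if other != docid:
                totals[other] += score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [other for other, _ in ranked[:top_k]]


post_keywords = {
    "p1": [{"tea": 2.0}, {"hk": 1.0}],
    "p2": [{"tea": 1.5}],
    "p3": [{"hk": 0.5}, {"food": 1.0}],
}
keyword_to_post = build_keyword_to_post(post_keywords)
related = recommend_posts("p1", post_keywords, keyword_to_post)
```

The inverted index means only posts sharing at least one keyword are visited, rather than comparing every pair of posts.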

TODO:

run-script sample:
Ex:

python segment.py --arguments
sample file:
- 3 to 4 lines
