MOD Log Survey
===
Key points:
1. Log parameter classification.
2. Study how log-specific vocabulary differs from the wording of general articles.
3. Influence of word abbreviations.
4. Log level classification and instance mapping (e.g. 140.114.213.70 <-> elk.es.ntu.edu.tw).
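
Key points 1 and 4 can be illustrated with a minimal sketch. The regular expressions, the `<IP>`/`<DIR>` placeholder format, and the function names below are assumptions made for illustration; only the IP/host pair comes from the example above.

```python
import re

# Hypothetical patterns; they are applied in order, IP first, then DIR.
PARAM_PATTERNS = [
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    # An absolute path that starts at a whitespace boundary.
    ("DIR", re.compile(r"(?<!\S)/(?:[\w.-]+/)*[\w.-]*")),
]

# Instance mapping; the single entry is the example pair from the key points.
INSTANCE_MAP = {"140.114.213.70": "elk.es.ntu.edu.tw"}


def classify_parameters(line: str) -> str:
    """Replace each recognised parameter with its type label."""
    for label, pattern in PARAM_PATTERNS:
        line = pattern.sub(f"<{label}>", line)
    return line


def map_instances(line: str) -> str:
    """Rewrite known IP addresses to their host names."""
    for ip, host in INSTANCE_MAP.items():
        line = line.replace(ip, host)
    return line


print(classify_parameters("read /var/data/ from 140.114.175.38"))
print(map_instances("connect to 140.114.213.70"))
```

Replacing parameters with type labels keeps the word pool small, while instance mapping lets lines that refer to the same machine by IP or by host name be treated alike.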
Features of log files:
1. Short sentences holding dense information
2. Small word pool
3. Sentences carry a timestamp
4. Scheduled messages
Potential issues:
1. Abbreviations
Testing:
1. Extract the important words from the log file
2. Extract values
3. Overlapping / sentence structure
Input files:
Processing flow:
1. Log Parameter Classification:
(140.114.175.38 -> IP, /var/data/ -> DIR)
2. Word abbreviations:
|Origin |Abbreviation|
|-------|------------|
|Central Processing Unit|CPU |
|Memory |Mem |
3. Use the log level information to remove useless log entries
4. Label:
Probe-based label: use probes with specific information for classification (CPU load, network bandwidth, ...)
Type-based label: use general system information for classification (network, file system, ...)
5. Classification:
Baseline: bag-of-words with one-hot encoding
Embedding: use an embedding-based method
6. Combine the classification results and extract values
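
Steps 2, 3, and 5 (baseline) of the flow can be sketched as follows. This is only an illustration under stated assumptions: the dropped levels, the sample records, and all function names are hypothetical, and the "one-hot" baseline is interpreted here as a binary presence vector over the vocabulary.

```python
import re

# Abbreviation table from step 2, keyed by lower-cased abbreviation.
ABBREVIATIONS = {"cpu": "central processing unit", "mem": "memory"}

# Hypothetical set of levels treated as useless in step 3.
DROPPED_LEVELS = {"DEBUG", "TRACE"}


def tokenize(message):
    """Lower-case a message, split it into words, and expand abbreviations."""
    tokens = []
    for word in re.findall(r"[a-z]+", message.lower()):
        tokens.extend(ABBREVIATIONS.get(word, word).split())
    return tokens


def keep(level):
    """Return True when a record's level is not in the dropped set."""
    return level.upper() not in DROPPED_LEVELS


def build_vocabulary(messages):
    """Assign each distinct token a column index, in first-seen order."""
    vocab = {}
    for message in messages:
        for token in tokenize(message):
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(message, vocab):
    """Binary vector: 1 where a vocabulary word occurs in the message."""
    vector = [0] * len(vocab)
    for token in tokenize(message):
        if token in vocab:
            vector[vocab[token]] = 1
    return vector


# Hypothetical (level, message) records.
records = [
    ("INFO", "CPU load high"),
    ("DEBUG", "cache probe tick"),
    ("WARN", "Mem usage high"),
]
messages = [msg for level, msg in records if keep(level)]
vocab = build_vocabulary(messages)
vectors = [bag_of_words(m, vocab) for m in messages]
print(vocab)
print(vectors)
```

Expanding abbreviations before building the vocabulary means that "CPU" and its long form share the same columns, which matters when the word pool is small. The resulting vectors could be fed to any classifier for the probe-based or type-based labels.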
___
## Process Survey
1. Web log, preprocessing:<br>https://airccj.org/CSCP/vol1/cscp0101.pdf
2. Event log, batch, production stage:<br>https://www.researchgate.net/profile/Niels_Martin/publication/287198292_Batch_processing_definition_and_event_log_identification/links/567291bd08aeb8b21c70c44f/Batch-processing-definition-and-event-log-identification.pdf
3. LogCluster
4. Drain