# 1. Weekly Report (2020/12/21 - 2020/12/25)
### Main Target
Cluster the *OpenStack* dataset with unsupervised methods.
1. Analyze the keywords in log sentences and use them as features.
2. Decide, based on the clustering performance, whether to use ML to enhance the results.
3. Compare different clustering and feature selection methods.
### Dataset analysis
OpenStack:
A total of 2000 log sentences, composed as follows:
- **nova-compute.log:**
Startup and execution of virtual machine instances.
- **nova-api.log:**
User interaction with OpenStack and interaction between OpenStack components.
- **nova-scheduler.log:**
Scheduling, assigning tasks to subsystems and message queues.
### Data Preprocessing
- **Remove punctuation and stop words**
- **Word stemming:**
> Ex: removing and removal => remov
- **Replace specific terms:**
> 0673dd71-34c5-4fbb-86c4-40623fbe45b4 => [InstanceID]
> 10.11.1.1 => [IP]
> /var/lib/nova/instances/base/ => [DIR]
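
The preprocessing steps above can be sketched with the standard library alone. This is a minimal illustration, not the code used for the report: the regular expressions, the order in which they are applied, the tiny stop-word list, and the function name `preprocess` are all assumptions. The `[NUM]` rule is inferred from the `[NUM]` token that appears in the word counts later in this report. Stemming is left out; a stemmer such as NLTK's `PorterStemmer` could be applied to each remaining word.

```python
import re
import string

# Masking rules, applied in order. All patterns are illustrative assumptions.
MASKS = [
    # Absolute paths that start a whitespace-separated token.
    (re.compile(r"(?<!\S)(?:/[\w.\-]+)+/?"), "[DIR]"),
    # UUID-style instance identifiers.
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"),
     "[InstanceID]"),
    # Dotted IPv4 addresses.
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "[IP]"),
    # Standalone numbers; digits glued to words, slashes or dots are skipped,
    # so a token such as HTTP/1.1 is left intact.
    (re.compile(r"(?<![\w/.\-])\d+(?:\.\d+)?(?![\w/\-])"), "[NUM]"),
]
PLACEHOLDER = re.compile(r"\[(?:DIR|InstanceID|IP|NUM)\]")
STOP_WORDS = {"a", "an", "the", "from", "to", "of", "in", "on", "is"}  # hypothetical list


def preprocess(line):
    """Mask specific terms, then drop punctuation and stop words."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    words = []
    for raw in line.split():
        found = PLACEHOLDER.search(raw)
        if found:
            words.append(found.group())  # keep the placeholder as-is
            continue
        word = raw.strip(string.punctuation).lower()
        if word and word not in STOP_WORDS:
            words.append(word)
    return words


sample = ("Instance 0673dd71-34c5-4fbb-86c4-40623fbe45b4 from 10.11.1.1 "
          "uses /var/lib/nova/instances/base/ status: 200")
print(preprocess(sample))
```

Paths are masked before instance IDs so that a path containing an ID collapses into a single `[DIR]` token; a different order would be equally defensible.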
### K-Means
#### How to choose K (after removing duplicate log sentences)
- **SSE (Sum of Squared Errors)**
Sum the squared distance between each data point and the center of the cluster it belongs to. The lower the value, the tighter the clusters. Because SSE generally keeps decreasing as K grows, look for the K where the decrease levels off (the elbow) rather than for the minimum.

- **Silhouette Coefficient**
It measures both how similar each point is to its own cluster (intra-cluster) and how dissimilar it is to the other clusters (inter-cluster). The value ranges from -1 to 1; the higher the value, the more compact and better separated the clusters are. It is usually used together with SSE to select the best K.

- **Calinski-Harabasz Coefficient**
Uses the covariance matrices to calculate the ratio of inter-cluster dispersion to intra-cluster dispersion. The higher the value, the better the clusters are separated. It is much faster to compute than the Silhouette Coefficient.
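
The three criteria can be written out in a few lines of plain Python. The following is a sketch for illustration only, assuming Euclidean distance and points given as tuples; the helper names (`group`, `centroid`, `sq_dist`) and the toy points are made up. It scores two labelings of the same four points; to choose K, the same functions would be evaluated on the K-Means labels for each candidate K. scikit-learn offers equivalent values through `KMeans.inertia_`, `silhouette_score` and `calinski_harabasz_score`.

```python
import math


def group(points, labels):
    """Collect the points of each cluster label."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(sq_dist(p, centroid(members))
               for members in group(points, labels).values()
               for p in members)


def silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points."""
    clusters = group(points, labels)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def calinski_harabasz(points, labels):
    """(between-cluster dispersion / (k - 1)) / (within-cluster dispersion / (n - k))."""
    clusters = group(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(len(m) * sq_dist(centroid(m), overall) for m in clusters.values())
    return (between / (k - 1)) / (sse(points, labels) / (n - k))


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = [0, 0, 1, 1]  # groups the points that are close together
bad = [0, 1, 0, 1]   # groups the points that are far apart

for name, labels in (("good", good), ("bad", bad)):
    print(name, sse(points, labels), silhouette(points, labels),
          calinski_harabasz(points, labels))
```

On this toy data the good labeling has the lower SSE and the higher Silhouette and Calinski-Harabasz values, which is the direction each criterion is read in when comparing K.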

### Experiment setting and result
1. Comparison of the number of clusters (K)
**k = 3**

**k = 5**

2. Comparison of different vectorizers (total number of words: 101)
Count vectorizer: number of occurrences of each word
- **[NUM]** --- 3344
- **[DIR]** --- 1338
- **[IP]** --- 1225
- **time** --- 1017
- **statu** --- 1017
- **len** --- 1017
- **http/1.1** --- 1017
- **get** --- 931
- **instanc** --- 888
- **[Instance ID]** --- 743
- **file** --- 193
- **imag** --- 185
- **node** --- 179
- **base** --- 172
- **thi** --- 164
- **event** --- 152
- **use** --- 145
- **the** --- 136
- **vcpu** --- 135
- **lifecycl** --- 109
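
A count-based ranking like the one above can be produced with a plain word counter. This is a minimal sketch, assuming each log sentence has already been preprocessed into a list of tokens; the function name `word_counts` and the sample sentences are made up for illustration and do not come from the dataset.

```python
from collections import Counter


def word_counts(docs):
    """Count how often each token occurs across all log sentences."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


# Hypothetical preprocessed log sentences.
docs = [
    ["instanc", "[IP]", "get"],
    ["get", "[IP]", "[IP]"],
]
for word, count in word_counts(docs).most_common():
    print(word, "---", count)
```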

TF-IDF vectorizer: TF-IDF weight of each word
- **re-creat** --- 6.809642865355678
- **match** --- 6.809642865355678
- **it** --- 6.809642865355678
- **instancelist** --- 6.809642865355678
- **did** --- 6.809642865355678
- **sync** --- 6.521960792903897
- **host** --- 6.404177757247514
- **young** --- 6.2988172415896875
- **too** --- 6.2988172415896875
- **view** --- 5.828813612343952
- **used_vcpu** --- 5.828813612343952
- **used_ram** --- 5.828813612343952
- **used_disk** --- 5.828813612343952
- **usabl** --- 5.828813612343952
- **updat** --- 5.828813612343952
- **total_vcpu** --- 5.828813612343952
- **record** --- 5.828813612343952
- **phys_ram** --- 5.828813612343952
- **phys_disk** --- 5.828813612343952
- **pci_stat** --- 5.828813612343952
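
The values listed above are numerically consistent with a smoothed inverse document frequency, `ln((1 + N) / (1 + df)) + 1`, taking N = 2000 log sentences: a word that occurs in 5 sentences gives 6.8096 and one that occurs in 15 gives 5.8288. This is the default IDF formula of scikit-learn's `TfidfVectorizer` (`smooth_idf=True`), but whether the report used that class, and therefore this reading of the numbers, is an assumption. The sketch below only illustrates the formula; `smoothed_idf` and `idf_table` are hypothetical helper names.

```python
import math
from collections import Counter


def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def idf_table(docs):
    """Map each token to its smoothed IDF over a list of token lists."""
    doc_freq = Counter(token for tokens in docs for token in set(tokens))
    return {token: smoothed_idf(len(docs), df) for token, df in doc_freq.items()}


# Document frequencies assumed here only to compare against the listed values.
print(smoothed_idf(2000, 5))
print(smoothed_idf(2000, 15))
```

Under this reading, a high value marks a word that appears in few log sentences, which is why the ranking differs so much from the count-based one.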
