# NTU Machine Learning Homework 1
###### tags: `NTU_ML` `Machine Learning`
## How to choose features of data
* After visualizing the training data, you can see how each feature relates to PM2.5.
* For instance, the CO, NO, NO2, and NOx plots show a much stronger correlation with PM2.5 than the other features do.

* I also chose PM10, WS_HR, RAINFALL, RH, WIND_SPEED, and PM2.5 itself; the plots are [here](/HW1/Programming/train_data_img/).
* I implemented Z-score normalization in my project.
* The table below compares the results with and without normalization under the same config.
| Epoch | Regression | LR | Feats | Batch Size | Loss Fn. | Opti. | RMSE | Data Filter | Norm. Data |
| :------: | :------: | -------- | :------: | :------: | :------: | :------: | :------: | :------: | :------: |
| 200 | 1st-order| 0.015 | [1-4, 6-9, 13, 14] | 1024 | MSE | Adam | 2.44623 | Yes|Yes|
| 200 | 1st-order| 0.015 | [1-4, 6-9, 13, 14] | 1024 | MSE | Adam | 2.44623 | Yes|No|
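
As a rough illustration of Z-score normalization, here is a minimal sketch. It is not the project's actual code: the helper names `zscore_fit`, `zscore_apply`, and `zscore_invert` and the toy array are assumptions made for this example.

```python
import numpy as np

def zscore_fit(x, eps=1e-8):
    """Return the per-feature mean and std of a 2-D array (samples x features)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    # Guard against constant columns, whose standard deviation is zero.
    std = np.where(std < eps, 1.0, std)
    return mean, std

def zscore_apply(x, mean, std):
    """Rescale x so each feature has zero mean and unit variance."""
    return (x - mean) / std

def zscore_invert(z, mean, std):
    """Map normalized values back to the original scale."""
    return z * std + mean

# Toy data for illustration only, not taken from the homework dataset.
train_x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
mean, std = zscore_fit(train_x)
train_z = zscore_apply(train_x, mean, std)
restored = zscore_invert(train_z, mean, std)
```

The statistics are computed once on the training data and reused for any test data. If the PM2.5 target is normalized as well, the predictions have to be mapped back with the target's own mean and std before computing RMSE; otherwise the scores with and without normalization are not on the same scale.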
## Hyperparameters and Preprocessing
* All my test configs can be found in `Training Result.xlsx`.
* I used a filter to keep only valid data, setting the thresholds by inspecting the visualized figures of all features.
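
A threshold filter of this kind could look like the sketch below. The function name `filter_valid_rows`, the two-column toy data, and the bounds are all hypothetical; the real thresholds were chosen from the figures and are not reproduced here.

```python
import numpy as np

def filter_valid_rows(data, lower, upper):
    """Keep rows in which every feature lies within its [lower, upper] bound."""
    mask = np.all((data >= lower) & (data <= upper), axis=1)
    return data[mask], mask

# Toy data with two feature columns; the bounds are placeholders.
data = np.array([
    [12.0, 0.4],
    [-1.0, 0.5],   # first feature below its lower bound
    [30.0, 9.9],   # second feature above its upper bound
    [18.0, 0.3],
])
lower = np.array([0.0, 0.0])
upper = np.array([100.0, 5.0])

clean, mask = filter_valid_rows(data, lower, upper)
```

Returning the boolean mask alongside the filtered rows makes it possible to drop the matching PM2.5 targets with the same mask.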
## My takeaway
* **(Solved->See the last paragraph)** Normalization did not behave as I expected. In theory, normalization rescales all features to a similar range so that the model converges much faster. In this case, however, the result is worse and some predicted PM2.5 values are negative. According to [this page](https://blog.csdn.net/u010947534/article/details/86632819?spm=1001.2014.3001.5506), this normalization method may not be suitable for my case.
* **(Solved->See the last paragraph)** I also found that reusing the weights and bias stored from my pretrained model did not work well. I used pickle to dump the parameters during training and loaded the best set as my pretrained parameters, but the result was still not good enough.
* In this project, the better way to improve accuracy is to tune the training config and select good features.
* After discussing with my friend, I found the cause and solved it by fixing the NumPy random seed. The parameters are now truly reproducible, but normalization **still does not help** the model converge.
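
The seed fix and the pickle round trip can be sketched as follows. This assumes the weights are drawn from NumPy's global random generator; the function `init_params` and the dictionary keys are hypothetical, and the in-memory buffer stands in for a file on disk.

```python
import io
import pickle

import numpy as np

def init_params(n_features, seed=0):
    """Draw initial weights after fixing NumPy's global seed."""
    np.random.seed(seed)
    weights = np.random.randn(n_features)
    bias = 0.0
    return weights, bias

# With the same seed, two runs start from identical parameters.
w1, b1 = init_params(10, seed=42)
w2, b2 = init_params(10, seed=42)

# Store and reload the parameters with pickle (in memory for this sketch).
buffer = io.BytesIO()
pickle.dump({"weights": w1, "bias": b1}, buffer)
buffer.seek(0)
restored_params = pickle.load(buffer)
```

Without a fixed seed, each run starts from different weights, so stored parameters and fresh runs cannot be compared directly.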
## Update
* 2022/12/06 update - Following [Correlation (相關)](https://www.youtube.com/watch?v=z-21v0EoFh4&ab_channel=CUSTCourses), taught by Dr. 李柏堅, I used the `Pearson Correlation` coefficient to compute the correlation between each factor and PM2.5; the results are shown below. According to the [video](https://www.youtube.com/watch?v=z-21v0EoFh4&ab_channel=CUSTCourses), `|r| < 0.4` is low correlation, `0.4 ≤ |r| < 0.7` is medium correlation, and `0.7 ≤ |r| < 1` is high correlation. So I take the factors **<font color=#FF0000>**`CO`, `NO`, `NO2`, `NOx`, `PM10`, and `SO2`**</font>** as the input data for this regression problem. By these thresholds, `PM10` is high and `CO`, `NO2`, and `NOx` are medium, while `NO` and `SO2` fall in the low band.
|Factor|AMB_TEMP|CO|NO|NO2| NOx|O3| PM10| WS_HR| RAINFALL| RH| SO2| WD_HR| WIND_DIREC|WIND_SPEED|
| :----: | :------------: | :-----------: | :-----------: | :-----------: | :----------: | :-----------: | :-----------: | :------------: | :------------: | :------------: | :-----------: | :-----------: | :-----------: | :-----------: |
| r | -0.176147465 | 0.659147668 | 0.227219147 | 0.554273687 | 0.51365014 | 0.233923944 | 0.818868214 | -0.102047405 | -0.060801221 | -0.081576429 | 0.361333416 | 0.171932397 | 0.137658351 | -0.10119696 |
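
For reference, the Pearson coefficient and the bands from the video can be sketched as below. The names `pearson_r` and `correlation_band` are assumptions for this example, and the dictionary holds only a few `r` values copied from the table above.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def correlation_band(r):
    """Classify |r| with the thresholds quoted from the video."""
    magnitude = abs(r)
    if magnitude < 0.4:
        return "low"
    if magnitude < 0.7:
        return "medium"
    return "high"

# A few r values copied from the table above.
table_r = {
    "CO": 0.659147668,
    "NO": 0.227219147,
    "PM10": 0.818868214,
    "SO2": 0.361333416,
}
bands = {name: correlation_band(r) for name, r in table_r.items()}
```

`np.corrcoef(x, y)[0, 1]` gives the same coefficient and can serve as a cross-check for the hand-written version.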