# ASHRAE - Great Energy Predictor III
## Data preprocessing
* [Timestamp alignment](https://www.kaggle.com/nz0722/aligned-timestamp-lgbm-by-meter-type)
Thanks to **NZ** for the great kernel; its explanation is very clear.
* **Interpolation**
First, I filled the NA values in the weather data. However, some timestamps for certain site_ids are missing from the weather data, so NAs appear again after left-joining train.csv with the weather data. Therefore, I interpolated once more to fill these NAs after the join (a minimal sketch follows this list).
* **find_bad_rows()**
A very useful function from **Robert Stockton**'s [public kernel](https://www.kaggle.com/purist1024/ashrae-simple-data-cleanup-lb-1-08-no-leaks) for removing bad rows; thanks for sharing it. It gave my model an impressive improvement, from 1.07x to 1.06x.
* Correct the units of the electricity meter readings for site 0 (conversion sketch after this list).
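For the interpolation step, here is a minimal sketch of the idea, assuming the standard competition files (train.csv, weather_train.csv, building_metadata.csv) and the four weather columns named above; it is not the exact cleanup code.

```python
import pandas as pd

train = pd.read_csv("train.csv", parse_dates=["timestamp"])
weather = pd.read_csv("weather_train.csv", parse_dates=["timestamp"])
meta = pd.read_csv("building_metadata.csv")

weather_cols = ["air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr"]

# First pass: fill gaps inside the weather table itself, per site, ordered by time.
weather = weather.sort_values(["site_id", "timestamp"])
weather[weather_cols] = weather.groupby("site_id")[weather_cols].transform(
    lambda s: s.interpolate(limit_direction="both")
)

# Left join through building_metadata (building_id -> site_id); timestamps missing
# from the weather table still come out as NaN after this merge.
train = train.merge(meta[["building_id", "site_id"]], on="building_id", how="left")
train = train.merge(weather, on=["site_id", "timestamp"], how="left")

# Second pass: interpolate again on the joined frame to fill the remaining NaNs.
train = train.sort_values(["site_id", "timestamp"])
train[weather_cols] = train.groupby("site_id")[weather_cols].transform(
    lambda s: s.interpolate(limit_direction="both")
)
```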
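For the site 0 unit fix: the electricity meter readings at site 0 are reported in kBTU rather than kWh. Continuing from the merged `train` frame in the sketch above (column names are the competition's), the conversion looks roughly like this:

```python
# Site 0 electricity (meter == 0) is recorded in kBTU; convert to kWh (1 kBTU ~= 0.2931 kWh).
site0_elec = (train["site_id"] == 0) & (train["meter"] == 0)
train.loc[site0_elec, "meter_reading"] *= 0.2931

# If the model is trained on the converted values, apply the inverse conversion
# (divide by 0.2931) to the site 0 predictions before submitting, since the
# target is scored in the original units.
```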
## Feature engineering
In this section I present the different features I tried.
* **Humidity**
From the air temperature and dew point temperature we can derive the relative humidity (formula sketch after this list).
* **[meteocalc](https://github.com/malexer/meteocalc)**(not used)
Using meteocalc, we can derive additional meteorological variables such as **Dew Point**, **Heat Index**, **Wind Chill**, and **Feels Like temperature**. However, they did not improve my models' scores (usage sketch after this list).
* **Lag feature**(not used)
I tried lag features with a window of 3 or 7 on air_temperature, cloud_coverage, dew_temperature, and precip_depth_1_hr to get the min, max, std, and mean, as in the public notebooks, but they improved neither my CV score nor the LB (rolling-window sketch after this list).
* **Grouping feature**
Group the weather data (air_temperature, cloud_coverage, dew_temperature, precip_depth_1_hr) to get the daily average values for each site (sketch after this list).
Using the grouping features gave me a 0.01 improvement on the LB (1.08 to 1.07).
* **Combination of air temperature and dew temperature**
Because these two features have high importance values in LightGBM, I combined them by calculating the difference between them.
* **Time feature**(not used)
Sine and cosine transformations of the time data, but they did not improve my model (encoding sketch after this list).
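For the humidity feature: the write-up does not give the exact formula, so this is a minimal sketch using one common approximation (the Magnus formula) to estimate relative humidity from air temperature and dew point.

```python
import numpy as np

def relative_humidity(air_temp_c, dew_temp_c):
    """Approximate relative humidity (%) from air temperature and dew point (degC)
    using the Magnus formula with a = 17.625, b = 243.04 degC."""
    a, b = 17.625, 243.04
    actual_vp = np.exp(a * dew_temp_c / (b + dew_temp_c))   # vapour pressure term at dew point
    sat_vp = np.exp(a * air_temp_c / (b + air_temp_c))      # saturation vapour pressure term
    return 100.0 * actual_vp / sat_vp

train["humidity"] = relative_humidity(train["air_temperature"], train["dew_temperature"])
```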
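For the (unused) meteocalc features, a usage sketch based on the library's README. `feels_like` expects wind speed in mph, so the conversion from the m/s `wind_speed` column is an assumption, as is the `humidity` column from the sketch above.

```python
from meteocalc import Temp, feels_like

def feels_like_c(air_temp_c, humidity_pct, wind_speed_ms):
    # Temp wraps a temperature together with its unit; feels_like blends heat index
    # and wind chill depending on conditions and returns another Temp object.
    t = Temp(air_temp_c, "c")
    return feels_like(temperature=t, humidity=humidity_pct,
                      wind_speed=wind_speed_ms * 2.237).c  # m/s -> mph

train["feels_like"] = [
    feels_like_c(t, h, w)
    for t, h, w in zip(train["air_temperature"], train["humidity"], train["wind_speed"])
]
```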
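For the (unused) lag features, a rolling-window sketch over the weather table from the preprocessing sketch, assuming it is sorted by site_id and timestamp; window size and aggregations follow the description above.

```python
weather_cols = ["air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr"]

def add_lag_features(weather, window=3):
    """Add rolling min/max/mean/std of each weather variable, computed per site."""
    rolled = weather.groupby("site_id")[weather_cols].rolling(window=window, min_periods=1)
    for agg in ("min", "max", "mean", "std"):
        stats = getattr(rolled, agg)().reset_index(level=0, drop=True)  # back to row index
        for col in weather_cols:
            weather[f"{col}_{agg}_lag{window}"] = stats[col]
    return weather

weather = add_lag_features(weather, window=3)
```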
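For the grouping features, a sketch of the per-site daily averages merged back onto the training frame, continuing from the `weather` and `train` frames above.

```python
weather_cols = ["air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr"]

# Average each weather variable per (site_id, calendar day), then join onto train.
weather["date"] = weather["timestamp"].dt.date
daily_mean = (
    weather.groupby(["site_id", "date"])[weather_cols]
    .mean()
    .rename(columns=lambda c: f"{c}_day_mean")
    .reset_index()
)
train["date"] = train["timestamp"].dt.date
train = train.merge(daily_mean, on=["site_id", "date"], how="left")
```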
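For the (unused) time features, a sketch of the cyclic sine/cosine encoding applied to, for example, the hour of day.

```python
import numpy as np

# Encode the hour of day on a circle so that 23:00 and 00:00 end up close together.
hour = train["timestamp"].dt.hour
train["hour_sin"] = np.sin(2 * np.pi * hour / 24)
train["hour_cos"] = np.cos(2 * np.pi * hour / 24)
```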
## CV
1. Time-based CV splits work well, e.g. first half of the year vs. second half of the year.
2. StratifiedKFold on building_id also performs well.
Using different fold-splitting schemes produces different models (higher model diversity); a sketch of both schemes is shown below.
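A minimal sketch of the two splitting schemes, assuming a `train` frame with `timestamp` and `building_id` columns; the number of folds and the random seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Scheme 1: time-based split, first half of the year for training, second half for validation.
first_half = train["timestamp"].dt.month <= 6
train_idx_time = np.where(first_half)[0]
valid_idx_time = np.where(~first_half)[0]

# Scheme 2: StratifiedKFold with building_id as the stratification label, so every
# building contributes rows to both the training and validation parts of each fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(train, train["building_id"])):
    print(f"fold {fold}: {len(train_idx)} train rows, {len(valid_idx)} valid rows")
```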
## Ensemble
1. Finally, I used the leaked data as a validation set to find the best ensemble weights, which gave a very good result (weight-search sketch below).
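A sketch of the weight search, assuming `preds` is an (n_models, n_rows) array of each model's predictions on the leaked rows and `y_leak` holds the corresponding true meter readings; the RMSLE objective and the Nelder-Mead optimizer are assumptions, not the exact method used.

```python
import numpy as np
from scipy.optimize import minimize

def rmsle(y_true, y_pred):
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

def best_ensemble_weights(preds, y_leak):
    """Search for non-negative blending weights (summing to 1) that minimize
    RMSLE of the blended prediction on the leaked validation rows."""
    n_models = preds.shape[0]

    def loss(raw):
        w = np.abs(raw)
        w = w / w.sum()          # keep the weights non-negative and normalized
        return rmsle(y_leak, w @ preds)

    result = minimize(loss, x0=np.full(n_models, 1.0 / n_models), method="Nelder-Mead")
    w = np.abs(result.x)
    return w / w.sum()

# Usage: weights = best_ensemble_weights(preds, y_leak); blended = weights @ preds
```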