# ASHRAE - Great Energy Predictor III
## Data preprocessing
* [Timestamp alignment](https://www.kaggle.com/nz0722/aligned-timestamp-lgbm-by-meter-type)
Thanks to **NZ** for the great kernel; its explanation is very clear.
* **Interpolation**
First, I filled the NA values in the weather data. However, some timestamps for certain site_ids are missing from the weather data, so NAs appear again after left-joining train.csv with the weather data. Therefore, I interpolated once more to fill these NAs after the join (a minimal sketch follows this list).
* **find_bad_rows()**
A very useful function from **Robert Stockton**'s [public kernel](https://www.kaggle.com/purist1024/ashrae-simple-data-cleanup-lb-1-08-no-leaks) for removing bad rows; thanks for sharing it. It gave my model an impressive improvement, from 1.07x to 1.06x.
* Correct the units of the electricity meter readings for site 0 (conversion sketch after this list).
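For the interpolation step, here is a minimal sketch of the idea, assuming the standard competition files (train.csv, weather_train.csv, building_metadata.csv) and the four weather columns named above; it is not the exact cleanup code.

```python
import pandas as pd

train = pd.read_csv("train.csv", parse_dates=["timestamp"])
weather = pd.read_csv("weather_train.csv", parse_dates=["timestamp"])
meta = pd.read_csv("building_metadata.csv")

weather_cols = ["air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr"]

# First pass: fill gaps inside the weather table itself, per site, ordered by time.
weather = weather.sort_values(["site_id", "timestamp"])
weather[weather_cols] = weather.groupby("site_id")[weather_cols].transform(
    lambda s: s.interpolate(limit_direction="both")
)

# Left join through building_metadata (building_id -> site_id); timestamps missing
# from the weather table still come out as NaN after this merge.
train = train.merge(meta[["building_id", "site_id"]], on="building_id", how="left")
train = train.merge(weather, on=["site_id", "timestamp"], how="left")

# Second pass: interpolate again on the joined frame to fill the remaining NaNs.
train = train.sort_values(["site_id", "timestamp"])
train[weather_cols] = train.groupby("site_id")[weather_cols].transform(
    lambda s: s.interpolate(limit_direction="both")
)
```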
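For the site 0 unit fix: the electricity meter readings at site 0 are reported in kBTU rather than kWh. Continuing from the merged `train` frame in the sketch above (column names are the competition's), the conversion looks roughly like this:

```python
# Site 0 electricity (meter == 0) is recorded in kBTU; convert to kWh (1 kBTU ~= 0.2931 kWh).
site0_elec = (train["site_id"] == 0) & (train["meter"] == 0)
train.loc[site0_elec, "meter_reading"] *= 0.2931

# If the model is trained on the converted values, apply the inverse conversion
# (divide by 0.2931) to the site 0 predictions before submitting, since the
# target is scored in the original units.
```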
## Feature engineering
In this section I present the different features I tried.
* **Humidity**
From the air temperature and dew point temperature we can derive the relative humidity (formula sketch after this list).
* **[meteocalc](https://github.com/malexer/meteocalc)**(not used)
Using meteocalc, we can derive additional meteorological variables such as **Dew Point**, **Heat Index**, **Wind Chill**, and **Feels Like temperature**. However, they did not improve my models' scores (usage sketch after this list).
* **Lag feature**(not used)
I tried lag features with a window of 3 or 7 on air_temperature, cloud_coverage, dew_temperature, and precip_depth_1_hr to get the min, max, std, and mean, as in the public notebooks, but they improved neither my CV score nor the LB (rolling-window sketch after this list).
* **Grouping feature**
Group the weather data (air_temperature, cloud_coverage, dew_temperature, precip_depth_1_hr) to get the daily average values for each site (sketch after this list).
Using the grouping features gave me a 0.01 improvement on the LB (1.08 to 1.07).
* **Combination of air temperature and dew temperature**
Because these two features have high importance values in LightGBM, I combined them by calculating the difference between them.
* **Time feature**(not used)
Sine and cosine transformations of the time data, but they did not improve my model (encoding sketch after this list).
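For the humidity feature: the write-up does not give the exact formula, so this is a minimal sketch using one common approximation (the Magnus formula) to estimate relative humidity from air temperature and dew point.

```python
import numpy as np

def relative_humidity(air_temp_c, dew_temp_c):
    """Approximate relative humidity (%) from air temperature and dew point (degC)
    using the Magnus formula with a = 17.625, b = 243.04 degC."""
    a, b = 17.625, 243.04
    actual_vp = np.exp(a * dew_temp_c / (b + dew_temp_c))   # vapour pressure term at dew point
    sat_vp = np.exp(a * air_temp_c / (b + air_temp_c))      # saturation vapour pressure term
    return 100.0 * actual_vp / sat_vp

train["humidity"] = relative_humidity(train["air_temperature"], train["dew_temperature"])
```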
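For the (unused) meteocalc features, a usage sketch based on the library's README. `feels_like` expects wind speed in mph, so the conversion from the m/s `wind_speed` column is an assumption, as is the `humidity` column from the sketch above.

```python
from meteocalc import Temp, feels_like

def feels_like_c(air_temp_c, humidity_pct, wind_speed_ms):
    # Temp wraps a temperature together with its unit; feels_like blends heat index
    # and wind chill depending on conditions and returns another Temp object.
    t = Temp(air_temp_c, "c")
    return feels_like(temperature=t, humidity=humidity_pct,
                      wind_speed=wind_speed_ms * 2.237).c  # m/s -> mph

train["feels_like"] = [
    feels_like_c(t, h, w)
    for t, h, w in zip(train["air_temperature"], train["humidity"], train["wind_speed"])
]
```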
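For the (unused) lag features, a rolling-window sketch over the weather table from the preprocessing sketch, assuming it is sorted by site_id and timestamp; window size and aggregations follow the description above.

```python
weather_cols = ["air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr"]

def add_lag_features(weather, window=3):
    """Add rolling min/max/mean/std of each weather variable, computed per site."""
    rolled = weather.groupby("site_id")[weather_cols].rolling(window=window, min_periods=1)
    for agg in ("min", "max", "mean", "std"):
        stats = getattr(rolled, agg)().reset_index(level=0, drop=True)  # back to row index
        for col in weather_cols:
            weather[f"{col}_{agg}_lag{window}"] = stats[col]
    return weather

weather = add_lag_features(weather, window=3)
```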
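For the grouping features, a sketch of the per-site daily averages merged back onto the training frame, continuing from the `weather` and `train` frames above.

```python
weather_cols = ["air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr"]

# Average each weather variable per (site_id, calendar day), then join onto train.
weather["date"] = weather["timestamp"].dt.date
daily_mean = (
    weather.groupby(["site_id", "date"])[weather_cols]
    .mean()
    .rename(columns=lambda c: f"{c}_day_mean")
    .reset_index()
)
train["date"] = train["timestamp"].dt.date
train = train.merge(daily_mean, on=["site_id", "date"], how="left")
```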
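For the (unused) time features, a sketch of the cyclic sine/cosine encoding applied to, for example, the hour of day.

```python
import numpy as np

# Encode the hour of day on a circle so that 23:00 and 00:00 end up close together.
hour = train["timestamp"].dt.hour
train["hour_sin"] = np.sin(2 * np.pi * hour / 24)
train["hour_cos"] = np.cos(2 * np.pi * hour / 24)
```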
## CV
1. Time-based CV splits work well, e.g. first half of the year vs. second half of the year.
2. StratifiedKFold on building_id also performs well.
Using different fold-splitting schemes produces different models (higher model diversity); a sketch of both schemes is shown below.
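A minimal sketch of the two splitting schemes, assuming a `train` frame with `timestamp` and `building_id` columns; the number of folds and the random seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Scheme 1: time-based split, first half of the year for training, second half for validation.
first_half = train["timestamp"].dt.month <= 6
train_idx_time = np.where(first_half)[0]
valid_idx_time = np.where(~first_half)[0]

# Scheme 2: StratifiedKFold with building_id as the stratification label, so every
# building contributes rows to both the training and validation parts of each fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(train, train["building_id"])):
    print(f"fold {fold}: {len(train_idx)} train rows, {len(valid_idx)} valid rows")
```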
## Ensemble
1. Finally, I used the leaked data as a validation set to find the best ensemble weights, which gave a very good result (weight-search sketch below).
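A sketch of the weight search, assuming `preds` is an (n_models, n_rows) array of each model's predictions on the leaked rows and `y_leak` holds the corresponding true meter readings; the RMSLE objective and the Nelder-Mead optimizer are assumptions, not the exact method used.

```python
import numpy as np
from scipy.optimize import minimize

def rmsle(y_true, y_pred):
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

def best_ensemble_weights(preds, y_leak):
    """Search for non-negative blending weights (summing to 1) that minimize
    RMSLE of the blended prediction on the leaked validation rows."""
    n_models = preds.shape[0]

    def loss(raw):
        w = np.abs(raw)
        w = w / w.sum()          # keep the weights non-negative and normalized
        return rmsle(y_leak, w @ preds)

    result = minimize(loss, x0=np.full(n_models, 1.0 / n_models), method="Nelder-Mead")
    w = np.abs(result.x)
    return w / w.sum()

# Usage: weights = best_ensemble_weights(preds, y_leak); blended = weights @ preds
```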