> Some of the code includes brief explanatory text, which can be seen in the .ipynb files when opened in Google Colab or Jupyter Notebook.
>
> To run these files, the paths must be updated to point to your input files.
# ML Prediction With Davis Weather station data (Sunrise,sunset)
Dates considered: 2022-10-01 to 2023-03-20 (November 2022 to March 2023)
##### The folder for this project can be found here:
https://drive.google.com/drive/folders/1BSWzN0x9zX555RkfUqpbMf3nIv6x09AU?usp=drive_link
#### The notebook with the prediction code can be found here:
https://colab.research.google.com/drive/1h0tOGuEQlgvAdkNHYtCOhtBpE81DFOVm#scrollTo=wqhHsUMsi4dQ
---
### Steps taken to harmonize the data
* Initial data cleanup
* Datetime fix
* Selecting the required columns
* Checking for null values
* Identifying days where solar radiation is below 500 (max solar radiation is 1045, min is 191)
* Making notes on why certain days have low radiation
* Applying a min-max scaler to harmonize the units of the data
# Adding new features
Below is the dataset enriched with sunrise and sunset data:
https://drive.google.com/file/d/1Ju7EPalNyaj9ugIunu8nU_DRvDFTPNTL/view?usp=drive_link
We added the sunrise and sunset data to ensure the day length is correct during prediction, and to make sure predictions do not start before early morning or extend past late evening.
* After adding the sunrise and sunset data, we converted the day length to minutes
* We created a new column called `sun`, set to 1 when the timestamp falls between sunrise and sunset (daytime) and 0 when it falls between sunset and sunrise (evening to early morning)
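The two steps above can be sketched with pandas as below (a minimal sketch; the column names and timestamps are illustrative):

```python
# Sketch of deriving day_length (in minutes) and the binary "sun" flag.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2022-11-01 05:00", "2022-11-01 12:00", "2022-11-01 19:00"]),
    "sunrise":   pd.to_datetime(["2022-11-01 06:30"] * 3),
    "sunset":    pd.to_datetime(["2022-11-01 17:45"] * 3),
})

# Day length in minutes
df["day_length"] = (df["sunset"] - df["sunrise"]).dt.total_seconds() / 60

# sun = 1 between sunrise and sunset, else 0
df["sun"] = ((df["timestamp"] >= df["sunrise"]) &
             (df["timestamp"] <= df["sunset"])).astype(int)
```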
#### Post-harmonization steps
* The window length is set to 1 day
* The target length is set to 96 data points, which equals a 1-day prediction at 15-minute intervals
* A train/test split is done, with the last 10 days of data used for testing
* Modeling using an LSTM
  * Sigmoid activation with 13 features, trained for 100 epochs
  * Early stopping is used to prevent overfitting
* Prediction
  * The output is scaled back to its original units
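The window/target setup above can be sketched as below (a minimal sketch using a single dummy feature; the real input has 13 features per step):

```python
# Sliding-window setup: 96 input steps (1 day at 15-minute intervals)
# predicting the next 96 steps.
import numpy as np

def make_windows(series, window_len=96, target_len=96):
    X, y = [], []
    for i in range(len(series) - window_len - target_len + 1):
        X.append(series[i : i + window_len])
        y.append(series[i + window_len : i + window_len + target_len])
    return np.array(X), np.array(y)

data = np.arange(96 * 5, dtype=float)  # 5 days of dummy values
X, y = make_windows(data)
# X and y each have 289 windows of 96 steps
```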
(LSTM Model) Explanation
---
The following code creates an LSTM model using the Davis weather station dataset enriched with the sunrise/sunset details. It uses features such as Temp Out, Dew Pt, Wind Speed, Hi Speed, THSW Index, Wind Chill, Rain, Solar Rad, Solar Energy, Cool D/D, In Air Density, and day_length to train the model. The values in the dataset are normalized with a min-max scaler. A sliding window of 1 day's worth of data points is applied
(more about sliding windows can be seen here: https://machinelearningmastery.com/time-series-forecasting-supervised-learning/).
Similarly, the target length is set to 96 (equal to the total number of data points in a day). Out of the 171 days in the dataset, 161 days are used for training, while the last 10 days are reserved for testing the model. A batch size of 32 is used for training. Additionally, early stopping and dropout are applied to prevent overfitting.
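A minimal Keras sketch of the model described above (the 13-feature input, 96-step window and target, sigmoid activation, dropout, early stopping, batch size 32, and up to 100 epochs come from the description; the hidden-layer width and dropout rate are assumptions):

```python
# Sketch of the described LSTM: a 96-step window of 13 features in,
# a 96-step day-ahead target out. Hidden width (64) and dropout
# rate (0.2) are assumptions, not the original values.
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    Input(shape=(96, 13)),            # 1-day window, 13 features
    LSTM(64, activation="sigmoid"),
    Dropout(0.2),                     # regularization against overfitting
    Dense(96),                        # one value per 15-minute step
])
model.compile(optimizer="adam", loss="mse")

early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.1,
#           epochs=100, batch_size=32, callbacks=[early_stop])
```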
Training RMSE: **0.0324**

Model output comparing actual and predicted values:
* Mean Squared Error (LSTM): **0.017188**
* Root Mean Squared Error (LSTM): **0.131102**
* R² score of the LSTM model: **0.802638**
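For reference, these metrics are typically computed as below (a sketch with dummy arrays standing in for the actual and predicted series):

```python
# MSE, RMSE, and R² as reported throughout this document.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

actual = np.array([0.2, 0.4, 0.6, 0.8])
predicted = np.array([0.25, 0.35, 0.65, 0.75])

mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)
r2 = r2_score(actual, predicted)
```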
# RNN Model with the same dataset but without the sun column
The code for this can be seen here:
https://drive.google.com/file/d/12q4uX2GYwiFU1bl-U7swWHId1rlqwyZ0/view?usp=sharing
Here we used a window length of 3 days, which is 288 data points.

Training RMSE: **0.0694**

Model output comparing actual and predicted values:
* Root Mean Squared Error (RNN): **0.07966**
* R² score of the RNN model: **0.9267**
---
# GRU Model with the same dataset but without the sun column
The code for this can be seen here:
https://drive.google.com/file/d/1O7o__HIRg-JwT3gOiRdW03LxtHnCs1Pu/view?usp=sharing
Here we used a window length of 3 days, which is 288 data points.

Training RMSE: **0.0582**

Model output comparing actual and predicted values:
* Root Mean Squared Error (GRU): **0.060318**
* R² score of the GRU model: **0.957977**
---
# ML Prediction With Davis Weather station data (without Sunrise,sunset)
Dates considered: 2022-10-01 to 2023-03-20 (November 2022 to March 2023)
##### The folder for this project can be found here:
https://drive.google.com/drive/folders/1fVPK1bfiQS4bEGKgUtBhRch06DwX6z-A?usp=drive_link
The notebook with the code can be found here:
https://colab.research.google.com/drive/1EOJ1LLBkj7AcYDdCduXgYn6Hdu5RA0N6
### Steps taken to harmonize the data
* Initial data cleanup
* Datetime fix
* Selecting the required columns
* Checking for null values
* Identifying days where solar radiation is below 500 (on average, max solar radiation is 1045 and min is 191)
* Making notes on why certain days have low radiation
* Applying a min-max scaler to harmonize the units of the data
The same steps as above were also taken for this model, except that testing used 5 days of data, the number of epochs is 20, and the window length is 288 data points (a 3-day window).
Early stopping is not applied here.

Training RMSE: **0.0539**
Model output comparing actual and predicted values:
* Root Mean Squared Error (LSTM): **0.073379**
* R² score of the LSTM model: **0.941507**
# ML Prediction With Davis Weather station data (with Sunrise/sunset, splitting train/test into 2 files)
##### The folder for this project can be found here:
https://drive.google.com/drive/folders/17QUzG6aDT0Lx27-knJiJX5mmMoafM_oC?usp=drive_link
The code for the model is here:
https://colab.research.google.com/drive/1cGQ6SCCg5ob3otyiq6CJLXJIqm_UKC8j#scrollTo=7SNoie8K7gT2
Based on the suggestion, we split the Davis file into two different datasets:
1) Training dataset
2) Testing dataset

This way the training data does not influence the test data, which reduces noise in the output.
To avoid overfitting, we employed early stopping and a dropout layer.
The data has been harmonized to a 15-minute interval.
Training RMSE: **0.0264**

Model output comparing actual and predicted values:
* Mean Squared Error (LSTM): **0.020222**
* Root Mean Squared Error (LSTM): **0.142203**
* R² score of the LSTM model: **0.745141**
# ML Prediction With Southwest and Experimental Building station data (split train and test separately)
From the EB we have taken the solar energy output from the Studer inverter (energy_solar(KWh)).
##### The folder for this project can be found here:
https://drive.google.com/drive/folders/1FD0xldf5W-58E57dPUN1_JamIxum8E3t?usp=drive_link
The code for the model is here:
https://colab.research.google.com/drive/1RiBYmP5v0vIGLpgkMyt0k1oWVy_uKcxs
This is work on a new set of data gathered from new weather stations (SW and EB).
The dataset has values from 2023-04-01 to 2024-01-08.
We split the data into two separate files to avoid noise in the prediction.
### Steps taken to harmonize the data
* Initial data cleanup
* Datetime fix
* Selecting the required columns
* Checking for null values
* Addressing missing values
* Applying a min-max scaler to harmonize the units of the data
The datetime is harmonized separately, downsampling to 2 data points per hour.

**Approach**: all data points between the 0th and 29th minute are averaged to create a new data point timestamped at the 15th minute; similarly, data points between the 30th and 59th minute are averaged into a new data point at the 45th minute. This gives 2 data points per hour.

**Caveat**: we don't know whether this is the right approach.
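The averaging approach can be sketched with pandas as below (a minimal sketch; the column name and values are illustrative):

```python
# Bucket into 30-minute windows, average, then shift the label to
# the window midpoint, so :00-:29 -> :15 and :30-:59 -> :45.
import pandas as pd

idx = pd.date_range("2023-04-01 10:00", periods=6, freq="10min")
df = pd.DataFrame({"energy_solar_kwh": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]},
                  index=idx)

out = df.resample("30min").mean()
out.index = out.index + pd.Timedelta(minutes=15)
# 10:00-10:20 average -> 10:15, 10:30-10:50 average -> 10:45
```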
Missing days report:
https://docs.google.com/document/d/1m6SWa7zl-B22HzJ9dPLpaC-6ZzCJOHOG/edit?usp=sharing&ouid=102088372888500442638&rtpof=true&sd=true
* For missing days, we used the previous day's data in place of the missing day
* For missing hours, we used the same hour's data from the previous day
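The fill rule above can be sketched as below (a minimal sketch assuming a regular 30-minute index, so one day is 48 steps):

```python
# Fill each missing slot with the value from the same slot one day
# earlier (48 steps back at a 30-minute interval).
import pandas as pd
import numpy as np

idx = pd.date_range("2023-04-01", periods=96, freq="30min")  # 2 days
s = pd.Series(np.arange(96, dtype=float), index=idx)
s.iloc[48:] = np.nan  # pretend the second day is missing

filled = s.fillna(s.shift(48))
# Day 2 now repeats day 1's values slot by slot.
```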
A window length of 1 day is used, where each day has 48 data points, so the target length is also 48 data points.
The number of epochs is 100.

Testing dates: 2023-12-16 to 2024-01-08
Training dates: 2023-04-01 to 2023-12-15
Training RMSE: **0.0502**

Model output comparing actual and predicted values:
* Mean Squared Error (LSTM): **0.012916**
* Root Mean Squared Error (LSTM): **0.11365**
* R² score of the LSTM model: **0.764591**
---
# ML Prediction with new sensor data
Based on the learnings from the above experiments, we decided to add sunrise and sunset data to the existing sensor data and to run the following models independently,
with a batch size of 32 and 100 epochs (increased/decreased while experimenting); a window length of 1 day is used here:
* LSTM
* GRU
* RNN
* ANN
* XGBoost

We had missing data in the individual datasets obtained from the sensors, so we aggregated the North West, North East, and South West sensor data into one dataset.
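The aggregation can be sketched as below (a minimal sketch; the values, alignment, and the choice of averaging across stations are illustrative):

```python
# Combine the three station series on a shared timestamp index and
# average across stations; averaging also papers over gaps in any
# single station's data.
import pandas as pd

idx = pd.date_range("2023-04-01", periods=3, freq="30min")
nw = pd.Series([1.0, 2.0, None], index=idx)  # North West
ne = pd.Series([3.0, None, 5.0], index=idx)  # North East
sw = pd.Series([5.0, 4.0, 6.0], index=idx)   # South West

combined = pd.concat([nw, ne, sw], axis=1).mean(axis=1)
# Rows with a missing station average over the remaining ones.
```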
### The code base for the ML prediction can be seen here:
https://drive.google.com/drive/folders/1N3S31He-3kYPkGyUBdE_ZkIwxTBtDm0z?usp=sharing
This prediction folder contains:
1. **Preprocessing folder**
   This folder holds the preprocessing scripts, each with its own explanation:
   * Resampling data
   * Combining sunrise and sunset data
   * Missing datetime check
2. **Datasets folder**
   * This folder contains the datasets used by the program
3. **Models without sun**
   The dataset used here does not have sunrise/sunset data.
   This folder has the subfolders:
   * LSTM
   * GRU
   * RNN
   * ANN
   * XGBoost

   Each subfolder contains:
   * the scaler file used for both the train and test data
   * the model output file of the respective model
   * the script to run the model
   * the prediction output
4. **Models with sun**
   The dataset used here is enriched with sunrise and sunset data.
   This folder has the subfolders:
   * LSTM
   * GRU
   * RNN
   * ANN
   * XGBoost

   Each subfolder contains:
   * the scaler file used for both the train and test data
   * the model output file of the respective model
   * the script to run the model
   * the prediction output
# Output Report Matrix
https://docs.google.com/spreadsheets/d/151qDwYKnqN1mHDs7wQYAS-6SKNz2yYPw/edit?usp=sharing&ouid=102088372888500442638&rtpof=true&sd=true
