# Walmart Sales Forecasting Project:<br />Baseline Model Guideline
This is a simple guideline to help you build your machine learning project in Azure ML Studio. Through this exercise, you will:
1. understand how to build a machine learning pipeline
2. understand how to preprocess your data
3. understand how to evaluate the performance
4. be proud of yourself :smile:
We have also prepared some reference links so you can learn more ML concepts :notebook:
## Step1. Data Preparation
- [ ] Download the data from the Kaggle Walmart competition:
https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data
:::info
You'll need to register a Kaggle account.
Welcome to the biggest data science community!! :heart:
:::
- [ ] Upload **features.csv**, **stores.csv** and **train.csv** from your local machine to Azure ML Studio.
## Step2. Data Preprocessing
- [ ] Click the menu item [Experiments] and create a new blank experiment.
- [ ] Import and join the tables we just uploaded, **features.csv**, **train.csv** and **stores.csv**, into one wide table with **sample size = 421,570**.
:::info
Try to use [store] and/or [date] columns as keys to join the tables :mega:
:::
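As a rough illustration of what the join does, here is a minimal sketch in plain Python. The rows are made up and only mimic the shape of the Kaggle files; the column names `Dept`, `Temperature` and `Type` and the helper `left_join` are assumptions for this example, not part of the Azure ML Studio workflow. The point it shows is that a left join starting from the train table keeps exactly one output row per train row, which is why the sample size should not change after joining.

```python
# Made-up rows that only mimic the shape of the three tables.
train = [
    {"Store": 1, "Dept": 1, "Date": "2010-02-05", "Weekly_Sales": 100.0},
    {"Store": 1, "Dept": 2, "Date": "2010-02-05", "Weekly_Sales": 50.0},
    {"Store": 2, "Dept": 1, "Date": "2010-02-05", "Weekly_Sales": 80.0},
]
features = [
    {"Store": 1, "Date": "2010-02-05", "Temperature": 42.3},
    {"Store": 2, "Date": "2010-02-05", "Temperature": 40.2},
]
stores = [
    {"Store": 1, "Type": "A"},
    {"Store": 2, "Type": "B"},
]


def left_join(left, right, keys):
    """Hypothetical helper: attach the matching right row to every left row."""
    # Index the right table by its key columns (assumes keys are unique there).
    index = {tuple(row[k] for k in keys): row for row in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys), {})
        # Values from the left row win if both tables share a column name.
        joined.append({**match, **row})
    return joined


# Join on store and date first, then on store only.
wide = left_join(left_join(train, features, ["Store", "Date"]), stores, ["Store"])
print(len(wide), sorted(wide[0].keys()))
```

If the row count grows after a join, the key columns are probably not unique in the table being attached.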
- [ ] In addition to Weekly_Sales (the target variable), select useful features for predicting the weekly sales.
:::info
Before modeling, your table must include both the features and the target variable. :mega:
:::
- [ ] If any of your features are categories, remember to use [Edit Metadata] to make them categorical.
## Step3. Model Building
The following image is the model pipeline you'll build. You'll learn how to:
1. split the data into train, validation and test data[1].
    * Train data: the model learns from the train data.
    * Validation data: use the evaluation result on the validation data to fine-tune your model.
    * Test data: we treat the test data as previously unseen data to test your model's performance.
2. choose your regression model
:::warning
A regression model is used to predict a numeric target value, e.g. age, sales amount, salary, etc. :bulb:
:::
3. diagnose your model performance
    * understand the evaluation metric: Mean Absolute Error (MAE)
    * recognize overfitting and underfitting problems
:::warning
Please see the reference links; you'll study these topics in depth in the Udemy course :bulb:
:::

- [ ] Split the table into three parts: train, validation and test data.<br />The details are as follows:
    1. Divide the table into train and test datasets.
    Please set [fraction of rows in the first output dataset] to 0.8 and [random seed] to 5566.

:::danger
The fraction and random seed here must not be changed :mega:
:::
    2. Then split the train data further into train and validation data.
    Please set [fraction of rows in the first output dataset] to 0.8 at first, and feel free to adjust it later.
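The two-stage split above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `split_rows` is a hypothetical helper, the 100 integer "rows" are placeholders, and Python's random generator will not pick the same rows as the Azure ML Studio split node even with the same seed. It does show how the two fractions of 0.8 combine into roughly 64% train, 16% validation and 20% test.

```python
import random


def split_rows(rows, fraction, seed):
    """Hypothetical helper: shuffle a copy of rows, then cut it in two.

    The first returned list holds `fraction` of the rows, the second the rest.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # placeholder for the joined table

# Stage 1: hold out the test data (fraction and seed fixed by the guideline).
train_val, test = split_rows(rows, 0.8, seed=5566)

# Stage 2: split the remainder into train and validation data.
train, validation = split_rows(train_val, 0.8, seed=5566)

print(len(train), len(validation), len(test))  # 64 16 20
```

Because every row ends up in exactly one of the three parts, the validation and test results are computed on rows the model never saw during training.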
- [ ] Use train data to build the regression model. You can use any regression model you prefer.
- [ ] Use the model trained on the train data to make predictions on the validation data, then evaluate and fine-tune your model based on its performance on the validation data.
:::info
We use the Mean Absolute Error (MAE) metric as our indicator[2]; the smaller, the better. :mega:
:::
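For reference, MAE is the average of the absolute differences between the actual and the predicted values. The sketch below computes it by hand in plain Python; the function name and the sample numbers are made up for illustration and are not taken from the Walmart data.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all rows."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up weekly sales and predictions, for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

# Absolute errors are 10, 10 and 30, so the MAE is 50 / 3.
mae = mean_absolute_error(actual, predicted)
print(round(mae, 2))
```

MAE is expressed in the same unit as the target, so here it reads directly as the average error in weekly sales.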
- [ ] Use the trained model to make predictions on the test data and see the final evaluation result.
:::info
We care more about the MAE score on the test data because we treat test data as previously unseen data, which simulates performance in production. :mega:
:::
- [ ] Compare the **MAE** results between validation and test data.
If there is a big gap between the MAE values on validation and test data, you have an **overfitting[3]** problem.
If both MAE values are quite large, you have an **underfitting[4]** problem.
You could take the following steps to improve your results:
    1. Create or add more features to help the model learn better.
    2. Set a different fraction in the train / validation split node.
    3. Try different regression models.
    4. Select a different set of model hyper-parameters.
:::warning
Try the steps above to get the best performance on the validation data. Then apply the new model to the test data and see if it behaves well there too :bulb:
:::
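One way to put a number on the "gap" between the validation and test MAE is to express it relative to the validation MAE. The helper `relative_mae_gap` and the two MAE values below are assumptions made up for this sketch; the guideline does not define a threshold for what counts as a big gap, so the judgment is left to you.

```python
def relative_mae_gap(validation_mae, test_mae):
    """Hypothetical helper: how much worse (or better) test MAE is, as a fraction."""
    if validation_mae <= 0:
        raise ValueError("validation_mae must be positive")
    return (test_mae - validation_mae) / validation_mae


# Made-up MAE values, for illustration only.
gap = relative_mae_gap(validation_mae=1500.0, test_mae=1800.0)
print(f"test MAE is {gap:.0%} higher than validation MAE")
```

Tracking this value while you try the improvement steps makes it easier to see whether a change helps on validation data only or on test data as well.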
## Step4. Submission
- [ ] Take a screenshot of your **MAE results on test data with your name**, and upload the image to the AI Learning Program Wiki page.

## Reference
1. Train, validation and test split:
https://developers.google.com/machine-learning/crash-course/validation/another-partition
2. Mean Absolute Error (MAE):
https://en.wikipedia.org/wiki/Mean_absolute_error
3. Overfitting:
https://developers.google.com/machine-learning/glossary/#overfitting
4. Underfitting:
https://developers.google.com/machine-learning/glossary/#underfitting