# Walmart Sales Forecasting Project: Baseline Model Guideline

This is a simple guideline to help you build your machine learning project in Azure ML Studio. Through this practice, you will:

1. understand how to build a machine learning pipeline
2. understand how to preprocess your data
3. understand how to evaluate model performance
4. be proud of yourself :smile:

We also prepared some reference links for you to learn more ML concepts :notebook:

## Step1. Data Preparation

- [ ] Download the data from the Kaggle Walmart competition: https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data

:::info
You'll need to register a Kaggle account. Welcome to join the biggest data science community!! :heart:
:::

- [ ] Upload **features.csv**, **stores.csv** and **train.csv** from your local machine to Azure ML Studio

## Step2. Data Preprocessing

- [ ] Click the menu item [Experiments] and create a new blank experiment.
- [ ] Import and join the tables we just uploaded, **features.csv**, **train.csv** and **stores.csv**, into a wide table with **sample size = 421,570**.

:::info
Try to use the [store] and/or [date] columns as keys to join the tables :mega:
:::

- [ ] In addition to Weekly_Sales (the target variable), select useful features to predict the weekly sales.

:::info
Before modeling, your table must include both the features and the target variable. :mega:
:::

- [ ] If your features are categorical, remember to use [Edit Metadata] to mark them as categorical.

## Step3. Model Building

The following image shows the model pipeline you'll build. You'll learn how to:

1. split the data into train, validation and test sets[1].
    * Train data: the model learns from the train data.
    * Validation data: use the evaluation result on the validation data to fine-tune your model.
    * Test data: we treat the test data as previously unseen data to test your model's performance.
2. choose your regression model

:::warning
A regression model is used to predict a numeric target value, e.g. age, sales amount, salary, etc. :bulb:
:::
3. diagnose your model's performance
    * understand the evaluation metric: Mean Absolute Error (MAE)
    * recognize overfitting and underfitting problems

:::warning
Please see the reference links; you'll also learn this in depth in the Udemy course :bulb:
:::

![](https://i.imgur.com/xcbVrDC.png)

- [ ] Split the table into three parts: train, validation and test data. The details are as follows:
    1. Divide the table into train and test datasets. Please set [fraction of rows in the first output dataset] to 0.8 and [random seed] to 5566 ![](https://i.imgur.com/wiZRu2R.png)

    :::danger
    The fraction ratio and random seed here can't be changed :mega:
    :::

    2. Further split the train data into train and validation data. Please set [fraction of rows in the first output dataset] to 0.8 at first, and feel free to adjust it later.
- [ ] Use the train data to build the regression model. You can use any regression model you prefer.
- [ ] Use the model that has already learned from the train data to make predictions on the validation data, then evaluate and fine-tune your model based on its performance on the validation data.

:::info
We use the Mean Absolute Error (MAE) metric as our indicator[2]: the smaller, the better. :mega:
:::

- [ ] Use the trained model to make predictions on the test data and see the final evaluation result.

:::info
We care more about the MAE score on the test data because we treat the test data as previously unseen data, to simulate performance in production. :mega:
:::

- [ ] Compare the **MAE** results between the validation and test data. If there is a big gap between the MAE values on the validation and test data, you have an **overfitting[3]** problem. If both MAE values are quite large, you have an **underfitting[4]** problem. You can take the following steps to improve your results:
    1. Create or add more features so the model can learn better.
    2. Set a different fraction ratio in the train/validation split node.
    3. Try different regression models.
    4. Select a different set of model hyper-parameters.

:::warning
Try the above to get the best performance on the validation data. Then apply the new model to predict the test data and see if it behaves well too :bulb:
:::

## Step4. Submission

- [ ] Screenshot your **MAE results on the test data with your name**, and upload the image to the AI Learning Program Wiki page.

![](https://i.imgur.com/SFlFK6P.png)

## Reference

1. Train, validation and test split: https://developers.google.com/machine-learning/crash-course/validation/another-partition
2. Mean Absolute Error (MAE): https://en.wikipedia.org/wiki/Mean_absolute_error
3. Overfitting: https://developers.google.com/machine-learning/glossary/#overfitting
4. Underfitting: https://developers.google.com/machine-learning/glossary/#underfitting
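As a supplement, the table join from Step 2 can be sketched in code. This is a minimal pandas sketch, not the Azure ML Studio designer workflow itself: the tiny hand-made frames below stand in for the real Kaggle CSVs (which have far more rows and columns), and the column names used as join keys (`Store`, `Date`) follow the Kaggle dataset.

```python
import pandas as pd

# Tiny stand-in frames with the same key columns as the Kaggle files.
# The real train.csv has 421,570 rows; these are just for illustration.
train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date": ["2010-02-05", "2010-02-05", "2010-02-05"],
    "Weekly_Sales": [24924.50, 50605.27, 35034.06],
})
features = pd.DataFrame({
    "Store": [1, 2],
    "Date": ["2010-02-05", "2010-02-05"],
    "Temperature": [42.31, 40.19],
    "Fuel_Price": [2.572, 2.548],
})
stores = pd.DataFrame({
    "Store": [1, 2],
    "Type": ["A", "A"],
    "Size": [151315, 202307],
})

# Left-join on the key columns: [Store, Date] for features, [Store] for stores.
# A left join keeps exactly one output row per train row, which is why the
# wide table ends up with the same sample size as train.csv.
wide = (train
        .merge(features, on=["Store", "Date"], how="left")
        .merge(stores, on="Store", how="left"))

print(len(wide))           # same number of rows as the train frame
print(sorted(wide.columns))
```

With the real files, checking that the row count is still 421,570 after the joins is a quick way to confirm no rows were duplicated or dropped.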
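The split-train-evaluate flow from Step 3 can likewise be sketched with scikit-learn as a stand-in for the designer modules. The data here is synthetic (the real input is the wide table built in Step 2), the choice of `LinearRegression` is arbitrary since the guideline allows any regressor, and the 0.8 fractions plus the fixed seed 5566 mirror the settings above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wide table: 1,000 rows, 3 numeric features,
# and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=1000)

# 1) 80/20 train/test split with the fixed random seed 5566.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=5566)

# 2) Split the training portion again, 80/20, into train and validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, train_size=0.8, random_state=5566)

model = LinearRegression().fit(X_tr, y_tr)

# Fine-tune against the validation MAE; look at the test MAE only at the end,
# since the test set simulates previously unseen production data.
mae_val = mean_absolute_error(y_val, model.predict(X_val))
mae_test = mean_absolute_error(y_test, model.predict(X_test))
print(f"validation MAE: {mae_val:.3f}, test MAE: {mae_test:.3f}")
```

If the validation MAE is much lower than the test MAE, that is the overfitting gap described above; if both are large, the model is underfitting.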