# Dataset preparation
This section explains the steps needed to reproduce the dataset required for training.
## Preparation steps
Activate the conda env by running the command:
```
conda activate mobi_new
```
Alternatively, pull the Docker image that contains the environment with all requirements from MobiAIRegistry:demo_image1.2.

Then change the current directory to CRIM_AI.
## Update Kedro configuration with a master config file
```
kedro run --pipeline configprep --env $ENV
```
ENV must be set to one of the following: "abfs", "hdfs", "local_os".
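
As a guard against typos in the environment name, the command can be built by a small helper before it is run. This is only a minimal sketch: the function name `kedro_command` is hypothetical and not part of the project, and the list of valid environments is copied from the line above.

```python
import shlex
from typing import List

# Environment names copied from this README; update if the project adds more.
VALID_ENVS = ("abfs", "hdfs", "local_os")


def kedro_command(pipeline: str, env: str) -> List[str]:
    """Build the argument list for `kedro run`, rejecting unknown environments.

    Hypothetical helper for illustration; it only assembles the command.
    """
    if env not in VALID_ENVS:
        raise ValueError(f"unknown env {env!r}; expected one of {VALID_ENVS}")
    return ["kedro", "run", "--pipeline", pipeline, "--env", env]


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch has no side effects.
    print(shlex.join(kedro_command("configprep", "local_os")))
```

The returned list can be passed to `subprocess.run(..., check=True)` if the command should actually be executed.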
## Importing data
To import data, run the following:
```
kedro run --pipeline import_data --env abfs
```
The import settings (the transaction date range and the data source) are controlled by the master configuration file.
## Data preparation
To prepare the data, run the following:
```
kedro run --pipeline dataprep --env abfs
```
This pipeline creates a data folder that is ready for training the ML model. The folder (${start_date}/${end_date}/04_dataset) contains the following Parquet files:
- D_forecast_dataset_per_mcc_eval.parquet
- D_forecast_dataset_per_mcc_train.parquet
- D_forecast_dataset_per_user_train.parquet
- D_forecast_dataset_per_user_eval.parquet
- RFMLP_score_dataset_per_mcc_train.parquet
- RFMLP_score_dataset_per_mcc_eval.parquet

It also contains the dataset config file: make_dataset_config.json
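
Before starting a training run, it can be useful to confirm that the pipeline produced every file listed above. The following is a minimal sketch, assuming the dataset folder is reachable as a local path (for example with the "local_os" environment); the helper name `missing_dataset_files` is hypothetical, and remote storage such as abfs would need a different file-system client.

```python
from pathlib import Path
from typing import List

# File names copied from the list in this README.
EXPECTED_FILES = [
    "D_forecast_dataset_per_mcc_eval.parquet",
    "D_forecast_dataset_per_mcc_train.parquet",
    "D_forecast_dataset_per_user_train.parquet",
    "D_forecast_dataset_per_user_eval.parquet",
    "RFMLP_score_dataset_per_mcc_train.parquet",
    "RFMLP_score_dataset_per_mcc_eval.parquet",
    "make_dataset_config.json",
]


def missing_dataset_files(dataset_dir: str) -> List[str]:
    """Return the expected file names that are absent from dataset_dir.

    Hypothetical helper for illustration; it only checks for presence on a
    local file system and does not validate file contents.
    """
    root = Path(dataset_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means all expected files are present; otherwise the returned names show what the dataprep pipeline still needs to produce.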
## Training models
### Training XGBoost model
```
python src/train_xgb.py --path abfs://prosa/Hive/Warehouse/kedro.db/2018-08-01_2018-12-01 --experiment_name 04_datasets --config src/mobi/older_script_to_update/config/config_xgboost.json
```
### Training LSTM model
```
python src/train_seq.py --path abfs://prosa/Hive/Warehouse/kedro.db/2018-08-01_2018-12-01 --experiment_name 04_datasets --config src/mobi/older_script_to_update/config/config_seq_model256.json
```
## Model Evaluation
### XGBoost Evaluation
```
python src/predict.py --method xgboost --experiment_name 04_datasets --path abfs://prosa/Hive/Warehouse/kedro.db/2018-08-01_2018-12-01 --run_name training_runs/run_xgb_rfmlp_10.08.2022_08h22m28s
```
### LSTM Evaluation
```
python src/predict.py --method seq --experiment_name 04_datasets --path abfs://prosa/Hive/Warehouse/kedro.db/2018-08-01_2018-12-01 --run_name training_runs/run_seq_15.09.2022_06h22m30s
```
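
The `--run_name` values above end in a timestamp such as `10.08.2022_08h22m28s`. When several training runs exist, the most recent one can be picked by parsing that suffix. This is a minimal sketch, assuming every run name ends with a `DD.MM.YYYY_HHhMMmSSs` suffix as in the two examples; the helper names `run_timestamp` and `latest_run` are hypothetical and not part of the project.

```python
import re
from datetime import datetime
from typing import Iterable

# Assumed suffix format, inferred from the example run names in this README.
RUN_SUFFIX_RE = re.compile(r"_(\d{2}\.\d{2}\.\d{4}_\d{2}h\d{2}m\d{2}s)$")


def run_timestamp(run_name: str) -> datetime:
    """Extract the timestamp encoded at the end of a run name."""
    match = RUN_SUFFIX_RE.search(run_name)
    if match is None:
        raise ValueError(f"no timestamp suffix found in {run_name!r}")
    # Day comes first: "15.09.2022" is only valid as day.month.year.
    return datetime.strptime(match.group(1), "%d.%m.%Y_%Hh%Mm%Ss")


def latest_run(run_names: Iterable[str]) -> str:
    """Return the run name with the most recent timestamp suffix."""
    return max(run_names, key=run_timestamp)
```

The result of `latest_run` can then be passed as the `--run_name` argument of `src/predict.py`.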