# Dataset preparation

This section explains the steps needed to reproduce the dataset required for training.

## Preparation steps

Activate the conda environment by running:

```
conda activate mobi_new
```

Alternatively, pull the Docker image that contains the environment with all requirements from MobiAIRegistry:demo_image1.2.

Change the current directory to CRIM_AI.

## Update Kedro configuration with a master config file

```
kedro run --pipeline configprep --env $ENV
```

`ENV` can be one of the following: "abfs", "hdfs", "local_os".

## Importing data

To import data, run:

```
kedro run --pipeline import_data --env abfs
```

The import settings for the transaction date range and the data source are controlled by the master configuration file.

## Data preparation

To prepare the data, run:

```
kedro run --pipeline dataprep --env abfs
```

This pipeline creates a data folder ready for training the ML model. The folder (`${start_date}/${end_date}/04_dataset`) contains the following Parquet files:

- D_forecast_dataset_per_mcc_eval.parquet
- D_forecast_dataset_per_mcc_train.parquet
- D_forecast_dataset_per_user_train.parquet
- D_forecast_dataset_per_user_eval.parquet
- RFMLP_score_dataset_per_mcc_train.parquet
- RFMLP_score_dataset_per_mcc_eval.parquet

and the dataset config file `make_dataset_config.json`.

## Training models

### Training XGBoost model

```
python src/train_xgb.py --path abfs://prosa/Hive/Warehouse/kedro.db/2018-08-01_2018-12-01 --experiment_name 04_datasets --config src/mobi/older_script_to_update/config/config_xgboost.json
```

### Training LSTM model

```
python src/train_seq.py --path abfs://prosa/Hive/Warehouse/kedro.db/2018-08-01_2018-12-01 --experiment_name 04_datasets --config src/mobi/older_script_to_update/config/config_seq_model256.json
```

## Model Evaluation

### XGBoost Evaluation

```
python src/predict.py --method xgboost --experiment_name 04_datasets --path abfs://prosa/Hive/Warehouse/kedro.db/2018-08-01_2018-12-01 --run_name training_runs/run_xgb_rfmlp_10.08.2022_08h22m28s
```

### LSTM Evaluation

```
python src/predict.py --method seq --experiment_name 04_datasets --path abfs://prosa/Hive/Warehouse/kedro.db/2018-08-01_2018-12-01 --run_name training_runs/run_seq_15.09.2022_06h22m30s
```
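
The commands in this README repeat the same `--path` and `--experiment_name` values, so it can help to assemble them in one place. The following is a minimal sketch, not part of the repository: the helper names (`kedro_cmd`, `train_cmd`, `predict_cmd`, `build_plan`) are hypothetical, and only the flags and values shown above are used. It prints the commands as a dry run and does not execute them.

```python
import shlex

# Values copied from the commands in this README.
DATA_PATH = "abfs://prosa/Hive/Warehouse/kedro.db/2018-08-01_2018-12-01"
EXPERIMENT = "04_datasets"
CONFIG_DIR = "src/mobi/older_script_to_update/config"
VALID_ENVS = ("abfs", "hdfs", "local_os")


def kedro_cmd(pipeline, env):
    """Return the argv list for one Kedro pipeline run."""
    if env not in VALID_ENVS:
        raise ValueError(f"env must be one of {VALID_ENVS}, got {env!r}")
    return ["kedro", "run", "--pipeline", pipeline, "--env", env]


def train_cmd(script, config_name):
    """Return the argv list for a training script."""
    return [
        "python", script,
        "--path", DATA_PATH,
        "--experiment_name", EXPERIMENT,
        "--config", f"{CONFIG_DIR}/{config_name}",
    ]


def predict_cmd(method, run_name):
    """Return the argv list for evaluating a finished training run."""
    return [
        "python", "src/predict.py",
        "--method", method,
        "--experiment_name", EXPERIMENT,
        "--path", DATA_PATH,
        "--run_name", run_name,
    ]


def build_plan(env="abfs"):
    """List the preparation and training commands in README order."""
    return [
        kedro_cmd("configprep", env),
        kedro_cmd("import_data", env),
        kedro_cmd("dataprep", env),
        train_cmd("src/train_xgb.py", "config_xgboost.json"),
        train_cmd("src/train_seq.py", "config_seq_model256.json"),
    ]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for argv in build_plan():
        print(shlex.join(argv))
```

The evaluation commands are left out of `build_plan` because `--run_name` refers to a specific training run (for example `training_runs/run_seq_15.09.2022_06h22m30s`), which is only known once training has finished; `predict_cmd` can be called with that name afterwards.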