# Sensor Project - Transcript

### Day - 07 [Project file setup]

Create folders & files

- Create conda environment
```
conda create -p env python=3.8 -y
```
```
conda activate ./env
```
- create setup.py
```
# to see all installed libraries
pip list
```
- sensor -> `__init__.py`
- requirements.txt -> `pymongo==4.2.0`, `-e .`
- setup.py updated code; execute setup.py

*Assignment: Write code to read the requirements.txt file and append each requirement to a requirement_list variable (see the sketch after this section).*

**Folders & files**

- pipeline `__init__.py`
- components `__init__.py`
- data_access `__init__.py`
- cloud_storage `__init__.py`
- configuration `__init__.py`
- constant `__init__.py`
- sensor `logger.py`, `exception.py`
- entity `__init__.py`, `artifact_entity.py`, `config_entity.py`
- config folder (created)
- config > schema.yaml (inserted data)

* Artifact: an artifact is a machine learning term that describes an output (a fully trained model, a model checkpoint, or a file) created by the training process.

- ml `__init__.py`
- 2:36:00 - 2:43 [Files explanation]

* MongoDB Atlas [3:02:00] - better to check the course video
- Connection issue with the URL
- Setting it manually in Atlas and connecting with MongoDB Compass
- Connection for MongoDB in the configuration file
- constant folder > create database.py -> DATABASE_NAME="ineuron"
- configuration folder -> mongo_db_connection.py [code it with the mongodb+srv URL; see the sketch after this section]
- requirements.txt `pymongo[srv]==4.2.0`
- create main.py; `run main.py` - working

### Day - 08 [Code explanation part-1][Data Ingestion]

ML pipeline [explanation]

- 14:11 to 22:11 [high-level flow]
- Exception code update - main.py (run) - debugging [34:00]
- Logger code update - list of logging formatters in Python (web)
- main.py code - logs file created automatically after executing main.py
- constant > env_variable.py (code)
- constant > s3_bucket.py (code)
- constant > application.py (code)
- done with the constant folder
- constant > training_pipeline > __init__.py (code)
- config > schema.yaml (created)
- schema.yaml (updated)
- constant > training_pipeline > __init__.py (code)
- flowchart explanation [1:24]
- main.py (Mongo error) [52:00]
- entity > config_entity.py (code) - class DataIngestionConfig
- data ingestion updated - run main.py - ran successfully
- components > data_ingestion.py (created)
- pipeline > training_pipeline.py (created)
- commit changes [01:43:53]
- pipeline > training_pipeline.py (code updated)
- entity > artifact_entity.py (code updated)
- main.py (updated and run)
- components > data_ingestion.py (code)
- data_access > sensor_data.py (created & updated code)
- data_access > sensor_data.py explanation [2:08]
- data_ingestion.py (code updated)
- pipeline > training_pipeline.py (code)
- components > data_ingestion.py (code)
- run main.py - artifact folder created with sensor.csv [2:21:17]
- Exported data to the feature store
- train_test_split - requirements.txt - sklearn (scikit-learn==1.1.3)
- data_ingestion.py - def split_data_as_train_test (see the sketch after this section)
- run main.py - check logs - commit changes
- Data Ingestion completed
- updated README file with the MongoDB URL
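For the Day-07 assignment above, here is a minimal sketch of setup.py that reads requirements.txt into a `requirement_list` variable; the package name, version, and author values are placeholders, not the course's exact metadata.

```python
# setup.py - minimal sketch: read requirements.txt and collect each
# requirement in requirement_list (the Day-07 assignment).
from typing import List
from setuptools import find_packages, setup

HYPHEN_E_DOT = "-e ."  # the editable-install marker kept in requirements.txt

def get_requirements(file_path: str = "requirements.txt") -> List[str]:
    """Read requirements.txt and append each requirement to requirement_list."""
    requirement_list: List[str] = []
    with open(file_path) as file_obj:
        for line in file_obj:
            requirement = line.strip()
            # skip blank lines and the "-e ." entry
            if requirement and requirement != HYPHEN_E_DOT:
                requirement_list.append(requirement)
    return requirement_list

setup(
    name="sensor",              # placeholder metadata
    version="0.0.1",
    author="ineuron",
    packages=find_packages(),
    install_requires=get_requirements(),
)
```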
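The notes point to configuration > mongo_db_connection.py with a mongodb+srv URL, constant > database.py (DATABASE_NAME="ineuron"), and constant > env_variable.py. A minimal sketch follows, assuming the connection string is read from an environment variable; the variable name MONGODB_URL and the class name MongoDBClient are assumptions, not necessarily the course's exact code.

```python
# sensor/configuration/mongo_db_connection.py - minimal sketch.
# Assumes the Atlas connection string (mongodb+srv://...) is exported in the
# MONGODB_URL environment variable; that name is an assumption.
import os
import pymongo

DATABASE_NAME = "ineuron"          # from constant/database.py
MONGODB_URL_KEY = "MONGODB_URL"    # assumed env-variable name (constant/env_variable.py)

class MongoDBClient:
    client = None  # one shared client across instances

    def __init__(self, database_name: str = DATABASE_NAME) -> None:
        if MongoDBClient.client is None:
            mongo_db_url = os.getenv(MONGODB_URL_KEY)
            if mongo_db_url is None:
                raise Exception(f"Environment variable {MONGODB_URL_KEY} is not set.")
            MongoDBClient.client = pymongo.MongoClient(mongo_db_url)
        self.client = MongoDBClient.client
        self.database = self.client[database_name]
        self.database_name = database_name
```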
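Day-08 ends with `split_data_as_train_test` in components > data_ingestion.py using scikit-learn's train_test_split. A minimal sketch, assuming a 0.2 split ratio and illustrative artifact paths (in the course these values come from the data ingestion config):

```python
# sensor/components/data_ingestion.py - minimal sketch of
# split_data_as_train_test. Split ratio and file paths are assumptions.
import os
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data_as_train_test(
    dataframe: pd.DataFrame,
    train_file_path: str = "artifact/data_ingestion/ingested/train.csv",
    test_file_path: str = "artifact/data_ingestion/ingested/test.csv",
    train_test_split_ratio: float = 0.2,
) -> None:
    """Split the feature-store dataframe and write train/test CSV files."""
    train_set, test_set = train_test_split(
        dataframe, test_size=train_test_split_ratio, random_state=42
    )
    os.makedirs(os.path.dirname(train_file_path), exist_ok=True)
    train_set.to_csv(train_file_path, index=False, header=True)
    test_set.to_csv(test_file_path, index=False, header=True)
```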
### Day - 09 [Code explanation Part-2]

- Data Ingestion revision
- Data Validation flowchart explanation
- constant folder > training_pipeline > __init__.py - 57:57 - Data Validation
  - -> data validation dir
  - -> valid data dir
  - -> invalid data dir
  - -> valid train file path
  - -> valid test file path
  - -> invalid train file path
  - -> invalid test file path
  - -> drift report file path
- constant > update
- entity > config_entity.py > data validation class update
- entity > artifact_entity.py > update
- components > data_validation.py (created)
- components > data_validation.py > update
- sensor > utils > main_utils.py > update (see the YAML-helper sketch after this section)
- utils > __init__.py
- requirements.txt > pyyaml
- demo.ipynb (created)
- components > data_validation.py (update)
- Evidently AI > library (data drift detection)
- TFX - TensorFlow Data Validation (1:18)
- scipy.stats.ks_2samp (see the drift-check sketch after this section)

> # Flowchart
>
> ## Training pipeline
> The train and test datasets must have the same distribution.
> - Base dataset: train
> - To compare with: test
> - If the distributions are the same: no drift
> - Solution: if the distributions are not the same, redo the train-test split correctly.
>
> ## Instance prediction
> It is not possible to detect data drift immediately for a single request. Why? Because one record cannot be summarized into a distribution.
> Solution: save each request in a database, then fetch all requests by the hour/by the day.
> Example: the data collected over one day goes for data drift detection against the train dataset. Check the data drift report; if there is a huge difference, go for retraining, else keep checking data drift every day.
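The drift check in components > data_validation.py is built around scipy.stats.ks_2samp, as named in the notes. A minimal sketch follows, assuming a 0.05 p-value threshold and an illustrative drift-report path; the function name detect_dataset_drift and the report layout are assumptions.

```python
# sensor/components/data_validation.py - minimal sketch of column-wise data
# drift detection with the two-sample Kolmogorov-Smirnov test.
# Threshold (0.05), report path, and function name are assumptions.
import os
import yaml
import pandas as pd
from scipy.stats import ks_2samp

def detect_dataset_drift(
    base_df: pd.DataFrame,
    current_df: pd.DataFrame,
    report_file_path: str = "artifact/data_validation/drift_report/report.yaml",
    threshold: float = 0.05,
) -> bool:
    """Return True if no drift is found in any shared numeric column."""
    report = {}
    drift_found = False
    for column in base_df.columns:
        if column not in current_df.columns:
            continue
        if not pd.api.types.is_numeric_dtype(base_df[column]):
            continue  # KS test only applies to numeric columns
        result = ks_2samp(base_df[column], current_df[column])
        # a small p-value means the two samples likely have different distributions
        column_drift = bool(result.pvalue < threshold)
        drift_found = drift_found or column_drift
        report[column] = {"p_value": float(result.pvalue), "drift_status": column_drift}
    report_dir = os.path.dirname(report_file_path)
    if report_dir:
        os.makedirs(report_dir, exist_ok=True)
    with open(report_file_path, "w") as f:
        yaml.dump(report, f)
    return not drift_found
```

In the training pipeline this would be called with the ingested train and test dataframes; a False return value can be used to mark the data validation artifact as drifted.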
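The notes also add pyyaml to requirements.txt and update sensor > utils > main_utils.py, which is where YAML helpers for schema.yaml and the drift report would live. A minimal sketch, assuming the helper names read_yaml_file / write_yaml_file (the course may name them differently):

```python
# sensor/utils/main_utils.py - minimal sketch of YAML helpers built on pyyaml.
# Function names are assumptions.
import os
import yaml

def read_yaml_file(file_path: str) -> dict:
    """Load a YAML file (e.g. config/schema.yaml) into a dict."""
    with open(file_path, "rb") as yaml_file:
        return yaml.safe_load(yaml_file)

def write_yaml_file(file_path: str, content: object, replace: bool = False) -> None:
    """Write content (e.g. the drift report) to a YAML file."""
    if replace and os.path.exists(file_path):
        os.remove(file_path)
    dir_name = os.path.dirname(file_path)
    if dir_name:
        os.makedirs(dir_name, exist_ok=True)
    with open(file_path, "w") as yaml_file:
        yaml.dump(content, yaml_file)
```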