# Sensor Project - Transcript
### Day - 07 [Project file setup]
Create folders & files
- Create conda environment
```
conda create -p env python==3.8 -y
```
```
conda activate env/
```
- create setup.py
```
# to see all installed libraries
pip list
```
- sensor
  -> `__init__.py`
- requirements.txt
  -> `pymongo==4.2.0`
  -> `-e .`
- setup.py updated code
- Execute setup.py
*Assignment:
Write code to read the requirements.txt file and append each requirement to a requirement_list variable (a sketch follows below).*
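A minimal sketch for the assignment, assuming requirements.txt sits next to setup.py and that the editable-install flag `-e .` should not be passed to install_requires:
```
# setup.py helper (sketch): read requirements.txt into a list
from typing import List

HYPHEN_E_DOT = "-e ."

def get_requirements(file_path: str = "requirements.txt") -> List[str]:
    """Read requirements.txt and append each requirement to requirement_list."""
    requirement_list: List[str] = []
    with open(file_path) as file_obj:
        for line in file_obj:
            requirement = line.strip()
            # skip blank lines and the editable-install flag
            if requirement and requirement != HYPHEN_E_DOT:
                requirement_list.append(requirement)
    return requirement_list
```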
**Folders & files**
- pipeline
`__init__.py`
- components
`__init__.py`
- data_access
`__init__.py`
- cloud_storage
`__init__.py`
- configuration
`__init__.py`
- constant
`__init__.py`
- sensor
`logger.py`
`exception.py`
- entity
`__init__.py`
`artifact_entity.py`
`config_entity.py`
- config folder (created)
- config > schema.yaml (inserted data)
* Artifact:
An artifact is a machine learning term for an output (a fully trained model, a model checkpoint, or a file) created by the training process (a minimal entity sketch follows this list).
- ml
`__init__.py`
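For intuition, an artifact entity can be a plain dataclass that only records where a pipeline step wrote its outputs; a hypothetical sketch (field names are illustrative, not necessarily the course's exact code):
```
# entity/artifact_entity.py (illustrative sketch)
from dataclasses import dataclass

@dataclass
class DataIngestionArtifact:
    """Output locations produced by the data ingestion step."""
    trained_file_path: str  # path of the train split written by ingestion
    test_file_path: str     # path of the test split written by ingestion
```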
2:36:00 - 2:43 [Files explanation]
* MongoDB Atlas 3:02:00
(better to check the course video for this part)
- Connection issue with the URL
- Setting it up manually in Atlas and connecting with MongoDB Compass
- Connection for MongoDB in the configuration file
- constant folder > create database.py
  -> DATABASE_NAME = "ineuron"
- configuration folder
  -> mongo_db_connection.py [code it with the mongodb+srv URL; a sketch follows below]
- requirements.txt
  `pymongo[srv]==4.2.0`
- create main.py
- run main.py (working)
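A minimal connection sketch, assuming the mongodb+srv URL is supplied through an environment variable and that certifi is installed for the TLS CA bundle (the env-var name and class name here are assumptions):
```
# configuration/mongo_db_connection.py (sketch)
import os

import certifi
import pymongo

DATABASE_NAME = "ineuron"
MONGODB_URL_KEY = "MONGO_DB_URL"  # assumed env-var name holding the mongodb+srv URL


class MongoDBClient:
    client = None  # shared across instances so the connection is created only once

    def __init__(self, database_name: str = DATABASE_NAME) -> None:
        if MongoDBClient.client is None:
            mongo_db_url = os.getenv(MONGODB_URL_KEY)
            if mongo_db_url is None:
                raise Exception(f"Environment variable {MONGODB_URL_KEY} is not set.")
            # tlsCAFile=certifi.where() avoids TLS certificate errors with Atlas
            MongoDBClient.client = pymongo.MongoClient(mongo_db_url, tlsCAFile=certifi.where())
        self.client = MongoDBClient.client
        self.database = self.client[database_name]
        self.database_name = database_name
```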
### Day - 08 [Code explanation part-1][Data Ingestion]
ml pipeline [explanation]
- 14:11 to 22:11 [high level flow]
- Exception code update
- main.py (run)
- debugging [34:00]
- logger code update (a logger sketch follows at the end of this block)
- list of logging formatters in Python (web)
- main.py code
- a logs file is created automatically after executing main.py
- constant > env_variable.py (code)
- constant > s3_bucket.py (code)
- constant > application.py (code)
- done with the constant folder
- constant>training_pipeline>__init__.py (code)
- config>schema.yaml(created)
- schema.yaml (updated)
- constant > training_pipeline > __init__.py (code)
- flowchart explanation
- 1:24:52 main.py (mongo error)
- entity > config_entity.py (code)
- class DataIngestionConfig:
- data ingestion updated
- run main.py
- ran successfully
- components>data_ingestion.py (created)
- pipeline>training_pipeline.py (created)
- commit changes.
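As referenced above, a minimal logger sketch: logging.basicConfig writing to a timestamped file under a logs/ directory, so a log file appears automatically on the first main.py run (file layout and format string are assumptions, not necessarily the course's exact code):
```
# sensor/logger.py (sketch)
import logging
import os
from datetime import datetime

# one log file per run, named by timestamp, under ./logs
LOG_FILE = f"{datetime.now().strftime('%m_%d_%Y_%H_%M_%S')}.log"
LOGS_DIR = os.path.join(os.getcwd(), "logs")
os.makedirs(LOGS_DIR, exist_ok=True)
LOG_FILE_PATH = os.path.join(LOGS_DIR, LOG_FILE)

logging.basicConfig(
    filename=LOG_FILE_PATH,
    # standard LogRecord attributes; see the "logging formatters" reference above
    format="[ %(asctime)s ] %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)
```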
[01:43:53]
- pipeline > training_pipeline.py (code updated)
- entity > artifact_entity.py (code updated)
- main.py (updated and run)
- components > data_ingestion.py (code)
- data_access > sensor_data.py (created & updated code)
- data_access > sensor_data.py explanation [2:08] (a sketch follows after this block)
- data_ingestion.py (code updated)
- pipeline > training_pipeline.py (code)
- components > data_ingestion.py (code)
- run main.py
- artifact folder created with sensor.csv
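A sketch of how data_access/sensor_data.py can export a MongoDB collection as a pandas DataFrame before it is written to the feature store; it reuses the MongoDBClient sketch from Day-07, and the import path, the `_id` drop, and the "na" replacement are assumptions:
```
# data_access/sensor_data.py (sketch)
from typing import Optional

import numpy as np
import pandas as pd

from sensor.configuration.mongo_db_connection import MongoDBClient  # assumed module path


class SensorData:
    def __init__(self) -> None:
        self.mongo_client = MongoDBClient()

    def export_collection_as_dataframe(
        self, collection_name: str, database_name: Optional[str] = None
    ) -> pd.DataFrame:
        """Read an entire collection from MongoDB and return it as a DataFrame."""
        db = self.mongo_client.database if database_name is None else self.mongo_client.client[database_name]
        df = pd.DataFrame(list(db[collection_name].find()))
        if "_id" in df.columns:
            df = df.drop(columns=["_id"])   # Mongo's internal id is not a feature
        return df.replace({"na": np.nan})   # sensor data marks missing values as "na"
```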
[2:21:17]
- Exported data to the feature store.
- train_test_split
- requirements.txt: sklearn (scikit-learn==1.1.3)
- data_ingestion.py
- def split_data_as_train_test (a sketch follows after this block)
- run main.py
- check logs
- commit changes
- Data Ingestion completed.
- updated the README file with the MongoDB URL.
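A sketch of the split step, assuming scikit-learn from requirements.txt; the 0.2 default ratio and random_state are assumptions rather than the course's exact values:
```
# components/data_ingestion.py -- split step (sketch)
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_as_train_test(dataframe: pd.DataFrame, split_ratio: float = 0.2):
    """Split the feature-store dataframe into train and test sets."""
    train_set, test_set = train_test_split(dataframe, test_size=split_ratio, random_state=42)
    return train_set, test_set
```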
### Day - 09 [Code explanation part-2]
- Data Ingestion revision
- Data Validation flowchart explanation
- constant folder > training_pipeline > __init__.py
- 57:57
- Data Validation
  -> data validation dir
  -> valid data dir
  -> invalid data dir
  -> valid train file path
  -> invalid train file path
  -> invalid test file path
  -> drift report file path
- constant > update
- entity > config_entity.py > data validation class update
- entity > artifact_entity.py > update
- components > data_validation.py (created)
- components > data_validation.py > update
- sensor > utils > main_utils.py > update
- utils > __init__.py
- requirements.txt > pyyaml
- demo.ipynb (created)
- components > data_validation.py(update)
- Evidently AI > library (data drift detection)
- TFX
- TensorFlow Data Validation (1:18)
- scipy.stats.ks_2samp (a sketch follows below)
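A minimal column-wise drift check with scipy.stats.ks_2samp; the 0.05 threshold, the report shape, and the assumption that both frames share the same numeric columns are mine, not necessarily the course's exact code:
```
# components/data_validation.py -- drift check (sketch)
import pandas as pd
from scipy.stats import ks_2samp


def detect_dataset_drift(base_df: pd.DataFrame, current_df: pd.DataFrame, threshold: float = 0.05) -> bool:
    """Compare each column of the current data against the base (train) data."""
    report = {}
    drift_found = False
    for column in base_df.columns:
        result = ks_2samp(base_df[column], current_df[column])
        drifted = bool(result.pvalue < threshold)  # small p-value: distributions differ
        drift_found = drift_found or drifted
        report[column] = {"p_value": float(result.pvalue), "drift_status": drifted}
    # `report` is what would be dumped to the drift report YAML file
    return drift_found
```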
> # flowchart
> ## Training pipeline
The train and test datasets must have the same distribution.
Base dataset: train
To compare with: test
If the distributions are the same: no drift
Solution: if the distributions are not the same, redo the train-test split correctly.
> ## Instance Prediction
It is not possible to detect data drift immediately.
Why? A single record cannot be summarized into a distribution.
Solution: save each request in a database,
then fetch all requests by hour or by day.
Example: with one day of collected data,
run data drift detection on the
train dataset vs. the collected dataset,
then check the data drift report:
if there is a huge difference,
go for retraining;
else,
keep checking data drift every day (a sketch follows below).
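A small self-contained sketch of that instance-prediction flow: batch one day's saved prediction requests and compare them column-wise against the training data (function name, threshold, and data shapes are assumptions):
```
# daily drift check on collected prediction requests (illustrative sketch)
from typing import Dict, List

import pandas as pd
from scipy.stats import ks_2samp


def drift_in_collected_requests(train_df: pd.DataFrame, requests: List[Dict], threshold: float = 0.05) -> bool:
    """requests: one day's prediction requests fetched back from the database."""
    collected_df = pd.DataFrame(requests)
    for column in train_df.columns:
        # small p-value: the day's data differs from the training distribution
        if ks_2samp(train_df[column], collected_df[column]).pvalue < threshold:
            return True   # huge difference -> go for retraining
    return False          # no drift -> keep checking every day
```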