# Kubeflow implementation: add Random Forest algorithm
## Table of contents
[TOC]
## Reference
:::warning
https://towardsdatascience.com/kubeflow-pipelines-how-to-build-your-first-kubeflow-pipeline-from-scratch-2424227f7e5
:::
https://github.com/FernandoLpz/Kubeflow_Pipelines
https://hub.docker.com/r/fernandolpz/only-tests/tags
https://hub.docker.com/repository/docker/lightnighttw/kubeflow/general
## Architecture
:::info
Each block represents a component, and each component is a container.
:::

# Random Forest Image
## Create a new folder named randomForest

## Dockerfile
```dockerfile=
FROM python:3.8-slim
WORKDIR /pipelines
COPY requirements.txt /pipelines
RUN pip install -r requirements.txt
COPY randomforest.py /pipelines
```
## random_forest.yaml
```yaml=
name: Random Forest classifier
description: Train a random forest classifier
inputs:
- {name: Data, type: LocalPath, description: 'Path where data is stored.'}
outputs:
- {name: Accuracy, type: Float, description: 'Accuracy metric'}
implementation:
  container:
    image: lightnighttw/kubeflow:random_forest_v4
    command: [
      python, randomforest.py,
      --data,
      {inputPath: Data},
      --accuracy,
      {outputPath: Accuracy},
    ]
```
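At run time, KFP replaces the `{inputPath: Data}` and `{outputPath: Accuracy}` placeholders in `command` with file paths it generates before starting the container. The sketch below illustrates that substitution in plain Python; the two paths are made up for illustration, not the ones KFP actually chooses.

```python
# Illustrative sketch of how the command in random_forest.yaml is resolved:
# placeholder dicts are swapped for file paths before the container runs.
command = [
    "python", "randomforest.py",
    "--data", {"inputPath": "Data"},
    "--accuracy", {"outputPath": "Accuracy"},
]

# Hypothetical paths standing in for the ones KFP would generate.
resolved_paths = {
    ("inputPath", "Data"): "/tmp/inputs/Data/data",
    ("outputPath", "Accuracy"): "/tmp/outputs/Accuracy/data",
}

def resolve(arg):
    # Replace a placeholder dict with its resolved path; pass plain strings through.
    if isinstance(arg, dict):
        (kind, name), = arg.items()
        return resolved_paths[(kind, name)]
    return arg

argv = [resolve(a) for a in command]
print(argv)
```

This is why `randomforest.py` only needs `--data` and `--accuracy` string arguments: both arrive as ordinary file paths.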
## randomforest.py
```python=
import json
import argparse
from pathlib import Path

from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

def _randomforest(args):
    # Open and read the "data" file
    with open(args.data) as data_file:
        data = json.load(data_file)

    # The expected data type is 'dict'; however, since the file
    # was loaded as a JSON object, it is first loaded as a string,
    # so we need to load it again from that string to get
    # the dict-type object.
    data = json.loads(data)

    x_train = data['x_train']
    y_train = data['y_train']
    x_test = data['x_test']
    y_test = data['y_test']

    # Initialize and train the model
    model = RandomForestClassifier(n_estimators=100, criterion='gini')
    model.fit(x_train, y_train)

    # Get predictions
    y_pred = model.predict(x_test)

    # Get accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Save the output into a file
    with open(args.accuracy, 'w') as accuracy_file:
        accuracy_file.write(str(accuracy))

if __name__ == '__main__':
    # Define and parse the command-line arguments
    parser = argparse.ArgumentParser(description='Random forest component')
    parser.add_argument('--data', type=str)
    parser.add_argument('--accuracy', type=str)
    args = parser.parse_args()

    # Create the directory where the output file will be written
    # (the directory may or may not already exist).
    Path(args.accuracy).parent.mkdir(parents=True, exist_ok=True)

    _randomforest(args)
```
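The `json.loads(json.load(...))` pair in the script only works if the upstream `download_data` component double-encodes the dict (first `json.dumps` to a string, then `json.dump` of that string), as in the reference tutorial. This small stdlib-only sketch writes a toy data file in that assumed format and reads it back the same way `randomforest.py` does:

```python
import json
import tempfile
from pathlib import Path

# Toy dataset in the shape randomforest.py expects; values are made up.
data = {
    'x_train': [[0], [1], [2], [3]],
    'y_train': [0, 0, 1, 1],
    'x_test': [[1], [2]],
    'y_test': [0, 1],
}

data_path = Path(tempfile.mkdtemp()) / 'data'
with open(data_path, 'w') as f:
    # Double-encode: dump the JSON string, not the dict itself.
    json.dump(json.dumps(data), f)

# Reading it back the way randomforest.py does recovers the dict.
with open(data_path) as f:
    loaded = json.loads(json.load(f))

assert loaded == data
```

If your own data component writes the dict directly with a single `json.dump`, drop the extra `json.loads` in `randomforest.py` accordingly.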
## requirements.txt
:::danger
Newer versions of pip reject the deprecated `sklearn` package name, so the requirement must be `scikit-learn` (i.e. `pip install scikit-learn`).
:::
```=
scikit-learn
```
## Upload to Docker Hub
[Install docker](https://www.docker.com)
In random_forest.yaml, the image location is defined as lightnighttw/kubeflow:random_forest_v4; replace it with your own registry, image name, and tag.
[Docker Hub](https://hub.docker.com/)
If you don't have a Docker Hub account, create one.
```shell=
# Download the above four files into a new folder on your computer.

# Log in
docker login -u <docker username>

# Build the image
docker build --platform=linux/amd64 -t <docker-registry-username>/<docker-image-name>:<tag_name> . -f Dockerfile

# Push to Docker Hub
docker push <docker-registry-username>/<docker-image-name>:<tag_name>
# e.g. docker push lightnighttw/kubeflow:random_forest_v4
```
If the build doesn't work, use my prebuilt [image](https://hub.docker.com/repository/docker/lightnighttw/kubeflow/general).
# Create a pipeline
## pipeline.py
:::info
Import all the information (yaml) of the previous components and compile it into a pipeline yaml file.
:::
```python=
import kfp
from kfp import dsl
from kfp.components import func_to_container_op

@func_to_container_op
def show_results(decision_tree: float, logistic_regression: float, random_forest: float) -> None:
    # Given the outputs from the decision tree, logistic regression
    # and random forest components, the results are shown.
    print(f"Decision tree (accuracy): {decision_tree}")
    print(f"Logistic regression (accuracy): {logistic_regression}")
    print(f"Random forest (accuracy): {random_forest}")

def add_resource_constraints(op: dsl.ContainerOp):
    return op.set_cpu_request("1").set_cpu_limit("2")

@dsl.pipeline(name='Three Pipeline', description='Applies Decision Tree, Random Forest and Logistic Regression to a classification problem.')
def first_pipeline():
    # Load the yaml manifest for each component
    download = kfp.components.load_component_from_file('download_data/download_data.yaml')
    decision_tree = kfp.components.load_component_from_file('decision_tree/decision_tree.yaml')
    logistic_regression = kfp.components.load_component_from_file('logistic_regression/logistic_regression.yaml')
    random_forest = kfp.components.load_component_from_file('randomForest/random_forest.yaml')

    # Run the download_data task
    download_task = add_resource_constraints(download())

    # Run the "decision_tree", "logistic_regression" and "random_forest"
    # tasks on the output generated by "download_task".
    decision_tree_task = add_resource_constraints(decision_tree(download_task.output))
    logistic_regression_task = add_resource_constraints(logistic_regression(download_task.output))
    random_forest_task = add_resource_constraints(random_forest(download_task.output))

    # Given the outputs from the three training tasks, the
    # "show_results" component is called to print the results.
    add_resource_constraints(show_results(decision_tree_task.output, logistic_regression_task.output, random_forest_task.output))

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(first_pipeline, 'three_pipelines.yaml')
    # kfp.Client().create_run_from_pipeline_func(first_pipeline, arguments={})
```
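The dependency structure of `first_pipeline` (one download task fanning out to three training tasks, whose outputs fan in to `show_results`) can be sketched locally without kfp. Function names and accuracy values below are stand-ins for illustration only:

```python
# Minimal local sketch of the pipeline's fan-out / fan-in structure.
def download():
    # Stands in for download_data's output.
    return "dataset"

def train(dataset, accuracy):
    # Stand-in for a training component: consumes the dataset,
    # returns a (made-up) accuracy.
    assert dataset == "dataset"
    return accuracy

def show_results(results):
    # Stand-in for the show_results component.
    for name, acc in results.items():
        print(f"{name} (accuracy): {acc}")

dataset = download()                       # 1 task
results = {                                # fans out to 3 parallel tasks
    "Decision tree": train(dataset, 0.92),
    "Logistic regression": train(dataset, 0.89),
    "Random forest": train(dataset, 0.95),
}
show_results(results)                      # fans back in to 1 task
```

In the real pipeline, KFP infers this same ordering from the data dependencies: each `*_task` consumes `download_task.output`, and `show_results` consumes all three task outputs.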
## Compile
Run `python pipeline.py`; on success, the compiler writes `three_pipelines.yaml` to the current directory.

## Upload pipeline
In the Kubeflow Pipelines UI, upload the generated `three_pipelines.yaml`. If the upload succeeds, the pipeline graph with all four components is displayed and can be run.
